Adversarial Domain Adaptation for Duplicate Question Detection

We address the problem of detecting duplicate questions in forums, which is an important step towards automating the process of answering new questions. As finding and annotating such potential duplicates manually is very tedious and costly, automatic methods based on machine learning are a viable alternative. However, many forums do not have annotated data, i.e., questions labeled by experts as duplicates, and thus a promising solution is to use domain adaptation from another forum that has such annotations. Here we focus on adversarial domain adaptation, deriving important findings about when it performs well and what properties of the domains are important in this regard. Our experiments with StackExchange data show an average improvement of 5.6% over the best baseline across multiple pairs of domains.


Introduction
Recent years have seen the rise of community question answering forums, which allow users to ask questions and to get answers in a collaborative fashion. One issue with such forums is that duplicate questions easily become ubiquitous as users often ask the same question, possibly in a slightly different formulation, making it difficult to find the best (or one correct) answer (Hoogeveen et al., 2018; Lai et al., 2018). Many forums allow users to signal such duplicates, but this can only be done after the duplicate question has already been posted and has possibly received some answers, which complicates merging the question threads. Discovering possible duplicates at the time of posting is much more valuable from the perspective of both (i) the forum, as it could prevent a duplicate from being posted, and (ii) the users, as they could get an answer immediately.
* Work conducted while the author was at QCRI.
Duplicate question detection is a special case of the more general problem of question-question similarity. The latter has been addressed using a variety of textual similarity measures, topic modeling (Cao et al., 2008; Zhang et al., 2014), and syntactic structure (Wang et al., 2009; Filice et al., 2016; Da San Martino et al., 2016; Barrón-Cedeño et al., 2016; Filice et al., 2017). Another approach is to use neural networks such as feed-forward, convolutional (dos Santos et al., 2015; Bonadiman et al., 2017), long short-term memory (Romeo et al., 2016), and more complex models (Lei et al., 2016; Nicosia and Moschitti, 2017; Uva et al., 2018; Joty et al., 2018; Zhang and Wu, 2018). Translation models have also been popular (Zhou et al., 2011; Jeon et al., 2005; Guzmán et al., 2016a,b).
The above work assumes labeled training data, which exists for question-question similarity, e.g., from SemEval-2016/2017 (Agirre et al., 2016), and for duplicate question detection, e.g., SemEval-2017 task 3 featured four StackExchange forums, Android, English, Gaming, and Wordpress, from CQADupStack (Hoogeveen et al., 2015, 2016). Yet, such annotation is not available for many other forums, e.g., the Apple community on StackExchange.
In this paper, we address this lack of annotation using adversarial domain adaptation (ADA) to effectively use labeled data from another forum. Our contributions can be summarized as follows:
• we are the first to apply adversarial domain adaptation to the problem of duplicate question detection across different domains;
• on the StackExchange family of forums, our model outperforms the best baseline with an average relative improvement of 5.6% (up to 14%) across all domain pairs;
• we study when transfer learning performs well and what properties of the domains are important in this regard; and
• we show that adversarial domain adaptation can be efficient even for unseen target domains, given some similarity of the target domain with the source one and with the regularizing adversarial domain.
Adversarial domain adaptation (ADA) was proposed by Ganin and Lempitsky (2015), and was then used for NLP tasks such as sentiment analysis and retrieval-based question answering (Chen et al., 2016; Ganin et al., 2016; Li et al., 2017; Liu et al., 2017; Yu et al., 2018), including cross-language adaptation (Joty et al., 2017) for question-question similarity.
The rest of this paper is organized as follows: Section 2 presents our model, its components, and the training procedure. Section 3 describes the datasets we used for our experiments, emphasizing their nature and diversity. Section 4 describes our adaptation experiments and discusses the results. Finally, Section 5 concludes with possible directions for future work.

Method
Our ADA model has three components: (i) question encoder, (ii) similarity function, and (iii) domain adaptation component, as shown in Figure 1.
The encoder $E$ maps a sequence of word tokens $x = (x_1, \ldots, x_n)$ to a dense vector $v = E(x)$. The similarity function $f$ takes two question vectors, $v_1$ and $v_2$, and predicts whether the corresponding questions are duplicates.
The domain classifier g takes a question vector v and predicts whether the question is from the source or from the target domain. We train the encoder not only to do well on the task for the source data, but also to fool the domain classifier, as shown in Algorithm 1. We describe the design choices considered for our domain adaptation model in the following two subsections.

Question Similarity Function
We consider two options for our similarity function $f(v_1, v_2)$: (i) a logistic function that computes the probability that two questions are similar/duplicates, which is trained with the cross-entropy loss:
$$\mathrm{sigmoid}\left(W^\top (v_1 \odot v_2) + b\right),$$
where $\odot$ is an element-wise vector product between unit encodings of questions; (ii) a simple cosine similarity function, i.e., $\mathrm{cosine}(v_1, v_2)$, trained using the pairwise hinge loss with a margin $m$:
$$\sum_i \max\left((1 - y^i)\, f(v_1^i, v_2^i) + m - y^i f(v_1^i, v_2^i),\; 0\right),$$
where $y^i \in \{0, 1\}$ indicates whether the $i$-th pair is a duplicate. Our experiments reported in Table 3 show that the cosine similarity function performs far better.
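As a rough illustration of option (ii), the sketch below scores question pairs with cosine similarity and sums a pairwise hinge term per pair. It assumes one common form of the hinge loss, in which duplicates (y = 1) are penalized when their score falls below the margin and non-duplicates (y = 0) when their score rises above minus the margin. The function names and the toy vectors are hypothetical and not taken from the paper.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def pairwise_hinge_loss(pairs, labels, margin):
    """Sum of hinge terms over labeled pairs (assumed form, see lead-in).

    y = 1: term is max(margin - score, 0)
    y = 0: term is max(score + margin, 0)
    """
    total = 0.0
    for (v1, v2), y in zip(pairs, labels):
        score = cosine(v1, v2)
        total += max((1 - y) * score + margin - y * score, 0.0)
    return total

# Hypothetical encodings: one duplicate pair, one non-duplicate pair.
pairs = [([1.0, 0.0], [1.0, 0.0]),   # identical encodings, labeled duplicate
         ([1.0, 0.0], [0.0, 1.0])]   # orthogonal encodings, labeled non-duplicate
labels = [1, 0]
loss = pairwise_hinge_loss(pairs, labels, margin=0.5)
```

In this toy case the duplicate pair already clears the margin and contributes nothing, while the orthogonal non-duplicate pair still incurs a penalty because its score of 0 is not below minus the margin.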

Domain Adaptation Components
The adversarial component is responsible for reducing the difference between the source and the target domain distributions.
We consider two approaches to this: classification-based and Wasserstein-based. The main difference between them is in the way the domain discrepancy loss is computed. In the classification-based approach, the adversarial component is a classifier trained to correctly predict the domain (source vs. target) of the input question. In contrast, the question encoder is optimized to confuse the domain classifier, which, as a result, encourages domain invariance. Arjovsky and Bottou (2017) showed that this adversarial optimization process resembles minimizing the Jensen-Shannon (JS) divergence between the source distribution $P_s$ and the target distribution $P_t$:
$$\mathrm{JS}(P_s, P_t) = \tfrac{1}{2}\,\mathrm{KL}(P_s \,\|\, P_m) + \tfrac{1}{2}\,\mathrm{KL}(P_t \,\|\, P_m),$$
where $P_m = (P_s + P_t)/2$ and KL is the Kullback-Leibler divergence.
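To make the Jensen-Shannon divergence concrete, here is a minimal sketch for discrete distributions. It assumes the standard definition with factors of one half and the mixture P_m = (P_s + P_t)/2. The function names and the toy distributions are illustrative only.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p_s, p_t):
    """Jensen-Shannon divergence with mixture P_m = (P_s + P_t) / 2."""
    p_m = [(a + b) / 2 for a, b in zip(p_s, p_t)]
    return 0.5 * kl(p_s, p_m) + 0.5 * kl(p_t, p_m)

same = js([0.5, 0.5], [0.5, 0.5])       # identical distributions
disjoint = js([1.0, 0.0], [0.0, 1.0])   # disjoint supports
```

Under this definition the divergence is 0 for identical distributions and saturates at log 2 for distributions with disjoint supports, so it carries little gradient signal once the two domains are easily separable.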
Algorithm 1: Training Procedure
Input: source data $X_s$; target data $X_t$
Hyper-parameters: learning rates $\alpha_1$, $\alpha_2$; batch size $m$; adversarial importance $\lambda$
Parameters to be trained: question encoder $\theta_e$, question similarity classifier $\theta_s$, and domain classifier $\theta_d$
Similarity classification loss $L_c$ is either the cross-entropy loss or the hinge loss, described in Section 2.1
Adversarial loss $L_d$, described in Section 2.2
repeat
    for each batch do
        Construct a sub-batch of similar and dissimilar question pairs from the annotated source data
        Calculate the classification loss $L_c$ using $\theta_e$ and $\theta_s$ for this sub-batch
        Construct a sub-batch of questions from the corpora of source and target domains
        Calculate the domain discrepancy loss $L_d$ using $\theta_e$ and $\theta_d$ for this sub-batch

In contrast, the Wasserstein method attempts to reduce the approximated Wasserstein distance (also known as Earth Mover's Distance) between the distributions for the source and for the target domain as follows:
$$W(P_s, P_t) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim P_s}\left[f(E(x))\right] - \mathbb{E}_{x \sim P_t}\left[f(E(x))\right],$$
where $f$ is a 1-Lipschitz continuous function realized by a neural network.
Arjovsky et al. (2017) have shown that the Wasserstein method yields more stable training for computer vision tasks.
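The Wasserstein objective can be illustrated with a deliberately simplified critic. The sketch below is an assumption-laden toy and not the paper's neural critic: it restricts the critic to linear functions f(v) = w · v with ||w||_2 ≤ 1, which are 1-Lipschitz, so the best achievable gap between the two expectations has a closed form, namely the Euclidean norm of the difference between the mean encodings. This gives only a lower bound on the true Wasserstein distance. All names and vectors are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def linear_critic_estimate(source_vecs, target_vecs):
    """Best value of mean f(source) - mean f(target) over linear critics
    f(v) = w . v with ||w||_2 <= 1; equals the norm of the mean gap."""
    mu_s = mean_vector(source_vecs)
    mu_t = mean_vector(target_vecs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))

source = [[0.0, 0.0], [2.0, 0.0]]   # hypothetical encoded source questions
target = [[4.0, 4.0], [4.0, 4.0]]   # hypothetical encoded target questions
gap = linear_critic_estimate(source, target)
```

An encoder trained against such a critic would be pushed to shrink this gap, i.e., to align the mean encodings of the two domains. A neural critic can detect richer differences than a mean shift.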

Training Procedure
Algorithm 1 describes the procedure to train the three components of our model. Adversarial training needs two kinds of training data: (i) annotated question pairs from the source domain, and (ii) unlabeled questions from the source and the target domains.
The question encoder is trained to perform well on the source domain using the similarity classification loss $L_c$. In order to enforce good performance on the target domain, the question encoder is simultaneously trained to be incapable of discriminating between questions from the source vs. the target domain. This is done through the domain classification loss $L_d$.
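One common way to realize this two-sided objective is a minimax update. The following is a sketch under the assumption that the two losses are combined into a single total loss weighted by the adversarial importance λ, with the learning rates α1 and α2 named in Algorithm 1. The exact update rules are an assumption for illustration and are not taken from the text above.

```latex
\begin{aligned}
L &= L_c - \lambda L_d \\
\theta_e &\leftarrow \theta_e - \alpha_1 \nabla_{\theta_e} L \\
\theta_s &\leftarrow \theta_s - \alpha_1 \nabla_{\theta_s} L \\
\theta_d &\leftarrow \theta_d + \alpha_2 \nabla_{\theta_d} L
\end{aligned}
```

Under this convention, the encoder and the similarity classifier descend on L, which lowers the task loss while raising the domain loss, whereas the domain classifier ascends on L, which is the same as descending on its own loss.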

Datasets
The datasets we use can be grouped as follows:
• StackExchange is a family of technical community support forums. We collect questions (each consisting of a title and a body) from the XML dumps of four forums: AskUbuntu, SuperUser, Apple, and Android. Some pairs of similar/duplicate questions in these forums are marked by community users.
• Sprint FAQ is a newly crawled dataset from the Sprint technical forum website. It contains a set of frequently asked questions and their paraphrases, i.e., three similar questions, paraphrased by annotators.
• Quora is a dataset of pairs of similar questions asked by people on the Quora website. They cover a broad set of topics touching upon philosophy, entertainment and politics.
Note that these datasets are quite heterogeneous: the StackExchange forums focus on specific technologies, and their questions are informal, with users tending to ramble on about their issues; the Sprint FAQ forum is technical, but its questions are concise and shorter; and the Quora forum covers many different topics, including non-technical ones.
Statistics about the datasets are shown in Table 1. Moreover, in order to quantify the differences and the similarities, we calculated the fraction of unigrams, bigrams and trigrams that are shared by pairs of domains. Table 2 shows statistics about the n-gram overlap between AskUbuntu or Quora as the source and all other domains as the target. As one might expect, there is a larger overlap within the StackExchange family.
Our question encoder is based on an LSTM (Hochreiter and Schmidhuber, 1997) and operates on 300-dimensional GloVe word embeddings (Pennington et al., 2014), which we train on the combined data from all domains. We keep word embeddings fixed in our experiments. For the adversarial component, we use a multi-layer perceptron.
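The n-gram overlap statistic reported in Table 2 can be sketched as follows. The exact definition is not spelled out above, so this example assumes one plausible reading: the fraction of distinct target-domain n-grams that also occur in the source domain, computed over lowercased, whitespace-tokenized questions. The function names and the toy questions are hypothetical.

```python
def ngrams(tokens, n):
    """Set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_fraction(source_questions, target_questions, n):
    """Fraction of distinct target n-grams that also occur in the source."""
    source_set, target_set = set(), set()
    for q in source_questions:
        source_set |= ngrams(q.lower().split(), n)
    for q in target_questions:
        target_set |= ngrams(q.lower().split(), n)
    if not target_set:
        return 0.0
    return len(source_set & target_set) / len(target_set)

# Toy questions for illustration only.
source = ["how do i install python", "how do i update the kernel"]
target = ["how do i install an app", "why is my battery draining"]
unigram_overlap = shared_fraction(source, target, 1)
bigram_overlap = shared_fraction(source, target, 2)
```

Other normalizations (for example, dividing by the size of the union, or weighting n-grams by frequency) are equally defensible and would give different absolute values.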
Evaluation Metrics As our datasets may contain some duplicate question pairs that were not discovered and thus not annotated, we end up having false negatives. Metrics such as MAP and MRR are not suitable in this situation. Instead, we use AUC (area under the curve) to evaluate how well the model ranks positive pairs vs. negative ones. AUC quantifies how well the true positive rate (tpr) grows at various false positive rates (fpr) by calculating the area under the curve from fpr = 0 to fpr = 1. We compute the area by integrating over the false positive rate (x-axis) from 0 up to a threshold t, and we normalize the area to [0,1]. This score is known as AUC(t). It is more stable than MRR and MAP in our case, where there could be several false negatives.
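A minimal sketch of AUC(t) as described above: build the ROC curve from scored pairs, integrate it with the trapezoidal rule for false positive rates between 0 and t, and divide by t so that a perfect ranking scores 1. The normalization by t is an assumption about how the area is mapped to [0,1]. The function name and the toy scores are hypothetical.

```python
def auc_at(scores, labels, t):
    """Area under the ROC curve for fpr in [0, t], divided by t.

    labels are 0/1 integers; both classes must be present and 0 < t <= 1.
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    # Build ROC points, emitting one point per distinct score value.
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        is_last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_tie:
            points.append((fp / num_neg, tp / num_pos))
    # Trapezoidal integration, clipping the final segment at fpr = t.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= t:
            break
        if x1 > t:
            y1 = y0 + (y1 - y0) * (t - x0) / (x1 - x0)
            x1 = t
        area += (x1 - x0) * (y0 + y1) / 2
    return area / t

scores = [0.9, 0.8, 0.7, 0.6]   # hypothetical similarity scores
labels = [1, 0, 1, 0]           # 1 = annotated duplicate
full_auc = auc_at(scores, labels, 1.0)
partial_auc = auc_at(scores, labels, 0.5)
```

Restricting the integral to small t focuses the metric on the top of the ranking, where a missed annotation affects the curve less than it would affect rank-based metrics such as MAP or MRR.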

Choosing the Model Components
Model Selection We select the best components for our domain adaptation model via experimentation on the AskUbuntu-Android domain pair. Then, we apply the model with the best-performing components across all domain pairs.
Hyperparameters We fine-tune the hyperparameters of all models on the development set for the target domain.
Table 3 shows AUC(0.05) and AUC(0.1) for different models of question similarity, training on AskUbuntu and testing on Android. The first row shows that using cosine similarity with a hinge loss yields much better results than using a cross-entropy loss. This is likely because (i) there are some duplicate question pairs that were not tagged as such and that have come up as negative pairs in our training set, and the hinge loss deals with such outliers better; and (ii) the cosine similarity is domain-invariant, while the weights of the feed-forward network of the softmax layers capture source-domain features.

Domain Adaptation Component
We can see that the Wasserstein and the classification-based methods perform very similarly, after proper hyper-parameter tuning. However, Wasserstein yields better stability, achieving an AUC variance 17 times lower than the one for classification across hyper-parameter settings. Thus, we chose it for all experiments in the following subsections.

When Does Adaptation Work Well?
Tables 4 and 5 study the impact of domain adaptation when applied to various source-target domain pairs, using the Direct approach, BM25, and our adaptation model. We can make the following observations:
• For almost all source-target domain pairs from the StackExchange family, domain adaptation improves over both baselines, with an average relative improvement of 5.6%. This improvement goes up to 14% for the AskUbuntu-Android source-target domain pair.
• Domain adaptation on the Sprint dataset performs better than direct transfer, but it is still worse than BM25.
• Domain adaptation from Quora performs the worst, with almost no improvement over direct transfer, which is far behind BM25.
• The more similar the source and the target domains, the better our adaptation model performs. Table 2 shows that AskUbuntu has high similarity to other StackExchange domains, lower similarity to Sprint, and even lower similarity to Quora. The Pearson coefficient (Myers et al., 2010) between the n-gram fractions and the domain adaptation effectiveness for unigrams, bigrams and trigrams is 0.57, 0.80 and 0.94, respectively, which corresponds to moderate-to-strong positive correlation. This gives insight into how simple statistics can predict the overall effectiveness of domain adaptation.
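The correlation analysis in the last observation can be reproduced with a few lines of code. The sketch below computes the Pearson coefficient between an overlap statistic and an adaptation gain per target domain. The numbers are hypothetical placeholders for illustration and are not the values from Tables 2, 4, or 5.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only, not the paper's numbers.
overlap = [0.10, 0.20, 0.30, 0.40]   # n-gram overlap per target domain
gain = [1.0, 2.0, 2.5, 4.0]          # adaptation gain per target domain
r = pearson(overlap, gain)
```

With only a handful of domain pairs, such a coefficient is best read as a rough indicator rather than a statistically robust estimate.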

Adapting to Unseen Domains
We also experiment with domain adaptation to a target domain that was not seen during training, not even during adversarial training. We do so by training to adapt to a pivot domain different from the target. Table 6 shows that this yields better AUC compared to direct transfer when using Apple and Android as the pivot/target domains. We hypothesize that this is due to Apple and Android being closely related technical forums for iOS and Android devices. This sheds some light on the generality of adversarial regularization.

Conclusion and Future Work
We have applied and analyzed adversarial methods for domain transfer for the task of duplicate question detection; to the best of our knowledge, this is the first such work. Our experiments suggest that (i) adversarial adaptation is rather effective between domains that are similar, and (ii) the effectiveness of adaptation is positively correlated with the n-gram similarity between the domains. In future work, we plan to develop better methods for adversarial adaptation based on these observations. One idea is to try source-pivot-target transfer, similarly to the way this is done for machine translation (Wu and Wang, 2007). Another promising direction is to have an attention mechanism (Luong et al., 2015)