Reversing Gradients in Adversarial Domain Adaptation for Question Deduplication and Textual Entailment Tasks

Adversarial domain adaptation has been recently proposed as an effective technique for textual matching tasks, such as question deduplication. Here we investigate the use of gradient reversal in adversarial domain adaptation to explicitly learn both shared and unshared (domain specific) representations between two textual domains. In doing so, gradient reversal learns features that explicitly compensate for domain mismatch, while still distilling domain specific knowledge that can improve target domain accuracy. We evaluate reversing gradients for adversarial adaptation on multiple domains, and demonstrate that it significantly outperforms other methods on question deduplication as well as on recognizing textual entailment (RTE) tasks, achieving up to 7% absolute boost in base model accuracy on some datasets.


Introduction
Domain adaptation is a flexible machine learning approach that allows the transfer of category independent information between domains. Through domain adaptation we can leverage source task representations to bring the source and target distributions closer in a learned joint feature space. In this paper we focus only on semi-supervised domain adaptation, where knowledge from a large labeled dataset in a source domain can be partially transferred to help improve the same task on a target domain, which typically has a significantly smaller number of labels. In particular, this paper focuses on domain adaptation for the detection of question duplicates in community question answering forums (Shah et al., 2018; Hoogeveen et al., 2015), as well as for RTE tasks (Dagan et al., 2005; Zhao et al., 2017).
Generally speaking, the effectiveness of domain adaptation depends essentially on two factors: the similarity between the source and target domains, and the representation strategy used to transfer source domain knowledge. Long et al. showed that transferring features across domains becomes increasingly difficult as domain discrepancy increases (Long et al., 2017), since the features learned by models gradually transition from general to highly domain specific as training progresses. Recent domain adaptation strategies attempt to counter this issue by making certain features invariant across source and target domains using distribution matching (Cao et al., 2018) or by minimizing distance metrics between the representations (Sohn et al., 2019).
The idea of generating domain invariant features was further enhanced by the use of adversarial learning methods. Recent work has advocated tuning networks with loss functions that reduce the mismatch between source and target data distributions (Sankaranarayanan et al., 2018; Tzeng et al., 2017). Others have proposed a domain discriminator that maximizes the domain classification loss between source and target domains (Cohen et al., 2018; Shah et al., 2018). One particular limitation of these approaches is that they are restricted to using only the shared domain invariant features and hence cannot benefit from target domain specific information. Small amounts of labeled target domain data could in principle be used to fine-tune the learned shared representations and improve the target task; however, this could also lead to overfitting (Sener et al., 2016).
To address this issue, Qiu et al. used both shared domain invariant and domain specific features: while the shared features are learned by maximizing the domain discriminator loss, the domain specific features are learned by jointly minimizing the task loss and the domain classification loss of domain specific discriminators (Qiu et al., 2018). Similar ideas were put forth by Peng et al. for cross-domain sentiment classification, where they demonstrate the effectiveness of using both domain specific and domain invariant features (Peng et al., 2018). Moreover, Bousmalis et al. have made similar observations in domain adaptation for image classification and related vision tasks (Bousmalis et al., 2016). All these studies follow a similar approach of learning a shared feature space by maximizing the domain classification loss.
In contrast, our work enhances the ideas of Qiu et al. by utilizing a Gradient Reversal Layer (GRL) (Ganin and Lempitsky, 2015) to train the domain discriminator in a minimax game, and shows that this results in significantly better transfer performance on multiple target domains. The use of a gradient reversal layer is further advocated by the works of Elazar et al. (Elazar and Goldberg, 2018) and Fu et al. (Fu et al., 2017), for the removal of demographic attributes from text and for relation extraction from text, respectively. To the best of our knowledge, the use of gradient reversal in textual matching tasks, such as question deduplication and RTE, is novel and may trigger further applications of this approach in other language tasks.
To summarize our contributions: (1) we propose a novel approach for adversarial domain adaptation that uses gradient reversal layers to discover shared representations between source and target domains on textual matching tasks, and elegantly combines domain specific and shared domain invariant features; (2) we apply it to question deduplication tasks and empirically confirm that it outperforms all other strong baselines and feature sets on five different domains, with absolute accuracy gains of up to 4.5%; (3) we further apply the same approach to two different textual entailment domains, where it again outperforms other baselines by as much as 7% absolute accuracy.

Base Model: BiMPM
Wang et al. (Wang et al., 2017) proposed the Bilateral Multi-Perspective Matching (BiMPM) model for many language tasks, including question duplicate detection and RTE. This model takes the two candidate sentences as inputs to a Bi-LSTM layer that generates hidden representations for both of them. These representations are passed on to a multi-perspective matching block that uses four different matching mechanisms (full matching, max-pooling matching, attentive matching and max attentive matching) to generate matched representations of all words of both sentences. This matching takes place in both directions, i.e. if P and Q are the two input sentences, then representations for all words of P are computed by matching with words of Q, and the same is done for all words of Q by matching with all words of P. These representations are then fed into an aggregation layer followed by fully connected layers for classification. In our experiments, we modified this architecture by replacing the aggregation LSTM in the aggregation layer with an aggregating attention layer, and replacing the subsequent fully connected layers with a bilinear layer.
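As an illustration, the per-word matching operation can be sketched as a multi-perspective cosine similarity, in which each perspective reweights the hidden vectors element-wise before comparison. This is a minimal NumPy sketch; the function name, shapes and the small stabilizing constant are our assumptions, not details of the original BiMPM code.

```python
import numpy as np

def multi_perspective_cosine(v1, v2, W):
    """Compare two hidden vectors under several matching perspectives.

    v1, v2 : (d,) hidden vectors (e.g. Bi-LSTM states of words in P and Q)
    W      : (k, d) trainable perspective weights; perspective i rescales
             each dimension of the vectors by W[i] before cosine similarity.
    Returns a (k,) vector with one matching score per perspective.
    """
    a = W * v1                       # (k, d) reweighted copies of v1
    b = W * v2                       # (k, d) reweighted copies of v2
    dots = (a * b).sum(axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / (norms + 1e-8)     # cosine similarity per perspective
```

With W set to all ones, every perspective reduces to the ordinary cosine similarity between v1 and v2; training W lets each perspective attend to different dimensions of the representations.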

Adversarial Domain Adaptation Methods
The overall architecture used for prediction makes use of both shared and domain specific features. The shared features are learned in an adversarial fashion wherein the feature layer to be shared sends its output to a domain discriminator. For our experiments, we plug this domain discriminator in at the base of the model, right after the Bi-LSTM layer. This ensures that the layers following the Bi-LSTM are trained only for the duplicate classification task, and use domain invariant features generated by the Bi-LSTM. Our work uses two domain discriminators: a shared domain discriminator with a gradient reversal layer (explained below), which is used to train the shared Embedding and Bi-LSTM layers to generate domain invariant features, and an unshared domain discriminator, which is used to train all the domain specific Embedding and Bi-LSTM layers to generate highly domain specific features. These discriminators consist of an aggregation layer (attention mechanism), followed by a fully connected layer for domain classification (see Figures 1(a) and 1(b)).
The shared domain discriminator uses a Gradient Reversal Layer (GRL) (see Figure 1(a)) that acts as an identity transform in the forward pass through the network. During the backward pass, however, this layer multiplies the incoming gradient by a negative factor −λ, which reverses the gradient direction. This layer allows the domain discriminator to be trained in a minimax game fashion: the domain classification layer tries to minimize the domain classification loss, thus trying to get better at this task, while the feature extraction layers (layers before the GRL) act as adversaries by trying to make the task harder for the domain classification layer. This drives the feature extraction layers to be as uninformative as possible for domain classification, thus bringing the feature maps of both domains closer. As a result, the desired feature layers should generate shared feature representations that are almost indistinguishable by the domain classification layer. The shared features obtained from the shared Bi-LSTM should also be more effective to transfer than the ones obtained by simply maximizing the domain classification loss throughout the domain discriminator and base model layers.
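A minimal sketch of the GRL's behavior, written without an autograd framework so that the forward/backward contract is explicit (the class and method names are ours, not from the paper's implementation):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; flips and scales gradients in the
    backward pass, so layers upstream of it receive -lambda * gradient."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Acts as an identity transform: features pass through unchanged.
        return x

    def backward(self, grad_output):
        # Reverse the gradient direction, scaled by lambda.
        return -self.lam * grad_output
```

When the discriminator's classification loss flows back through this layer, the shared Embedding and Bi-LSTM layers are effectively updated to increase the domain classification loss, which is the minimax dynamic described above.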
The domain specific features are learned using an unshared domain discriminator that is identical to the domain discriminator used for shared features, except that the GRL is replaced by an identity transform layer (see Figure 1(b)). This layer, however, multiplies the incoming gradient by a positive factor +λ to maintain uniformity in gradient magnitudes with the shared domain discriminator. This domain discriminator tries to minimize the domain classification loss, as do the preceding layers, and thus the desired feature layer learns to generate highly domain specific feature representations.
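A minimal sketch of this identity layer (names and interface are our assumptions): it passes features through unchanged and merely rescales gradients by +λ, so all layers below it cooperate in minimizing the domain classification loss.

```python
import numpy as np

class GradientScale:
    """Identity in the forward pass; multiplies incoming gradients by
    +lambda in the backward pass, matching the GRL's gradient magnitude
    but keeping the sign, so upstream layers cooperate with the domain
    classifier instead of opposing it."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # identity: domain specific features pass through unchanged

    def backward(self, grad_output):
        return self.lam * grad_output  # same magnitude as GRL, positive sign
```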
A block diagram of the proposed adversarial learning framework for domain adaptation is shown in Figure 3.

Model Architecture
The training data consists of sentence pairs (Q_S) from source domain S and sentence pairs (Q_T) from target domain T. Figures 1 and 2 show the overall architecture of the model. The initial layers of the network (Embedding, Bi-LSTM and the multi-perspective match block) are of two kinds: shared and domain specific. Shared layers are used in the network for sentences of all domain types, whereas the domain specific layers operate only on sentences of the corresponding domains. The Embedding layers can be appropriately initialized and trained end-to-end along with the rest of the network. Each domain also has domain specific aggregation and classification (fully connected) layers. The aggregation layer takes the domain specific and shared features as inputs (Figure 2), aggregates them and concatenates the aggregated vectors to form a combined representation.
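The combination step can be sketched as a simple concatenation of the two aggregated vectors (the function name and vector sizes below are illustrative assumptions):

```python
import numpy as np

def combine_features(shared_agg, specific_agg):
    """Concatenate the aggregated shared and domain specific vectors into
    the combined representation fed to the domain's classification layers."""
    return np.concatenate([shared_agg, specific_agg])
```

Concatenation keeps both feature sets intact, leaving it to the classification layers to weight domain invariant against domain specific evidence.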
This combined feature vector is passed to the classification layers for task classification.

Model Training
The forward propagation through the model involves 5 passes, which are listed below:
• Pass 1 (Figure 1(a)): Q_S and Q_T through the shared layers and the shared domain discriminator (loss L_1).
• Pass 2 (Figure 1(b)): Q_S through the domain specific layers and the unshared domain discriminator (loss L_2).
• Pass 3 (Figure 1(b)): Q_T through the domain specific layers and the unshared domain discriminator (loss L_3).
• Pass 4 (Figure 2): Q_S through the domain specific and shared layers for task classification (loss L_4).
• Pass 5 (Figure 2): Q_T through the domain specific and shared layers for task classification (loss L_5).
The source domain layers are trained by minimizing L_S (Equation 1). The target domain layers are trained by minimizing L_T (Equation 2). The shared Embedding, Bi-LSTM and aggregation layers are learned by minimizing L_Sh (Equation 3), while the fully connected layer of the shared domain discriminator minimizes L_1. Note that not all domain specific layers contribute to losses L_2 and L_3, and thus the gradient due to these losses affects only the Embedding and Bi-LSTM layers for all domains. We trained all the models and tuned all the hyperparameters to optimize the validation set performance on target domain data.
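A hypothetical sketch of one training step over the five passes; the `losses` callables stand in for the actual forward passes, and the unweighted sum is our simplifying assumption rather than the paper's exact loss combination:

```python
def train_step(losses):
    """One training step over the five forward passes.

    losses: dict of zero-argument callables, one per pass, each returning a
    scalar loss. The adversarial sign flip for L1 happens inside the shared
    discriminator's gradient reversal layer, so every term here is simply
    minimized.
    """
    L1 = losses["shared_disc"]()      # pass 1: Q_S and Q_T, shared discriminator
    L2 = losses["unshared_disc_S"]()  # pass 2: Q_S, unshared discriminator
    L3 = losses["unshared_disc_T"]()  # pass 3: Q_T, unshared discriminator
    L4 = losses["task_S"]()           # pass 4: Q_S task classification
    L5 = losses["task_T"]()           # pass 5: Q_T task classification
    return L1 + L2 + L3 + L4 + L5     # total loss to backpropagate
```

In a real implementation each callable would run the corresponding data flow of Figures 1 and 2 and the returned total would be backpropagated through the respective layers.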

Datasets
For question duplicate detection, we use the Quora question pairs dataset (Quora, 2017) as the source domain dataset and 5 datasets from a diverse set of domains as our target domains. The Android, Mathematica, Programmers and Unix question datasets were taken from the Stack Exchange dataset (StackExchange, 2018). We obtained the Tax Domain Qs from a popular forum for tax related question answers, which we plan to make public shortly. For RTE, the Stanford Natural Language Inference (SNLI) dataset (SNLI, 2015) was used as the source domain, and for target domains we used The Guardian Headlines RTE (RTE, 2012) and SICK (SICK, 2014) datasets.
The sizes of all these datasets are given in Table 1 in (train/validation/test) format.

Results
In Table 1 we compare the base model BiMPM (base), trained only on the target domains, to three variants of the same model, each obtained with a different approach to adversarial domain adaptation. Model T1 was trained using both the shared and domain specific features, but maximizing the domain classification loss to learn the shared features. Model T2 used only the shared features learned using the gradient reversal strategy, along with fine-tuned features obtained from later layers of the network. Model T3 used both the domain specific features and the shared features learned using the gradient reversal method. The accuracy of these models on five different question deduplication and two RTE target domains is reported in Table 1. Comparisons of accuracy numbers between different rows are fairly consistent across all domains, enabling us to draw the following empirical claims: T1, T2 and T3 outperform the baseline, reinforcing the effectiveness of adversarial domain adaptation on all tasks in Table 1.
T3 outperforms T2, indicating that learning a combination of domain specific and shared representations is quite beneficial for all domain transfer experiments in Table 1. This observation was also made by Qiu et al. (Qiu et al., 2018), albeit without the use of gradient reversal.
Both T2 and T3 outperform T1, providing strong evidence that GRL significantly improves overall feature learning compared to maximizing the domain classification loss. In particular, the comparison between T3 and T1 shows that learning exactly the same feature set using GRL for adversarial domain adaptation is more effective than maximizing the loss.
T3 outperforms all other models, showing that our proposed approach consistently beats all other settings for domain adaptation in both question deduplication and RTE tasks.

Discussion and Conclusion
We systematically evaluated different adversarial domain adaptation techniques for duplicate question detection and RTE tasks. Our experiments showed that adversarial domain adaptation using gradient reversal yields the best knowledge transfer between all textual domains in Table 1. This method outperformed existing domain adaptation techniques, including the recently proposed adversarial domain adaptation method of maximizing the domain classification loss of a discriminator. Furthermore, we showed that models that use both domain specific features and shared features outperform models that use only one of these feature types.

Figure 1 :
Figure 1: (a) Architecture for data flow of pass 1, (b) Architecture for data flow of passes 2 and 3

Figure 2 :
Figure 2: Architecture for data flow of passes 4 and 5

Figure 3 :
Figure 3: Adversarial Learning Framework for Domain Adaptation