Wasserstein Distance Regularized Sequence Representation for Text Matching in Asymmetrical Domains

One approach to matching texts from asymmetrical domains is projecting the input sequences into a common semantic space as feature vectors, upon which the matching function can be readily defined and learned. In real-world matching practices, it is often observed that, as training goes on, the feature vectors projected from different domains tend to become indistinguishable. This phenomenon, however, is often overlooked in existing matching models. As a result, the feature vectors are constructed without any regularization, which inevitably increases the difficulty of learning the downstream matching functions. In this paper, we propose a novel matching method tailored for text matching in asymmetrical domains, called WD-Match. In WD-Match, a Wasserstein distance-based regularizer is defined to regularize the feature vectors projected from different domains. As a result, the method enforces the feature projection function to generate vectors such that those corresponding to different domains cannot be easily discriminated. The training process of WD-Match amounts to a game that minimizes the matching loss regularized by the Wasserstein distance. WD-Match can be used to improve different text matching methods, by using a method as its underlying matching model. Four popular text matching methods have been exploited in the paper. Experimental results based on four publicly available benchmarks showed that WD-Match consistently outperformed the underlying methods and the baselines.


Introduction
Asymmetrical text matching, which predicts the relationship (e.g., category, similarity) of two text sequences from different domains, is a fundamental problem in both information retrieval (IR) and natural language processing (NLP). For example, in natural language inference (NLI), text matching has been used to determine whether a hypothesis is an entailment of, a contradiction of, or neutral with respect to a given premise (Bowman et al., 2015). In question answering (QA), text matching has been used to determine whether an answer can answer the given question (Wang et al., 2007; Yang et al., 2015). In IR, text matching has been widely used to measure the relevance of a document to a query (Li and Xu, 2014; Xu et al., 2020).
One approach to asymmetrical text matching is projecting the text sequences from different domains into a common latent space as feature vectors. Since these feature vectors have identical dimensions and lie in the same space, matching functions can be readily defined and learned. This type of approach includes a number of popular methods, such as DSSM (Huang et al., 2013), DecAtt (Parikh et al., 2016), CAFE (Tay et al., 2018a), and RE2 (Yang et al., 2019). In real-world matching practices, it is often observed that the learning of matching models is a process of moving the projected feature vectors together in the semantic space. For example, Figure 1 shows the distribution of the feature vectors generated by RE2. During the training of RE2 (Yang et al., 2019) on the SciTail dataset (Khot et al., 2018), we observed that at the early stage of training, the feature vectors corresponding to different domains are often separately distributed (according to the visualization by t-SNE (Maaten and Hinton, 2008)) (Figure 1(a)). As training went on, these separated feature vectors gradually moved closer and finally mixed together (Figure 1(b) and (c)).
The phenomenon can be explained as follows. Given two text sequences from two asymmetrical domains (e.g., NLI), the first sequence (e.g., premise) and the second sequence (e.g., hypothesis) are heterogeneous, and there exists a lexical gap between them that needs to be bridged (Tay et al., 2018c), similar to the case of learning cross-modal matching models (Wang et al., 2017a). Existing studies (Wang et al., 2017a; Kamath et al., 2019) have shown that it is critical for the projection network to generate domain- or modal-invariant features. That is, the global distributions of the feature vectors should be similar in a common subspace such that their origins cannot be discriminated. The phenomenon is not unique, but recurs in experiments based on other matching models and other datasets.
Existing text matching models, however, still lack constraints or regularizations to ensure that the projected vectors are well distributed for matching. One natural question is: can we design a mechanism that explicitly guides the mixing of the feature vectors and better distributes them? To answer this question, this paper presents a novel learning-to-match method, called WD-Match, in which the Wasserstein distance (between the two distributions respectively corresponding to the two asymmetrical domains) is introduced as a regularizer. WD-Match consists of three components: (1) a feature projection component, which jointly projects each pair of text sequences into a latent semantic space as a pair of feature vectors; (2) a regularizer component, which estimates the Wasserstein distance with a feed-forward neural network on the basis of the projected features; (3) a matching component, which conducts the matching, also on the same set of projected features.
The training of WD-Match amounts to repeated interplay between two branches under the adversarial learning framework: a regularizer branch that learns a neural network for estimating the Wasserstein distance via its dual form, and a matching branch that minimizes a Wasserstein distance-regularized matching loss. In this way, the minimization of the loss function leads to a learning method that not only minimizes the matching loss, but also distributes the feature vectors well in the semantic space for better matching.
To summarize, this paper makes the following main contributions:

• We highlight the critical importance of the global distribution of the projected feature vectors in matching texts between asymmetrical domains, which has not yet been seriously studied in existing models.
• We propose a new learning to match method under the adversarial framework, in which the text matching model is learned by minimizing a Wasserstein distance-regularized matching loss.
• We conducted empirical studies on four large-scale benchmarks, and demonstrated that WD-Match achieved better performance than the baselines and the underlying models. Extensive analysis showed the effects of the Wasserstein distance-based regularizer in terms of guiding the distributions of feature vectors and improving the matching accuracy.
The source code of WD-Match is available at https://github.com/RUC-WSM/WD-Match

Related Work
In this section, we first review the sequence representation used in text matching, then introduce the Wasserstein distance and its applications.

Sequence Representation in Text Matching
Sequence representation lies at the core of text matching (Xu et al., 2020). Early works inspired by the Siamese architecture assign respective neural networks to encode the two input sequences into high-level representations. For example, DSSM (Huang et al., 2013) is one of the classic representation-based approaches to text matching, which uses feed-forward neural networks to project a text sequence. CDSSM (Shen et al., 2014), ARC-I (Hu et al., 2014), and CNTN (Qiu and Huang, 2015) change the sequence encoder to a convolutional neural network that shares parameters in a fixed-size sliding window. To further capture the long-term dependencies of a text sequence, a group of recurrent neural network-based methods were proposed, including RNN-LSTM (Palangi et al., 2016) and MV-LSTM (Wan et al., 2015). Recently, with the help of the attention mechanism (Parikh et al., 2016), the sequence representation is obtained by aligning the sequence with itself and with the other sequence of the input pair. For example, CSRAN (Tay et al., 2018b) performs multi-level attention refinement with dense connections among multiple levels. DRCN (Kim et al., 2019) stacks encoding layers and attention layers, then concatenates all previously aligned results. RE2 (Yang et al., 2019) introduces a consecutive architecture based on augmented residual connections between convolutional layers and attention layers. These models yield strong performance on several benchmarks.

Wasserstein Distance
Wasserstein distance (Chen et al., 2018) is a metric based on the theory of optimal transport. It gives a natural measure of the distance between two probability distributions.
Wasserstein distance has been successfully used in the Generative Adversarial Network (GAN) (Goodfellow et al., 2014) framework of deep learning. Arjovsky et al. (2017) propose WGAN, which uses the Wasserstein-1 metric to improve the original GAN framework and alleviate its vanishing gradient and mode collapse issues. The Wasserstein distance has also been explored for learning domain-invariant features in domain adaptation tasks. For example, Chen et al. (2018) propose to minimize the Wasserstein distance between the feature distributions of the source and target domains, yielding better performance and smoother training than the standard training method with a Gradient Reversal Layer (Ganin et al., 2016). Shen et al. (2017b) propose to learn domain-invariant features with the guidance of the Wasserstein distance.
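As a small aside, on the real line the Wasserstein-1 distance between two equal-size empirical samples has a closed form: sort both samples and average the absolute differences. A minimal sketch (plain Python; our own illustration, not code from any of the cited works):

```python
def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For empirical distributions on the real line, the optimal transport
    plan matches the i-th smallest point of u to the i-th smallest point
    of v, so the distance is the mean absolute difference of the sorted
    samples.
    """
    assert len(u) == len(v), "this shortcut assumes equal sample sizes"
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)

# Shifting every point by 2.0 costs exactly 2.0 units of transport.
print(wasserstein_1d([0.0, 1.0, 2.0], [2.0, 3.0, 4.0]))  # -> 2.0
```

This is the quantity that WGAN-style training estimates in high dimensions via the dual form, where no such closed form exists.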
Inspired by its success in these various applications, this paper introduces the Wasserstein distance to text matching in asymmetrical domains, as a regularizer to improve the sequence representations.

Our Approach: WD-Match
In this section, we describe our proposed method WD-Match.

Model Architecture
Suppose that we are given a collection of N instances of sequence-sequence-label triples D = {(X_i, Y_i, z_i)}_{i=1}^N, where X_i ∈ X, Y_i ∈ Y, and z_i ∈ Z respectively denote the first sequence, the second sequence, and the label indicating the relationship between X_i and Y_i. As shown in Figure 2, WD-Match consists of three components.

The feature projection component: Given a sequence pair (X, Y), it is first processed by the feature projection component F, which outputs a pair of K-dimensional feature vectors h_X, h_Y in the semantic space. We suppose that F is a neural network with a set of parameters θ_F, and all the parameters in θ_F are shared between X and Y.
The matching component: The output vectors from the feature projection component are then fed to the matching component M, which outputs the predicted label ẑ. We suppose that M is a neural network with a set of parameters θ_M.
The regularizer component: Given the two sets of projected feature vectors h_X and h_Y, the regularizer component estimates the Wasserstein distance between P^X_F and P^Y_F, where P^X_F and P^Y_F denote the two distributions defined over the two groups of feature vectors h_X and h_Y, respectively.
[Figure 2: The architecture of WD-Match. Word embeddings of the sequence pair are fed to the feature projection component; the projected feature vectors are fed to the matching component, which predicts the label, and to the regularizer component.]

Here '(X, Y) ∼ X × Y' means that the pairs (X, Y) are sampled from the joint space X × Y. Specifically, the Wasserstein distance between the two probability distributions P^X_F and P^Y_F is defined as:

W(P^X_F, P^Y_F) = inf_{γ ∈ J(P^X_F, P^Y_F)} E_{(h_X, h_Y) ∼ γ}[ ||h_X − h_Y|| ],    (1)

where J(P^X_F, P^Y_F) denotes the set of all joint distributions γ(h_X, h_Y) whose marginal distributions are P^X_F and P^Y_F. It can be shown that W has the dual form (Villani, 2003):

W(P^X_F, P^Y_F) = sup_{|G|_L ≤ 1} E_{h_X ∼ P^X_F}[G(h_X)] − E_{h_Y ∼ P^Y_F}[G(h_Y)],    (2)

where '|G|_L ≤ 1' denotes that the 'sup' is taken over the set of all 1-Lipschitz functions G, and the function G : R^K → R maps each K-dimensional feature vector in the semantic space to a real number. In this paper, G is set as a two-layer feed-forward neural network with a set of parameters θ_G.

Please note that different configurations of the feature projection component F, the matching component M, and the matching loss L_m lead to different matching models. Therefore, WD-Match can improve a matching model by using that model as its underlying model.
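To make the dual form concrete, the following sketch (NumPy; the network sizes, initialization, and parameter names are our illustrative assumptions, not the authors' implementation) builds a two-layer critic G and evaluates the mini-batch estimate of E[G(h_X)] − E[G(h_Y)] on two batches of projected features:

```python
import numpy as np

rng = np.random.default_rng(0)
K, H = 8, 16  # feature dimension and hidden width (illustrative values)

# theta_G: a two-layer feed-forward critic G : R^K -> R
W1, b1 = rng.normal(0.0, 0.1, (K, H)), np.zeros(H)
w2, b2 = rng.normal(0.0, 0.1, (H, 1)), np.zeros(1)

def G(h):
    """Critic: a non-linear layer followed by a linear layer to a scalar."""
    return np.tanh(h @ W1 + b1) @ w2 + b2

def dual_objective(h_X, h_Y):
    """Mini-batch estimate of E[G(h_X)] - E[G(h_Y)] from the dual form."""
    return float(G(h_X).mean() - G(h_Y).mean())

# Two batches of "projected" features from the two domains (synthetic).
h_X = rng.normal(0.0, 1.0, (32, K))
h_Y = rng.normal(0.5, 1.0, (32, K))
print(dual_objective(h_X, h_Y))
```

Maximizing this objective over the critic's parameters (with a Lipschitz constraint on G) yields an estimate of the Wasserstein distance between the two feature distributions.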

Adversarial Training
To learn the model parameters {θ_F, θ_M, θ_G}, WD-Match sets up two training goals: minimizing the Wasserstein distance between P^X_F and P^Y_F, and minimizing the prediction loss incurred by mistakenly predicted matching labels. The training process can be implemented under the adversarial learning framework and amounts to repeated interplay between two learning branches: the regularizer branch and the matching branch.
In the regularizer branch, the objective term in the dual form of the Wasserstein distance (Equation (2)) is approximately written as:

O_G(θ_G; θ_F) = (1/n) Σ_{i=1}^{n} [ G(h_{X_i}) − G(h_{Y_i}) ],    (3)

where the n sequence pairs are randomly sampled from the training set D. Maximizing O_G with respect to the parameters θ_G achieves an approximation of the Wasserstein distance between P^X_F and P^Y_F in the semantic space defined by F:

L_wd(θ_F) = max_{θ_G} O_G(θ_G; θ_F).

To make G a Lipschitz function (up to a constant), and following the practices in (Arjovsky et al., 2017), all of the parameters in θ_G are always clipped to a fixed range [−c, c]. Note that L_wd still takes θ_F as a parameter because it is calculated on the basis of features generated by F.
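The clipping step is a one-liner per parameter tensor; a sketch with NumPy (the helper name and the example parameter values are illustrative assumptions):

```python
import numpy as np

def clip_params(theta_G, c=0.1):
    """Clip every critic parameter to [-c, c], as in WGAN-style training
    (Arjovsky et al., 2017), so that G stays Lipschitz up to a constant."""
    return [np.clip(p, -c, c) for p in theta_G]

# Two parameter tensors of a toy critic; values outside [-0.1, 0.1]
# get clamped, the rest pass through unchanged.
theta_G = [np.array([[0.5, -0.03], [0.2, -0.9]]), np.array([1.5, 0.05])]
theta_G = clip_params(theta_G, c=0.1)
print(theta_G[1])  # -> [0.1  0.05]
```

In practice the same clamp is applied after every gradient-ascent step on θ_G.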
The matching branch simultaneously updates the matching network M and the feature projection network F by seeking the minimization of the Wasserstein distance-regularized matching loss:

L_adv(θ_F, θ_M) = L_m(θ_F, θ_M) + λ · L_wd(θ_F),    (4)

where λ ∈ [0, 1] is a trade-off coefficient to balance the matching loss and the regularizer, and L_m(θ_F, θ_M) is defined as

L_m(θ_F, θ_M) = (1/N) Σ_{i=1}^{N} ℓ_m(M(F(X_i, Y_i)), z_i),    (5)

where ℓ_m(·, ·) is the matching loss function defined over each sequence-sequence-label triple in the training data. It can be, for example, the cross-entropy loss that measures the goodness of the predicted label ẑ = M(F(X, Y)) output by the matching network, compared to the ground-truth label z.
Algorithm 1 shows the general procedure of WD-Match. WD-Match takes the training set D = {(X_i, Y_i, z_i)}_{i=1}^N and a number of hyper-parameters as inputs, and outputs the learned parameters θ_F and θ_M. WD-Match runs multiple rounds until convergence; at each round it estimates the Wasserstein distance of the projected features and then updates the projection component F and the matching component M. At each round, WD-Match alternately maintains two branches. The regularizer branch updates the parameters θ_G, with θ_F fixed². It contains a sub-iteration in which the parameters are optimized in an iterative manner: first, the objective O_G is constructed based on the sampled sequence pairs (line 4 - line 6); then θ_G is updated with gradient ascent (line 7); finally, each parameter in θ_G is clipped to [−c, c] to satisfy the 1-Lipschitz constraint (line 8). The matching branch updates θ_F and θ_M, with θ_G fixed. It first samples another mini-batch from the training data and estimates the regularized loss L_adv using the fixed G (line 11 - line 13). Then, the gradients of the parameters are estimated and used to update the parameters (line 14).
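The alternating procedure described above can be sketched as a framework-agnostic training skeleton (Python; every callable below is a hypothetical placeholder for the real component, not the authors' API):

```python
def wd_match_train(D, rounds, k, lam, c,
                   sample_batch, O_G, L_m,
                   ascend_G, descend_FM, clip_G):
    """Skeleton of the WD-Match training loop (Algorithm 1).

    Placeholder callables:
      sample_batch(D) -> a mini-batch of sequence pairs
      O_G(batch)      -> dual objective estimate (Equation (3))
      L_m(batch)      -> matching loss (Equation (5))
      ascend_G(obj)   -> gradient-ascent step on theta_G
      descend_FM(loss)-> gradient-descent step on (theta_F, theta_M)
      clip_G(c)       -> clip theta_G to [-c, c]
    """
    for _ in range(rounds):
        # Regularizer branch: update theta_G for k steps, theta_F fixed.
        for _ in range(k):
            batch = sample_batch(D)
            ascend_G(O_G(batch))   # maximize the dual objective
            clip_G(c)              # enforce the Lipschitz constraint
        # Matching branch: update theta_F and theta_M, theta_G fixed.
        batch = sample_batch(D)
        loss = L_m(batch) + lam * O_G(batch)   # Equation (4)
        descend_FM(loss)
```

The k inner steps correspond to the sub-iteration of the regularizer branch; the visualization experiments later in the paper use k = 5.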

Experiments
We conducted experiments to test the performances of WD-Match, and analyzed the results.
2 Note that the regularizer does not depend on M , given F .

Algorithm 1: The WD-Match algorithm.

Datasets and Metrics
We use four large-scale, publicly available matching benchmarks: SNLI (Stanford Natural Language Inference) (Bowman et al., 2015), SciTail (Khot et al., 2018), TrecQA (Wang et al., 2007), and WikiQA (Yang et al., 2015). Table 1 provides a summary of the datasets used in our experiments. SNLI is a benchmark for natural language inference. In SNLI, each data record is a premise-hypothesis-label triple. The premise and hypothesis are two sentences, and the label can be "entailment", "neutral", "contradiction", or "-". In our experiments, following the practices in (Bowman et al., 2015), the data with label "-" are ignored. We follow the original dataset partition. Accuracy is used as the evaluation metric for this dataset.
SciTail is an entailment dataset built from multiple-choice science exams and web sentences. Each record is a premise-hypothesis-label triple. The label is "entailment" or "neutral", because scientific facts cannot contradict each other. We follow the original dataset partition. Accuracy is used as the evaluation metric for this dataset.
TrecQA is an answer sentence selection dataset designed for the open-domain question answering setting. We use the raw version of TrecQA, in which questions with no answers or with only positive/negative answers are included. The raw version has 82 questions in the development set and 100 questions in the test set. Mean average precision (MAP) and mean reciprocal rank (MRR) are used as the evaluation metrics for this task.
WikiQA is a retrieval-based question answering dataset based on Wikipedia. We follow the data split of the original paper. This dataset consists of 20.4k training pairs, 2.7k development pairs, and 6.2k testing pairs. We use MAP and MRR as the evaluation metrics for this task.
Specifically, in WD-Match (RE2), F is a stack of blocks, each consisting of multiple convolution layers and multiple attention layers, and M is an MLP. In WD-Match (DecAtt), F is an attention layer followed by an aggregation layer, and M is an MLP; please note that we did not implement the Intra-Sentence Attention in our experiments. In WD-Match (CAFE), F is a highway encoder with an alignment layer and a factorization layer, and M is another highway network.
Please note that we removed the character embeddings and position embeddings in our experiments. In WD-Match (BERT), F is a pre-trained BERT-base model and M is an MLP; please note that, for ease of combination with WD-Match, BERT was only used to extract the sentence features separately in our experiments. The G module for all four models is identical: a non-linear projection layer followed by a linear projection layer.
For all models, the parameters of F and M were directly set to their original settings. In training, all models were trained using the Adam optimizer with the learning rate η₂ tuned amongst {0.0001, 0.0005, 0.001}. The batch size n₂ was tuned amongst {256, 512, 1024}. The trade-off coefficient λ was tuned from [0.0001, 0.01]. The clipping threshold was tuned from [0.1, 0.5]. Word embeddings were initialized with GloVe (Pennington et al., 2014) and fixed during training. We implemented the WD-Match models in TensorFlow.

Table 2 reports the results of WD-Match and the popular baselines on the SNLI test set. The baseline results are reported from their original papers. From the results, we found that WD-Match (RE2) outperformed all of the baselines, including the underlying model RE2. The results indicate the effectiveness of WD-Match and its Wasserstein distance-based regularizer in the asymmetric matching task of natural language inference. We further tested the performances of WD-Match (DecAtt) and WD-Match (BERT), which used DecAtt and BERT as the underlying matching models respectively, to show whether WD-Match can improve a matching method by using the method as its underlying model. From the results shown in Table 2, we can see that on SNLI, WD-Match (DecAtt) outperformed DecAtt in terms of accuracy. Similarly, WD-Match (BERT) improved over BERT by about 0.4 points in terms of accuracy.

Table 3 reports the results of WD-Match and the baselines on the SciTail test set. The baseline results are reported from the original papers. We found that WD-Match (RE2) outperformed all of the baselines. The result further confirms WD-Match's effectiveness in the asymmetric matching task of scientific entailment. We also tested the performances of WD-Match (DecAtt) and WD-Match (BERT) on this dataset.

Models                        MAP(%)  MRR(%)
KVMN (Miller et al., 2016)     70.69   72.65
BiMPM (Wang et al., 2017b)     71.80   73.10
IWAN                           73.30   75.00
CA (Wang and Jiang, 2016)      74.33   75.45
HCRN (Tay et al., 2018c)       74.30   75.60
RE2 (Yang et al., 2019)        74

We list the number of parameters of the different text matching models in Table 2. Compared to the underlying model, the additional parameters of WD-Match come from the regularizer component G. We can see that the parameters of the regularizer component G are far fewer than those of the underlying model. The G module is implemented as a two-layer MLP (the number of neurons in the second layer is set to one). Therefore, the additional computing cost comes from the training of the two-layer MLP, which is of O(T · N · K), where T is the number of training iterations, N the number of training examples, and K the number of neurons in the first layer of the MLP (without considering the computational cost of the activation function). We can see that the additional computing overhead is much lower than that of the underlying methods, which usually learn much more complex neural networks for the feature projection and the matching.
Summarizing the results above and the results reported in Section 4.3, we can conclude that WD-Match is a general yet strong framework that can improve different matching models by using them as its underlying matching model.

Visualization of the Distributions of Feature Vectors
Figure 1(a) shows that there exists a gap between the two groups of feature vectors, due to the heterogeneous nature of the texts from the two asymmetrical domains.
We conducted experiments to analyze how the feature vectors (i.e., h_X and h_Y) generated by WD-Match are distributed in the common semantic space, using WD-Match (RE2) as an example. Specifically, we trained an RE2 model and a WD-Match (RE2) model on the SciTail dataset. Note that in this experiment the adversarial training step k was set to 5; that is, WD-Match (RE2) repeats the regularizer branch 5 times before the matching branch. We recorded all of the training feature vectors (i.e., h_X and h_Y) and illustrated them in Figure 3 by t-SNE. The orange 'X' and green 'Y' correspond to P^X_F and P^Y_F of RE2; the dark blue 'X' and red 'Y' correspond to P^X_F and P^Y_F of WD-Match (RE2), respectively. As we can see from Figure 3, the feature vectors from RE2 are separately distributed, while the feature vectors from WD-Match (RE2) are indistinguishable. This demonstrates that, compared to the underlying model RE2, WD-Match (RE2) distributes the feature vectors in the semantic space better and faster.

Convergence and Effects of the Wasserstein Distance-based Regularizer

We conducted experiments to test how the Wasserstein distance-based regularizer guides the training of matching models. Specifically, we tested the WD-Match (RE2) and RE2 models generated at each training epoch. The accuracy curves on the SNLI development set are illustrated in Figure 4 (denoted as "WD-Match (RE2)-Accuracy" and "RE2-Accuracy"). Comparing these two training curves, we can see that WD-Match (RE2) outperformed RE2 as the training approached convergence (after about 15 epochs). We can conclude that WD-Match (RE2) obtained higher accuracy than RE2.
To investigate how the Wasserstein distance guides the training of matching models, we recorded the estimated Wasserstein distances at all of the training epochs of RE2 and WD-Match (RE2), based on the converged G network. The curve "WD-Diff" shows the difference between the Wasserstein distance of RE2 and that of WD-Match (RE2) at each training epoch (i.e., L_wd(θ_F) of RE2 minus L_wd(θ_F) of WD-Match (RE2)). From the curve we can see that at the beginning of the training (i.e., epochs 1 to 5), "WD-Diff" was near zero. As training went on (i.e., epochs 5 to 30), the Wasserstein distance of WD-Match (RE2) became smaller than that of RE2 (the WD-Diff curve is above the zero line), which means that WD-Match (RE2)'s feature projection module F was guided to move the feature vectors together more thoroughly and faster, making them more suitable for matching. The results indicate that WD-Match achieved its design goal of guiding the distributions of the projected feature vectors.
It is interesting to note that, comparing all three curves in Figure 4, the WD-Diff curve is close to zero at the beginning of the training, and the WD-Match (RE2)-Accuracy and RE2-Accuracy curves are similar at the beginning. As training went on (after epoch 10), the Wasserstein distance difference became larger. At the same time, the accuracy gap between WD-Match (RE2)-Accuracy and RE2-Accuracy also became larger. The results clearly reflect the effect of the Wasserstein distance-based regularizer: minimizing the regularizer leads to a better distribution of the feature vectors in terms of matching.

Conclusion and Future Work
In this paper, we proposed a novel Wasserstein distance-based regularizer to improve sequence representations for text matching in asymmetrical domains. The method, called WD-Match, amounts to an adversarial interplay between two branches: estimating the Wasserstein distance given the projected features, and minimizing the Wasserstein distance-regularized matching loss. We showed that the regularizer helps WD-Match to distribute the generated feature vectors well in the semantic space, making them more suitable for matching. Experimental results on four benchmarks showed that WD-Match outperformed the baselines, including its underlying models. Empirical analysis showed the effectiveness of the Wasserstein distance-based regularizer in text matching.
In the future, we plan to study different regularizers for the asymmetrical text matching task, to further explore their effectiveness in bridging the gap between asymmetrical domains.