Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive Learning

We propose to solve the natural language inference problem without any supervision from the inference labels via task-agnostic multimodal pretraining. Although recent studies of multimodal self-supervised learning also represent the linguistic and visual context, their encoders for different modalities are coupled. Thus they cannot incorporate visual information when encoding plain text alone. In this paper, we propose Multimodal Aligned Contrastive Decoupled learning (MACD) network. MACD forces the decoupled text encoder to represent the visual information via contrastive learning. Therefore, it embeds visual knowledge even for plain text inference. We conducted comprehensive experiments over plain text inference datasets (i.e. SNLI and STS-B). The unsupervised MACD even outperforms the fully-supervised BiLSTM and BiLSTM+ELMO on STS-B.


Introduction
Humans do not need supervision to perform natural language inference (NLI). Supervision is necessary for applications in human-defined domains. For example, humans need to be taught what a noun is before they can do POS tagging, or what a tiger is in WordNet before they can classify an image of a tiger in ImageNet. However, for NLI, people are able to infer that (a) "A man plays a piano" contradicts (b) "A man plays the clarinet for his family" without any supervision from NLI labels. In this paper, we define such inference as a general process of establishing associations and inferences between texts, rather than strictly classifying whether two sentences entail or contradict each other. Inspired by this, we raise the core problem of this paper: given a pair of natural language sentences, can machines infer their relationship without any supervision from inference labels?
In his highly acclaimed paper, neuroscientist Moshe Bar claims that "predictions rely on the existing scripts in memory, which are the result of real as well as of previously imagined experiences" (Bar, 2009). The exemplar theory argues that humans use similarity to recognize different objects and make decisions (Tversky and Kahneman, 1973; Homa et al., 1981).
Analogy helps humans understand a novel object by linking it to a similar representation existing in memory (Bar, 2007). Such linking is facilitated by the object itself and its context (Bar, 2004). Context information has been widely applied in self-supervised learning (SSL) (Devlin et al., 2018; de Sa, 1994; He et al., 2020). Adapting context to NLI is even more straightforward. A simple idea of constant conjunction is that A causes B if they are constantly conjoined. Although constant conjunction contradicts "correlation is not causation", modern neuroscience has confirmed that humans use it for reasoning in their mental world (Levy and Steward, 1983). For example, Hebbian theory holds that an increase in synaptic efficacy arises from a presynaptic cell's repeated and persistent stimulation of a postsynaptic cell (Hebb, 2005). In natural language, the object and its context can be naturally used to determine the inference. For example, (a) contradicts (b) because they cannot happen simultaneously in the same context.
The context representation learned by SSL (e.g. BERT (Devlin et al., 2018)) has already achieved great success in NLP. From the perspective of context, these models (Devlin et al., 2018; Liu et al., 2019) learn sentence-level contextual information (i.e. by the next sentence prediction task) and word-level contextual information (i.e. by the masked language model task).
Besides linguistic contexts, humans also link other modalities (e.g. visions, voices) to novel inputs (Bar, 2009). Even if the goal is to reason about plain texts, other modalities still help (although they are not provided as inputs) (Kiela et al., 2018). For example, if only textual information is used, it is difficult to entail the contradiction between (a) and (b). We need the commonsense that a man only has two arms, with which he cannot play the piano and the clarinet simultaneously. This commonsense is hard to obtain from text. However, if we link the sentences to their visual scenes, the contradiction is much clearer because the two scenes cannot happen in the same visual context. We therefore argue that incorporating other modalities is necessary for unsupervised natural language inference.
The idea of adopting multimodality in SSL is not new. Following (Su et al., 2020), we briefly divide previous multimodal SSL approaches into two categories based on their encoder architectures. As shown in Fig. 1a, the first category uses one joint encoder to represent the multimodal inputs (Sun et al., 2019; Alberti et al., 2019; Li et al., 2019; Su et al., 2020). Obviously, if the downstream task involves only plain text, we cannot extract a text-only representation from the joint encoder. So the first category is infeasible for natural language inference. The second category (Lu et al., 2019; Tan and Bansal, 2019; Sun et al., 2019) first encodes the text and the image separately by two encoders. Then it represents the multimodal information via a joint encoder over the lower-layer encoders. This is shown in Fig. 1b. Although the textual representation can be extracted from the text encoder in the lower layer, such a representation does not go through the joint learning module and contains little visual knowledge. In summary, the encoders in previous multimodal SSL approaches are coupled. If only textual inputs are given, they cannot effectively incorporate visual knowledge in their representations. Thus their help in entailing the contradiction between (a) and (b) is limited.
In order to benefit from multimodal data in plain text inference, we propose the Multimodal Aligned Contrastive Decoupled learning (MACD) network. This is shown in Fig. 1c. Its text encoder is decoupled, taking only the plain text as input. Thus it can be directly adapted to downstream NLI tasks. Besides, we use a multimodal contrastive loss between the text encoder and the image encoder, thereby forcing the text representation to align with the corresponding image. Therefore, even though the text encoder in MACD only takes plain text as input, it still represents visual knowledge. In the downstream plain text inference tasks, without taking images as input, the text encoder of MACD still implicitly incorporates the visual knowledge learned by the multimodal contrastive loss. Note that we do not need a decoupled image encoder in the SSL. So the image encoder in Fig. 1c in MACD also takes texts as inputs to provide a more precise image encoder. We will elaborate on this in section 2.1.

Problem Formulation
We outline the general decoupled SSL process of MACD in section 2.1, and the downstream unsupervised NLI task in section 2.2.

Decoupled Multimodal SSL
For pretraining MACD, we use a multimodal text2image dataset D_t2i with N samples. Each sample {x_i, y_i} consists of a pair of a text x_i and an image y_i, which describe the same context. It is straightforward to extend our method to modalities other than texts and images.
MACD learns from D_t2i. Since the text2image relation is many-to-many, we use energy-based models to represent their correlations. We first encode x_i and y_i into one pretext-invariant representation space (Misra and van der Maaten, 2020). The encoders are denoted by f(x_i; θ_f) and g(x_i, y_i; θ_g), respectively. We define the energy function as

σ(x, y) = d(f(x; θ_f), g(x, y; θ_g))    (1)

where f(x_i; θ_f) denotes the text encoder, g(x_i, y_i; θ_g) denotes the image encoder, and d is a non-parametric distance metric (e.g. cosine). In the rest of this paper, we will write f(x) and g(x, y) instead of f(x; θ_f) and g(x, y; θ_g) for convenience.
Note that the text encoder f(x) only takes the text as input, while the image encoder g(x, y) takes both the image and the text as input. The higher the value of the energy function σ(·), the higher the probability that x and y are in the same context, and vice versa. The forms of the encoders have the following advantages: • The text encoder f(x) and the image input y are decoupled. Therefore we can represent x without knowing y. This allows us to use f(x) in the downstream plain text inference.
• g(x, y) represents the one-to-many relationship via implicitly introducing "predictive sparse coding" (Gregor and LeCun, 2010). One image has multiple corresponding texts. To use energy-based models to represent the one-to-many relationship, one common approach is to introduce a noise vector z to allow multiple predictions through one image (Bojanowski et al., 2018). Note that such a z can be quickly estimated from the given text x and image y (Gregor and LeCun, 2010). In our proposed image encoder g(x, y), although z is not explicitly introduced, the encoder allows multiple predictions for one image via taking different texts as input. Besides, it allows the image to interact with the text in the inner computation, which is an implicit alternative to the predictive z.
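As a concrete illustration, the energy function of Eqn. (1) with a cosine metric can be sketched in a few lines of NumPy. The 768-dimensional random vectors below are hypothetical placeholders for the actual encoder outputs:

```python
import numpy as np

def cosine(u, v):
    """Non-parametric distance metric d(., .): cosine similarity."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def energy(f_x, g_xy):
    """sigma(x, y) = d(f(x), g(x, y)); a higher value means x and y are
    more likely to come from the same context."""
    return cosine(f_x, g_xy)

# Hypothetical 768-d embeddings for one text/image pair.
rng = np.random.default_rng(0)
f_x, g_xy = rng.normal(size=768), rng.normal(size=768)
s = energy(f_x, g_xy)  # a scalar in [-1, 1]
```

In the actual model, f_x would come from the decoupled text encoder and g_xy from the text-conditioned image encoder; only the distance computation is shown here.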

Downstream Unsupervised NLI
We use the representation from the pre-trained multimodal SSL to predict the relations of natural language sentence pairs under the unsupervised learning scenario. The testing data can be formulated as sentence pairs x_i^1 and x_i^2, where z_i indicates the relation between x_i^1 and x_i^2. Under the unsupervised setting, we predict z_i for a given pair by the similarity of f(x_i^1) and f(x_i^2) (e.g. cosine similarity).
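The unsupervised prediction step thus reduces to scoring sentence pairs by the similarity of their decoupled text-encoder outputs. A minimal sketch, with hypothetical encodings standing in for f(x_i^1) and f(x_i^2):

```python
import numpy as np

def pair_similarity(f_x1, f_x2):
    """Score a sentence pair by the cosine similarity of the decoupled
    text-encoder outputs; no inference labels are used."""
    return float(np.dot(f_x1, f_x2) /
                 (np.linalg.norm(f_x1) * np.linalg.norm(f_x2)))

# Hypothetical sentence-pair encodings.
rng = np.random.default_rng(1)
v1, v2 = rng.normal(size=768), rng.normal(size=768)
score = pair_similarity(v1, v2)  # in [-1, 1]; higher = more related
```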

Methods
This section elaborates our methodology. In section 3.1, we show how we maximize the cross-modal mutual information (MI) for decoupled representation learning. In section 3.2, we show how we incorporate the MI of local structures. We elaborate on the encoders in section 3.3. In section 3.4, in order to solve the catastrophic forgetting problem, we use lifelong learning regularization to anchor the text modality.

Decoupled Representation Learning by Cross-Modal Mutual Information Maximization
As discussed in section 1, the query object and its context determine the inference. NLI depends on whether the two sentences are in the same context. In this paper, we consider context from different modalities (e.g. texts or images). Mutual information maximization has become a trend in SSL (Tian et al., 2019; Hjelm et al., 2019). For cross-modal SSL, we also leverage the mutual information I(X, Y) to represent the correspondence between the text and the image. Intuitively, high mutual information means that the text and the image are well-matched. More formally, the goal of multimodal representation learning is to maximize their mutual information:

θ_f*, θ_g* = argmax_{θ_f, θ_g} I(X, Y)    (2)

Eqn. (2) is intractable and thereby hard to compute. To approximate and maximize I(X, Y), we use Noise-Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010; Oord et al., 2018). First, we use the function σ_global(x, y) to represent the density ratio P(x|y)/P(x), where σ_global(x, y) : X × Y → R is not a real probability and can be unnormalized. Here we use the notation "global" for the representation learning of a complete text or a complete image, to distinguish it from the local structures in section 3.2. To compute the cross-modal mutual information, we first encode x and y into f_global(x) and g_global(y), respectively. Then we use the similarity of their encodings to model P(x|y)/P(x). Note that g_global(y) is a specific form of g(x, y) in Eqn. (1). So f_global(x) and g_global(y) satisfy the forms of f and g in Eqn.
(1). We will show how to incorporate the linguistic input when designing the encoder of local visual structures in section 3.3. We follow (Misra and van der Maaten, 2020) to compute the pretext-invariant energy function by the exponential function of the cosine similarity:

d(f_global(x), g_global(y)) = f_global(x) · g_global(y) / (||f_global(x)|| ||g_global(y)||)    (3)

σ_global(x, y) = exp(d(f_global(x), g_global(y)) / τ_σ)    (4)

where τ_σ is a hyper-parameter of temperature.
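The pretext-invariant energy, i.e. the exponential of the cosine similarity scaled by τ_σ, can be sketched as follows; the embedding size and the temperature value are illustrative only:

```python
import numpy as np

def sigma_global(fx, gy, tau_sigma=0.1):
    """Pretext-invariant energy: exponential of the cosine similarity
    between f_global(x) and g_global(y), scaled by the temperature."""
    cos = np.dot(fx, gy) / (np.linalg.norm(fx) * np.linalg.norm(gy))
    return float(np.exp(cos / tau_sigma))

rng = np.random.default_rng(0)
fx, gy = rng.normal(size=64), rng.normal(size=64)
e = sigma_global(fx, gy)  # unnormalized and strictly positive
```

Note that the output is an unnormalized score, as required by the NCE formulation below: it only needs to be proportional to the density ratio, not a probability.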
To estimate σ_global(x, y) and maximize the mutual information in Eqn.
(2), the NCE loss (Oord et al., 2018) provides a valid toolkit. By taking the posterior probability P(y|x), the NCE loss is defined as:

L_NCE:P(y|x)(X, Y) = −E_{x∼P̃(x)} E_{y∼P(y|x)} [ log( σ_global(x, y) / (σ_global(x, y) + Σ_{y′∼P(y)} σ_global(x, y′)) ) ]    (5)

where P̃(x) denotes the real distribution of x, P(y|x)P̃(x) denotes the distribution of y for a given x, and P(y) denotes the noise distribution of y. Thus minimizing Eqn. (5) can be seen as identifying the positive image y ∼ P(y|x) for a given x from the noise image distribution y ∼ P(y).
It has been proved (Oord et al., 2018) that L_NCE:P(y|x)(X, Y) provides a lower bound of I(X, Y):

I(X, Y) ≥ log(N) − L_NCE:P(y|x)(X, Y)    (6)

where N denotes the number of noise samples and can be seen as a constant. So instead of maximizing I(X, Y) directly, we minimize L_NCE:P(y|x)(X, Y) to maximize its lower bound. Symmetrically, we also compute the NCE loss by taking the posterior probability P(x|y). We define L_NCE:P(x|y) as:

L_NCE:P(x|y)(X, Y) = −E_{y∼P̃(y)} E_{x∼P(x|y)} [ log( σ_global(x, y) / (σ_global(x, y) + Σ_{x′∼P(x)} σ_global(x′, y)) ) ]    (7)

Eqn. (7) can be seen as identifying the positive text x ∼ P(x|y) for a given y from the noise text distribution x ∼ P(x).
By combining Eqn. (5) and Eqn. (7), we derive the loss for global MI maximization:

L_global^NCE(X, Y) = L_NCE:P(y|x)(X, Y) + L_NCE:P(x|y)(X, Y)    (8)

Here we say the MI is global because it is over the complete text and the complete image, in contrast to the local structures in section 3.2.
Negative sampling In practice, to compute L_NCE:P(y|x)(X, Y), we need to construct noise samples for the positive samples. We use all the {x_i, y_i} pairs in the same minibatch from D_t2i as X, Y. Each y_i is the positive sample of x_i (i.e. P(y_i|x_i) = 1). For each x_i ∈ X, the noise samples y in Eqn. (5) are drawn from Y. Likewise, to compute L_NCE:P(x|y)(X, Y) in Eqn. (7), we treat x_i as the positive sample for y_i, and the other texts from the same minibatch as the noise samples.
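This in-batch negative sampling scheme amounts to the familiar in-batch InfoNCE objective in both directions. A framework-agnostic NumPy sketch; the batch size, embedding dimensionality, and temperature below are illustrative:

```python
import numpy as np

def info_nce_bidirectional(F_txt, G_img, tau=0.1):
    """In-batch InfoNCE over a minibatch of B text/image pairs.
    For each x_i the positive is y_i; the other y_j in the batch serve
    as the noise samples (and symmetrically for the texts). tau plays
    the role of the temperature tau_sigma."""
    F_txt = F_txt / np.linalg.norm(F_txt, axis=1, keepdims=True)
    G_img = G_img / np.linalg.norm(G_img, axis=1, keepdims=True)
    logits = F_txt @ G_img.T / tau          # (B, B) pairwise energies
    def xent(l):
        # cross-entropy with the positive pairs on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(logp)))
    return xent(logits) + xent(logits.T)    # L_{NCE:P(y|x)} + L_{NCE:P(x|y)}

rng = np.random.default_rng(0)
F_txt, G_img = rng.normal(size=(8, 32)), rng.normal(size=(8, 32))
loss_random = info_nce_bidirectional(F_txt, G_img)
loss_aligned = info_nce_bidirectional(F_txt, F_txt)  # perfectly matched pairs
```

Perfectly aligned text/image embeddings yield a much smaller loss than random pairings, which is exactly the behavior the contrastive objective exploits.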

MI Optimization for Local Structures
In this subsection, we incorporate the local information in multimodal contrastive learning. As demonstrated in DIM (Hjelm et al., 2019), local information plays a greater role in self-supervised learning than the global information.
We follow BERT (Devlin et al., 2018) and DIM to use words and patches as the local structures for the text and the image, respectively. We maximize the MI between the cross-modal local/global structures. We denote a sentence x with L words as x^(1) · · · x^(L), and an image y with M × M patches as y^(1) · · · y^(M^2).
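For illustration, splitting an image tensor into M × M non-overlapping patches might look like the sketch below. This patch layout is an assumption for clarity; in MACD the patch representations actually come from the spatial positions of the image encoder's feature map (see section 4):

```python
import numpy as np

def to_patches(image, M):
    """Split an (H, W, C) image into the M*M non-overlapping patches
    y^(1), ..., y^(M^2); assumes H and W are divisible by M."""
    H, W, C = image.shape
    ph, pw = H // M, W // M
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(M) for j in range(M)]
    return np.stack(patches)  # (M*M, H/M, W/M, C)

patches = to_patches(np.zeros((224, 224, 3)), 7)  # 49 patches of 32x32
```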
Similar to the objective of the representation learning of global information, we use NCE as the objective of local information representation learning. The difference is that we use the local structure-based alignment to calculate the energy function, while there is no such objective in the representation learning of global information. This objective allows representation learning to emphasize the alignments of local structures between different modalities, such as the alignment between the word "piano" and the corresponding image patches.
Specifically, we use L_local^NCE(X, Y) to represent the loss of local information representation learning. The computation of L_local^NCE(X, Y) follows Eqns. (5), (7) and (8), except that we replace σ_global with σ_local, which is based on the local information alignment. We will elaborate on σ_local in section 3.3.

Alignment-based Local Energy Function and Representation Learning
In this subsection, we show the details of the local energy function σ local and the encoders for local structures.
Following the form of Eqn.
(1), we denote the encoder for the local structures of text as f_word(x^(i)). We denote the joint encoder for patches as g_local(x, y^(i)), which represents the linguistic information of patch y^(i). Note that the encoder f_word(x^(i)) is still decoupled and represents the local linguistic structures without taking the image as input. On the other hand, the encoder g_local(x, y^(i)) for the local visual structures explicitly incorporates the linguistic information, which makes it more precise, as discussed in section 2.1. For a sentence x with L words x^(1) · · · x^(L), we represent its local information by encoding it into a local feature map f_word(x) = (f_word(x^(1)), · · · , f_word(x^(L))) ∈ R^{dim×L}. For an image y with M × M patches y^(1) · · · y^(M^2), we represent its spatial locality by encoding it into a feature map g_patch(y) = (g_patch(y^(1)), · · · , g_patch(y^(M^2))).
The local information across modalities has obvious correlation characteristics (Xu et al., 2018). For example, a word is only related to some patches of the image, but not to the other patches. As shown in Fig. 1c, our proposed image encoder is coupled with the text representation. Therefore we assign different weights to the local structures to achieve a more precise image encoder. This is achieved by the attention mechanism in the joint encoder:

g_local(x, y^(i)) = Σ_{j=1}^{L} [ exp(attn_{j,i}/τ_c) / Σ_{k=1}^{L} exp(attn_{k,i}/τ_c) ] f_word(x^(j))    (9)

where τ_c denotes the temperature and attn_{i,j} denotes the attention of the i-th word to the j-th patch:

attn_{i,j} = d(f_word(x^(i)), g_patch(y^(j)))    (10)

We compute the alignment score for the local textual structures by:

σ_local(x, y^(i)) = d(g_patch(y^(i)), g_local(x, y^(i)))    (11)

Here we abuse the notation of σ_local since we will use σ_local(x, y^(i)) to compute σ_local(x, y).
Symmetrically, we also compute the alignment score σ_local(x^(i), y) for the local visual structures by exchanging the roles of the words and the patches (Eqn. (12)). We then compute the energy function σ_local(x, y) of x and y by aggregating the alignment scores of all local structures (Eqn. (13)). How the model uses the attention mechanism to represent the interactions among local structures, and how the energy function is computed, are shown in Fig. 2.
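A minimal NumPy sketch of the word-to-patch direction of this alignment, assuming dot-product attention scores and a softmax over the words for each patch (these specific choices are illustrative assumptions):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def local_alignment(f_word, g_patch, tau_c=0.1):
    """f_word: (L, dim) word features; g_patch: (M2, dim) patch features.
    Returns sigma_local(x, y^(i)) for every patch i, where the joint
    encoding g_local(x, y^(i)) is an attention-weighted sum of the word
    features (dot-product attention is an assumption here)."""
    attn = f_word @ g_patch.T            # (L, M2): word-to-patch attention
    w = softmax(attn / tau_c, axis=0)    # weights over words, per patch
    g_local = w.T @ f_word               # (M2, dim): text-side encoding per patch
    # cosine alignment between each patch and its joint encoding
    num = (g_patch * g_local).sum(axis=1)
    den = np.linalg.norm(g_patch, axis=1) * np.linalg.norm(g_local, axis=1)
    return num / den                     # (M2,) per-patch alignment scores

rng = np.random.default_rng(0)
scores = local_alignment(rng.normal(size=(7, 32)), rng.normal(size=(49, 32)))
```

The symmetric patch-to-word direction follows by transposing the attention matrix and swapping the roles of f_word and g_patch.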

Anchor Text via Lifelong Learning
In this subsection, we illustrate how to solve the catastrophic forgetting problem by the lifelong learning regularization.
If we only use the loss in Eqn. (8), the text encoder f(x; θ_f) will tend to learn only vision-related features for the text. Since our downstream problem is over plain text, NLI still relies more on textual features than on visual features. Compared with single-modality unsupervised natural language representation learning (Devlin et al., 2018), the multimodal model would even perform worse. Similar phenomena, called catastrophic forgetting or negative transfer (Sun et al., 2020), often occur in multi-task learning.
To avoid catastrophic forgetting, we keep the model's representation of general text while ensuring that it learns visual features. More generally, since the downstream task contains only data of one modality (i.e. plain text), we anchor this modality in the multimodal SSL phase. We add lifelong learning regularization (Li and Hoiem, 2017) to achieve this modality anchoring. For the text encoder, we keep its original textual representation (e.g. learned by the masked language model (MLM) and next sentence prediction tasks in BERT) while learning new visual knowledge. To do this, we follow (Li and Hoiem, 2017) and introduce the distance from the current text encoder to the original text encoder as a training loss.
Specifically, we use BERT (Devlin et al., 2018) to initialize our text encoder f(x). During the multimodal SSL, we keep the textual representation consistent with the original BERT. Following the ablation study in DistilBERT (Sanh et al., 2019), we use the knowledge distillation loss (Hinton et al., 2015) and the cosine loss as regularization:

L_KD = −Σ_i softmax(f̂_i(x)/τ) log softmax(f_i(x)/τ)

L_cos = 1 − cos(f(x), f̂(x))

where f̂(x) denotes the textual representation by the original BERT encoder, f_i(x) and f̂_i(x) denote the i-th dimension of f(x) and f̂(x), and τ is the temperature.
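A minimal sketch of this anchoring regularizer, assuming the DistilBERT-style soft-target cross-entropy plus a cosine term (the exact functional forms and the temperature value here are illustrative):

```python
import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def anchor_loss(f_student, f_teacher, tau=2.0):
    """Lifelong-learning regularizer: soft-target cross-entropy between
    the current text representation and the frozen original-BERT
    representation, plus a cosine term (forms assumed from DistilBERT)."""
    p_t = softmax(f_teacher / tau)
    log_p_s = np.log(softmax(f_student / tau))
    kd = -float(np.sum(p_t * log_p_s))          # distillation loss
    cos = float(np.dot(f_student, f_teacher) /
                (np.linalg.norm(f_student) * np.linalg.norm(f_teacher)))
    return kd + (1.0 - cos)                     # plus the cosine loss

rng = np.random.default_rng(0)
teacher = rng.normal(size=16)
loss_same = anchor_loss(teacher, teacher)       # encoder has not drifted
loss_drift = anchor_loss(rng.normal(size=16), teacher)
```

The loss is minimized when the current encoder matches the original BERT representation, which is exactly the anchoring behavior we want during multimodal pretraining.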
By combining the lifelong learning regularization, we obtain the final loss for SSL:

θ_f*, θ_g* = argmin_{θ_f, θ_g} [ L_global^NCE(X, Y) + L_local^NCE(X, Y) + γ L_KD + β L_cos ]

where L_KD and L_cos denote the knowledge distillation loss and the cosine loss above, and γ and β are their weights.

Experiments

Setup
All the experiments run on a machine with 4 Nvidia Tesla V100 GPUs. Datasets We use Flickr30k (Young et al., 2014) and COCO (Lin et al., 2014) as the text2image dataset D_t2i for self-supervised learning. We use STS-B (Cer et al., 2017) and SNLI (Bowman et al., 2015) as the downstream NLI tasks for evaluation. STS-B is a collection of sentence pairs, each of which has a human-annotated similarity score from 1 to 5. The task is to predict these scores. We follow GLUE (Wang et al., 2018) and use the Pearson and Spearman correlation coefficients as metrics. SNLI is a collection of human-written English sentence pairs, manually labeled with the categories entailment, contradiction, and neutral. Note that for STS-B, some sentence pairs drawn from image captions overlap with Flickr30k. To avoid this potential information leak, we remove all sentence pairs drawn from image captions in STS-B to construct a new dataset, STS-B-filter. Similarly, we remove all sentence pairs in SNLI whose corresponding images occur in the training split of D_t2i to construct SNLI-filter.
The statistics of these datasets are shown in Table 1.

Model Details
Encoder details We use BERT-base as the text encoder f_global. The local information f_word(x^(i)) is the feature vector of the i-th word output by BERT.
We use ResNet-50 as the image encoder g_global. We use the encoding before the final pooling layer as the representations of the M^2 patches g_patch(y^(i)). To guarantee that the image encoder and the text encoder are in the same space, we project the feature vectors of the image encoder to a dimension of 768, which is the dimension of BERT. Unsupervised NLI We compute the similarity of two sentences via the cosine of their representations learned by MACD. For STS-B, such similarities are directly used to compute the Pearson and Spearman correlation coefficients. For SNLI, we make inferences based on whether the similarity reaches certain thresholds. More specifically, if the similarity ≥ ψ_1, we predict "entailment"; if the similarity < ψ_2, we predict "contradiction"; otherwise we predict "neutral".
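The thresholding rule can be written down directly; the threshold values below are hypothetical, since in practice ψ_1 and ψ_2 are chosen by grid search:

```python
def predict_nli(sim, psi1, psi2):
    """Map the cosine similarity of a sentence pair to an SNLI label.
    Assumes psi1 >= psi2; both thresholds are tuned by grid search."""
    if sim >= psi1:
        return "entailment"
    if sim < psi2:
        return "contradiction"
    return "neutral"

# Hypothetical threshold values for illustration.
label = predict_nli(0.82, psi1=0.7, psi2=0.2)  # -> "entailment"
```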
Competitors We compare MACD with the single-modal pre-training model BERT, and the multimodal pre-training models LXMERT (Tan and Bansal, 2019) and VilBert (Lu et al., 2019). Both LXMERT and VilBert use the network architecture in Fig. 1b. We extract their lower-layer text encoders for unsupervised representation and fine-tuning. We also compare MACD with classical NLP models, including BiLSTM and BiLSTM+ELMO (Peters et al., 2018).
Hyper-parameters We list the hyper-parameters below. For ψ_1 and ψ_2, we use the best values chosen by grid search over the range {−1, −0.95, −0.9, · · · , 1}. For τ_σ and τ_c, we use the best values chosen by grid search over the range {0.01, 0.1, 1}. For τ, γ and β, we follow their settings in DistilBERT (Sanh et al., 2019).

Main Results
We evaluate MACD by unsupervised NLI. Table 3 shows the results on STS-B. MACD achieves significantly higher effectiveness than the single-modal pre-trained model BERT and the multimodal pre-trained models LXMERT and VilBert. Note that LXMERT and VilBert use more text2image corpora for pre-training than MACD. This verifies that the joint encoders in previous multimodal SSL cannot represent visual knowledge well in their text encoders, so their adaptation to single-modal problems is limited. To our surprise, the unsupervised MACD even outperforms fully-supervised models such as BiLSTM and BiLSTM+ELMO. Here the results of BiLSTM and BiLSTM+ELMO on STS-B are directly derived from GLUE (Wang et al., 2018). This verifies the effectiveness of MACD.

Table 3: Effectiveness of unsupervised learning on STS-B. Baselines with "(sup.)" are trained with supervised labels; the other methods are unsupervised. "P." and "S." denote the Pearson and Spearman correlation coefficients, respectively.
We also report the results of MACD on SNLI under the unsupervised setting in Table 4, which verifies the effectiveness of our approach for unsupervised NLI. The experimental results suggest that we achieve natural language inference via multimodal self-supervised learning without any supervised inference labels. Since MACD+Flickr30k performs better than MACD+COCO in most cases, we will only evaluate MACD+Flickr30k in the remaining experiments. We visualize the distributions of the cosine similarities for samples of different labels in SNLI in Fig. 3 by boxplot. We find obvious distribution patterns for MACD. In contrast, the distributions of the other pre-training models have lower correlations with the NLI labels.

Fine-tuning
We also evaluate the effectiveness of MACD when fine-tuned under the semi-supervised learning setting. More specifically, we first initialize the parameters of the text encoder as in MACD, then fine-tune it on the supervised training samples of the downstream tasks. The results are shown in Table 5. MACD also outperforms the other approaches. For example, on SNLI-filter, the accuracy of MACD increases by 0.97 compared to the best competitor (i.e. BERT). Note that MACD is the only multimodal method that performs better than BERT. The other multimodal approaches (i.e. LXMERT and VilBert) perform even worse than the original BERT, although they also initialize their text encoders with BERT and use more text2image data for SSL than MACD. This verifies the effectiveness of the proposed decoupled contrastive learning model. To further verify the natural language representation learned by self-supervised learning and remove the influence of the neural network architecture (i.e., BERT), Hjelm et al. (2019) suggest training models directly over the features learned by SSL. Following their settings (Hjelm et al., 2019), we use a linear classifier (SVM) and a non-linear classifier (a single-layer perceptron, marked as SLP) over the features learned by SSL. The results are shown in Table 6. MACD outperforms the competitors by a large margin. Similar to the results in Table 5, although MACD, LXMERT, and VilBert are all trained on multimodal data, only MACD performs better than the original text encoder (i.e. BERT).

Ablations
In addition to the decoupled contrastive learning model, we propose two optimizations: taking the local structures into account, and regularizing the model on the text modality via lifelong learning. To verify the effectiveness of the two optimizations, we compare MACD with its ablations. The results of unsupervised NLI are shown in Table 7.

Case studies: Nearest-neighbor analysis
To give a deeper insight into the learned representation, we analyze the k nearest neighbors over the representations. For a query sentence randomly sampled from Flickr30k, we show its 3 nearest sentences according to their L1 distances.

Conclusion
In this paper, we study multimodal self-supervised learning for unsupervised NLI. The major flaw of previous multimodal SSL methods is that they use a joint encoder to represent the cross-modal correlations, which prevents integrating visual knowledge into a standalone text encoder. We propose Multimodal Aligned Contrastive Decoupled learning (MACD), which learns to represent visual knowledge while using only texts as inputs. In the experiments, our proposed approach consistently surpasses other methods by a large margin.