Cross-Modality Relevance for Reasoning on Language and Vision

This work addresses the challenge of learning and reasoning over language and vision data for related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR). We design a novel cross-modality relevance module, used in an end-to-end framework, that learns the relevance representation between components of various input modalities under the supervision of a target task; this is more generalizable to unobserved data than merely reshaping the original representation space. In addition to modeling the relevance between textual entities and visual entities, we model the higher-order relevance between entity relations in the text and object relations in the image. Our proposed approach shows competitive performance on two different language and vision tasks using public benchmarks and improves the state-of-the-art published results. The alignments of the input spaces and the relevance representations learned on the NLVR task boost the training efficiency of the VQA task.


Introduction
Real-world problems often involve data from multiple modalities and resources. Solving a problem at hand usually requires the ability to reason about the components across all the involved modalities. Examples of such tasks are visual question answering (VQA) (Antol et al., 2015; Goyal et al., 2017) and natural language visual reasoning (NLVR) (Suhr et al., 2017, 2018). One key to intelligence here is to identify the relations between the modalities and to combine and reason over them for decision making. Deep learning is a prominent technique for learning representations of the data for various target tasks, and it has achieved superior performance when trained on large-scale corpora (Devlin et al., 2019). However, learning joint representations for cross-modality data is a challenge because deep learning is data-hungry. There are many recent efforts to build such multi-modality datasets (Lin et al., 2014; Krishna et al., 2017; Antol et al., 2015; Suhr et al., 2017; Goyal et al., 2017; Suhr et al., 2018). Researchers have developed models by joining features, aligning representation spaces, and using Transformers (Li et al., 2019b; Tan and Bansal, 2019). However, generalizability remains an issue when operating on unobserved data: it is hard for deep learning models to capture high-order patterns of reasoning, which are essential for generalizability.
There are several challenging research directions for learning representations of cross-modality data and enabling reasoning for target tasks: first, aligning the representation spaces of multiple modalities; second, designing architectures that capture high-order relations for generalizable reasoning; third, using pre-trained modules to make the most of minimal data.
An orthogonal direction to the above-mentioned aspects of learning is finding relevance between the components and the structure of various modalities when working with multi-modal data. Most previous language and visual reasoning models try to capture this relevance by learning representations based on an attention mechanism. Finding relevance, known as matching, is a fundamental task in information retrieval (IR) (Mitra et al., 2017). Benefiting from matching, Transformer models gain an excellent ability to index, retrieve, and combine features of underlying instances through a matching score (Vaswani et al., 2017), which leads to state-of-the-art performance on various tasks (Devlin et al., 2019). However, the matching in the attention mechanism is used only to learn a set of weights highlighting the importance of various components. In our proposed model, we learn representations directly based on the relevance score, inspired by ideas from IR models. In contrast to the attention mechanism and Transformer models, we claim that the relevance patterns themselves are just as important. With proper alignment of the representation spaces of different input modalities, matching can be applied to those spaces. The idea of learning relevance patterns is similar to Siamese networks (Koch et al., 2015), which learn transferable patterns of similarity between two image representations for one-shot image recognition. A similarity metric between two modalities has also been shown to be helpful for aligning multiple modality spaces (Frome et al., 2013).
The contributions of this work are as follows: 1) We propose a cross-modality relevance (CMR) framework that considers entity relevance and high-order relational relevance between two modalities with an alignment of their representation spaces. The model can be trained end-to-end with customizable target tasks. 2) We evaluate the methods and analyze the results on both the VQA and NLVR tasks, using the VQA v2.0 and NLVR² datasets respectively. We improve on the published state-of-the-art results for both tasks. Our analysis shows the significance of relevance patterns for reasoning, and the CMR model trained on NLVR² boosts the training efficiency of the VQA task.

Related Work
Language and Vision Tasks. Learning and decision making based on natural language and visual information has attracted the attention of many researchers because it exposes many interesting research challenges to the AI community. Among many other efforts (Lin et al., 2014; Krishna et al., 2017), Antol et al. proposed the VQA challenge, which contains open-ended questions about images that require understanding of, and reasoning about, language and visual components. Suhr et al. proposed the NLVR task, which asks models to determine whether a sentence is true with respect to an image.
Attention Based Representation. Transformers are stacked self-attention models for general-purpose sequence representation (Vaswani et al., 2017). They have achieved extraordinary success in natural language processing, not only through better results but also through efficiency due to their parallel computation. Self-attention is a mechanism that reshapes the representations of components based on relevance scores, and it has been shown to be effective in generating contextualized representations for textual entities. More importantly, there are several efforts to pre-train large Transformers on large-scale corpora (Devlin et al., 2019; Yang et al., 2019; Radford et al., 2019) over multiple popular tasks, enabling their reuse for other tasks with small corpora. Researchers have also extended Transformers to handle both textual and visual modalities (Li et al., 2019b; Sun et al., 2019; Tan and Bansal, 2019; Su et al., 2020; Tsai et al., 2019), and sophisticated pre-training strategies have been introduced to boost performance (Tan and Bansal, 2019). However, as mentioned above, modeling relations between components remains a challenge for approaches that only reshape the entity representation space, while the relevance score can be more expressive for these relations. In our CMR framework, we model high-order relations in the relevance representation space rather than in the entity representation space.
Matching Models. Matching is a fundamental task in information retrieval (IR). There are IR models that focus on global representation matching (Huang et al., 2013; Shen et al., 2014), local component (a.k.a. term) matching (Pang et al., 2016), and hybrid methods (Mitra et al., 2017). Our relevance framework is partially inspired by local component matching, which we apply here to model the relevance of the components of the model's inputs. However, our work differs in several significant ways. First, we work in a cross-modality setting. Second, we extend the relevance to a high order, i.e., we model the relevance of entity relations. Third, our framework can work with different target tasks, and we show that the parameters trained on one task can boost the training of another.

Cross-Modality Relevance
Cross-Modality Relevance (CMR) aims to establish a framework for general-purpose relevance in various tasks. As an end-to-end model, it encodes the relevance between the components of input modalities under task-specific supervision. We further add a high-order relevance between relations that occur in each modality. Figure 1 shows the proposed architecture.

Figure 1: The Cross-Modality Relevance model is composed of single-modality Transformers, a cross-modality Transformer, entity relevance, and high-order relational relevance, followed by a task-specific classifier.

We first encode data from different modalities with single-modality Transformers and align the encoding spaces with a cross-modality Transformer. We consistently refer to the words in text and the objects in images (i.e., bounding boxes) as "entities", and to their representations as "entity representations". We use the relevance between the components of the two modalities to model the relation between them. The relevance includes the relevance between their entities, shown as "Entity Relevance", and the high-order relevance between their relations, shown as "Relational Relevance". We learn representations of the affinity matrix of relevance scores with convolutional layers and fully connected layers. Finally, we predict the output by a non-linear mapping over all the relevance representations. This architecture can help solve tasks that require reasoning over two modalities based on their relevance. We argue that the parameters trained on one task can boost the training of other tasks that deal with multi-modality reasoning.
In this section, we first formulate the problem. Then we describe our cross-modality relevance (CMR) model for solving the problem. The architecture, loss function, and training procedure of CMR are explained in detail. We will use the VQA and NLVR tasks as showcases.

Problem Formulation
Formally, the problem is to model a mapping from a cross-modality data sample D = {D^μ} to an output y in a target task, where μ denotes the type of modality and D^μ = {d^μ_1, ..., d^μ_{N_μ}} is the set of entities in modality μ. In visual question answering (VQA), the task is to predict an answer given two modalities: a textual question (D^t) and a visual image (D^v). In NLVR, given a textual statement (D^t) and an image (D^v), the task is to determine the correctness of the statement.

Representation Spaces Alignment
Single Modality Representations. For the textual modality D^t, we utilize BERT (Devlin et al., 2019), as shown in the bottom-left part of Figure 1, which is a multi-layer Transformer (Vaswani et al., 2017) with three different inputs: WordPiece embeddings (Wu et al., 2016), segment embeddings, and position embeddings. We refer to all the words as the entities of the textual modality and use the BERT representations as the textual single-modality representations s^t_1, ..., s^t_{N_t}, assuming N_t words as textual entities.
For the visual modality D^v, as shown in the top-left part of Figure 1, Faster R-CNN (Ren et al., 2015) is used to generate regions of interest (ROIs), extract dense encoding representations of the ROIs, and predict the probability of each ROI. We refer to the ROIs on images as the visual entities and consider a fixed number, N_v, of visual entities with the highest probabilities predicted by Faster R-CNN each time. The dense representation of each ROI is a local latent representation, a 2048-dimensional vector (Ren et al., 2015). To enrich the visual entity representation with visual context, we further project the vectors with feed-forward layers and encode them with a single-modality Transformer, as shown in the second column of Figure 1. The visual Transformer takes the dense representation, segment embedding, and pixel position embedding (Tan and Bansal, 2019) as input and generates the visual single-modality representations. When there are multiple images, as in NLVR² where each example has two images, each image is encoded by the same procedure and we keep N_v visual entities per image. We refer to this as different sources of the same modality throughout the paper. We restrict all the single-modality representations to be vectors of the same dimension d. However, these original representation spaces still need to be aligned.
Cross-Modality Alignment. To align the single-modality representations in a unified representation space, we introduce a cross-modality Transformer, as shown in the third column of Figure 1. All the entities are treated uniformly in the cross-modality Transformer. Given the set of entity representations from all modalities, we define the matrix S whose columns are the elements of this set. Each cross-modality self-attention is computed as follows (Vaswani et al., 2017):

SelfAtt(S) = S softmax(S^T S / √d),    (1)

where in our case the key K, query Q, and value V are all the same tensor S, and softmax(·) normalizes along the columns.
A cross-modality Transformer layer consists of a cross-modality self-attention representation followed by a residual connection with normalization from the input representation, a feed-forward layer, and another residual connection with normalization. We stack several cross-modality Transformer layers to get a uniform representation over all modalities, and refer to the resulting unified representations as the entity representations. Although the representations are still organized by their original modality per entity, they carry information from interactions with the other modality and are aligned in a uniform representation space. The entity representations, shown in the fourth column of Figure 1, alleviate the gap between representations from different modalities, as we will show in the ablation studies, and allow them to be matched in the following steps.
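As a minimal numpy sketch of the self-attention step in Equation 1: the joint entity matrix S holds one column per entity from every modality, and the softmax normalizes along columns. The learned query/key/value projections, layer normalization, and residual connections of a full Transformer layer are omitted here.

```python
import numpy as np

def softmax_cols(x):
    """Softmax normalized along the columns, as in Equation 1."""
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def cross_modality_self_attention(S):
    """One self-attention step over the joint entity matrix S (d x N),
    whose columns are the representations of every entity from every
    modality; key, query, and value are all the same tensor S."""
    d = S.shape[0]
    weights = softmax_cols(S.T @ S / np.sqrt(d))  # (N, N); columns sum to 1
    return S @ weights                            # (d, N) updated entities

# Toy example: 4 textual + 3 visual entities with d = 8.
rng = np.random.default_rng(0)
S = np.hstack([rng.normal(size=(8, 4)), rng.normal(size=(8, 3))])
out = cross_modality_self_attention(S)
```

Each output column is a convex combination of the input entity columns, so textual entities can absorb information from visual entities and vice versa, which is the alignment effect the cross-modality Transformer relies on.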

Entity Relevance
Relevance plays a critical role in reasoning, which is required in many tasks such as information retrieval, question answering, and intra- and inter-modality reasoning. Relevance patterns are independent of the input representation space and can generalize better to unobserved data. To consider the entity relevance between two modalities D^μ and D^ν, the entity relevance representation is calculated as shown in Figure 1. Given the entity representation matrices S^μ = [s^μ_1, ..., s^μ_{N_μ}] ∈ R^{d×N_μ} and S^ν = [s^ν_1, ..., s^ν_{N_ν}] ∈ R^{d×N_ν}, the relevance representation is calculated as

A^{μ,ν} = (S^μ)^T S^ν,    (2a)
M(D^μ, D^ν) = CNN^{μ,ν}(A^{μ,ν}),    (2b)

where A^{μ,ν} is the affinity matrix of the two modalities, as shown on the right side of Figure 1; A^{μ,ν}_{ij} is the relevance score of the ith entity in D^μ and the jth entity in D^ν. CNN^{μ,ν}(·), corresponding to the sixth column of Figure 1, contains several convolutional layers and fully connected layers; each convolutional layer is followed by a max-pooling layer, and the fully connected layers map the flattened feature maps to a d-dimensional vector. We refer to Φ_{D^μ,D^ν} = M(D^μ, D^ν) as the entity relevance representation between μ and ν. We compute the relevance between different modalities; when there are multiple images in the visual modality, we calculate the relevance representation between them as well. In particular, for the VQA dataset, the above setting results in one entity relevance representation: a textual-visual entity relevance Φ_{D^t,D^v}. For the NLVR² dataset, there are three entity relevance representations: two textual-visual entity relevances Φ_{D^t,D^{v1}} and Φ_{D^t,D^{v2}}, and a visual-visual entity relevance Φ_{D^{v1},D^{v2}} between the two images. The entity relevance representations are flattened and joined with other features in the next layer of the network.
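The two steps above can be sketched in numpy. The dot-product affinity and the fixed random projection standing in for the learned CNN^{μ,ν} are both illustrative assumptions; the paper learns convolutional, max-pooling, and fully connected layers over the affinity matrix.

```python
import numpy as np

def entity_affinity(S_mu, S_nu):
    """Affinity matrix A of pairwise relevance scores; A[i, j] scores
    entity i of modality mu against entity j of modality nu. A plain
    dot product is one simple scoring choice (an assumption here)."""
    return S_mu.T @ S_nu                            # (N_mu, N_nu)

def relevance_representation(A, d=8, seed=1):
    """Stand-in for CNN^{mu,nu}: a fixed random projection of the
    flattened affinity matrix illustrates the mapping from an
    affinity matrix to a d-dimensional relevance vector."""
    rng = np.random.default_rng(seed)
    flat = A.flatten()
    return rng.normal(size=(d, flat.size)) @ flat   # (d,)

rng = np.random.default_rng(0)
S_t = rng.normal(size=(8, 5))    # 5 textual entities, d = 8
S_v = rng.normal(size=(8, 3))    # 3 visual entities
A = entity_affinity(S_t, S_v)    # (5, 3) affinity matrix
phi = relevance_representation(A)
```

The key design choice is that the downstream classifier sees only a learned function of the affinity matrix, not the entity representations themselves, which is what makes the relevance patterns transferable across tasks.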

Relational Relevance
We also consider relevance beyond entities, that is, the relevance of the entities' relations. This extension allows our CMR to capture higher-order relevance patterns. We consider pair-wise non-directional relations between entities in each modality and calculate the relevance of the relations across modalities. The procedure is similar to entity relevance, as shown in Figure 1.

Figure 2: Ranking of relation candidates. Candidate pairs in each modality are scored by an intra-modality relevance score and an inter-modality importance derived from the entity relevance affinity matrix; the top-K ranked relations are kept. The visual relations follow the same procedure as the textual relations.
We denote the relational representation as a non-linear mapping R^{2d} → R^d, modeled by fully connected layers, from the concatenation of the representations of the two entities in a relation: r^μ_{(i,j)} = MLP^{μ,1}([s^μ_i; s^μ_j]) ∈ R^d. A relational relevance affinity matrix can be calculated by matching the relational representations r^μ_{(i,j)}, i ≠ j, from the different modalities. However, there are C(N_μ, 2) possible pairs in each modality D^μ, most of which are irrelevant. The relational relevance representations would be sparse because of the irrelevant pairs on both sides, and computing the relevance scores of all possible pairs would introduce a large number of unnecessary parameters, making training more difficult.
We propose to rank the relation candidates (i.e., pairs) by an intra-modality relevance score and an inter-modality importance, and then compare the top-K ranked relation candidates between the two modalities, as shown in Figure 2. For the intra-modality relevance score, shown in the bottom-left part of the figure, we estimate a normalized score based on the relational representation with a softmax layer.
To evaluate the inter-modality importance of a relation candidate, which is a pair of entities in the same modality, we first compute the relevance of each entity in the text with respect to the visual objects. As shown in Figure 2, we derive a vector holding, for each word, the score of its most relevant visual object; we denote this importance vector as v^t. This helps focus on words that are grounded in the visual modality. We use the same procedure to compute the most relevant words for each visual object.
We then calculate the relation-candidate importance matrix V^μ as an outer product, ⊗, of the importance vector with itself:

V^μ = v^μ ⊗ v^μ,  i.e.,  V^μ_{ij} = v^μ_i v^μ_j,    (3)

where v^μ_i is the ith scalar element of v^μ, corresponding to the ith entity, and the importance vector is derived from the affinity matrix A^{μ,ν} calculated by Equation 2a.
Notice that the inter-modality importance matrix V^μ is symmetric. The upper triangular part of V^μ, excluding the diagonal, indicates the importance of the corresponding elements (with the same indices) in the intra-modality relevance scores U^μ. The ranking score for a candidate is the combination (here the product) of the two scores: W^μ_{(i,j)} = U^μ_{(i,j)} × V^μ_{ij}. We select the set of top-K ranked candidate relations K^μ = {κ_1, κ_2, ..., κ_K} and reorganize the representations of the top-K relations as R^μ = [r^μ_{κ_1}, ..., r^μ_{κ_K}] ∈ R^{d×K}. The relational relevance representation between K^μ and K^ν is then calculated in the same way as the entity relevance representations shown in Figure 1:

Φ_{K^μ,K^ν} = M(K^μ, K^ν) = CNN^{K^μ,K^ν}((R^μ)^T R^ν),    (5)

where M(K^μ, K^ν) has its own parameters and yields a d-dimensional feature Φ_{K^μ,K^ν}.
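The candidate-ranking step can be sketched as follows. The learned MLP relational representations and the softmax intra-modality scores are replaced here by simple fixed surrogates, so this is an illustration of the ranking logic (inter-modality importance via an outer product, combined with an intra-modality score, then top-K selection), not the trained model.

```python
import numpy as np

def rank_relations(S, A_cross, K=3):
    """Select the top-K entity pairs (i, j), i < j, of one modality.
    S: (d, N) entity representations of this modality.
    A_cross: (N, M) affinity of these N entities against the other
    modality's M entities (Equation 2a)."""
    N = S.shape[1]
    # Inter-modality importance of each entity: its best match score
    # in the other modality; pairs combined by an outer product (Eq. 3).
    v = A_cross.max(axis=1)                  # (N,) importance vector
    V = np.outer(v, v)                       # V[i, j] = v_i * v_j
    # Surrogate intra-modality relevance score for pair (i, j);
    # the paper learns this with an MLP followed by a softmax.
    U = np.abs(S.T @ S)                      # (N, N)
    W = U * V                                # combined ranking score
    iu, ju = np.triu_indices(N, k=1)         # upper triangle: i < j
    order = np.argsort(W[iu, ju])[::-1][:K]  # indices of top-K scores
    return [(int(iu[o]), int(ju[o])) for o in order]

rng = np.random.default_rng(0)
pairs = rank_relations(rng.normal(size=(8, 6)), rng.normal(size=(6, 4)), K=3)
```

Restricting the relational relevance to these K² pair combinations is what keeps the higher-order affinity matrix dense and the parameter count manageable.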
In particular, for the VQA task, the above setting results in one relational relevance representation: a textual-visual relational relevance M(K^t, K^v). For the NLVR task, there are three relational relevance representations: two textual-visual relational relevances M(K^t, K^{v1}) and M(K^t, K^{v2}), and a visual-visual relational relevance M(K^{v1}, K^{v2}) between the two images. The relational relevance representations are flattened and joined with other features in the next layers of the network.

Training
End-to-end Training. CMR can be considered an end-to-end relevance representation extractor. We predict the output y of a specific task from the final feature Φ with a differentiable regression or classification function. The gradient of the loss is backpropagated to all components of CMR to penalize the prediction and adjust the parameters. We freeze the parameters of the basic feature extractors, namely BERT for the textual modality and Faster R-CNN for the visual modality. The following parts are updated by gradient descent: the single-modality Transformers (except BERT), the cross-modality Transformer, CNN^{D^μ,D^ν}(·), CNN^{K^μ,K^ν}(·), MLP^{μ,1}(·), and MLP^{μ,2}(·) for all modalities and modality pairs, and the task-specific classifier MLP^Φ(Φ).
The VQA task can be formulated as multi-class classification that chooses a word to answer the question; we apply a softmax classifier on Φ and penalize with the cross-entropy loss. For the NLVR² dataset, the task is binary classification: determining whether the statement is correct with respect to the images. We apply logistic regression on Φ and penalize with the cross-entropy loss.
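The two task heads can be sketched in numpy as below. The linear classifier weights W and w are illustrative stand-ins for the MLP^Φ classifier described above.

```python
import numpy as np

def vqa_loss(phi, W, y):
    """Softmax cross-entropy for the multi-class VQA head: logits from
    a linear layer on the final relevance feature phi, penalized
    against the ground-truth answer index y."""
    logits = W @ phi
    logits = logits - logits.max()                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[y]

def nlvr_loss(phi, w, y):
    """Binary cross-entropy for the NLVR2 head: logistic regression
    on phi; y is 0 (false statement) or 1 (true statement)."""
    p = 1.0 / (1.0 + np.exp(-(w @ phi)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
phi = rng.normal(size=16)                           # toy final feature
l_vqa = vqa_loss(phi, rng.normal(size=(10, 16)), y=3)
l_nlvr = nlvr_loss(phi, rng.normal(size=16), y=1)
```

Both heads consume the same feature Φ, which is why only this final classifier needs re-initialization when transferring parameters between the two tasks.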
Pre-training Strategy. To leverage pre-trained parameters for our cross-modality Transformer and relevance representations, we use the following training settings. For all tasks, we freeze the parameters of BERT and Faster R-CNN. We use pre-trained parameters for the visual single-modality Transformer as proposed by Tan and Bansal (2019) and allow them to be fine-tuned in the following procedure. We then randomly initialize and train all remaining parameters of the model on NLVR with the NLVR² dataset. After that, we keep and fine-tune all the parameters on the VQA task with the VQA v2.0 dataset (see the data description in Section 4.1). In this way, the parameters of the cross-modality Transformer and relevance representations, pre-trained on the NLVR² dataset, are reused and fine-tuned on the VQA dataset; only the final task-specific classifier over Φ is initialized randomly. The pre-trained cross-modality Transformer and relevance representations help the VQA model converge faster and achieve performance competitive with the state-of-the-art results.

Data Description
NLVR² (Suhr et al., 2018) is a dataset that targets joint reasoning about natural language descriptions and related images. Given a textual statement and a pair of images, the task is to indicate whether the statement correctly describes the two images. NLVR² contains 107,292 examples of sentences paired with images and is designed to emphasize semantic diversity, compositionality, and visual reasoning challenges.

Implementation Details
We implemented CMR using PyTorch. We use 768-dimensional single-modality representations. For the textual modality, the pre-trained BERT "base" model (Devlin et al., 2019) generates the single-modality representations. For the visual modality, we use the Faster R-CNN pre-trained by Anderson et al., followed by a five-layer Transformer. Parameters in BERT and Faster R-CNN are fixed. For each example, we keep 20 words as textual entities and 36 ROIs per image as visual entities. For the relational relevance, the top-10 ranked pairs are used. Each relevance CNN, CNN^{D^μ,D^ν}(·) and CNN^{K^μ,K^ν}(·), uses two convolutional layers, each followed by max-pooling, and fully connected layers. The relational representations and their intra-modality relevance scores, MLP^{μ,1}(·) and MLP^{μ,2}(·), each use one hidden layer. The task-specific classifier MLP^Φ(Φ) contains three hidden layers. The model is optimized with Adam (α = 10⁻⁴, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁶), a weight decay of 0.01, a max gradient norm of 1.0, and a batch size of 32.

Baseline Description
VisualBERT (Li et al., 2019b) is an end-to-end Transformer-based model for language and vision tasks.

Results
NLVR²: The results on the NLVR task are listed in Table 1. Transformer-based models (VisualBERT, LXMERT, and CMR) outperform the other models (N2NMN (Hu et al., 2017), MAC (Hudson and Manning, 2018), and FiLM (Perez et al., 2018)) by a large margin. This is due to the strong pre-trained single-modality representations and the Transformers' ability to reshape the representations to align the spaces. Furthermore, CMR shows the best performance among all Transformer-based baselines and achieves the state-of-the-art. VisualBERT and CMR use a similar cross-modality alignment approach, and CMR outperforms VisualBERT by 12.4%. The gain mainly comes from the entity relevance and relational relevance that model the relations.
VQA v2.0: In Table 2, we compare with published models, excluding ensembles. The most competitive models are based on Transformers (ViLBERT (Lu et al., 2019), VisualBERT (Li et al., 2019b), VL-BERT (Su et al., 2020), LXMERT (Tan and Bansal, 2019), and CMR). BUTD, ReGAT (Li et al., 2019a), and BAN (Kim et al., 2018) achieve the best performance on Number questions. This is because Number questions require the ability to count within one modality, while CMR focuses on modeling relations between modalities; performance on counting might be improved by explicit modeling of quantity representations. CMR also achieves the best overall accuracy. In particular, we see a 2.3% improvement over VisualBERT (Li et al., 2019b), consistent with the NLVR² results above. This shows the significance of the entity and relational relevance.
Another observation is that if we train CMR for the VQA task from scratch with random initialization (still using the fixed BERT and Faster R-CNN), the model converges after 20 epochs, whereas initializing the parameters with the model trained on NLVR² takes only 6 epochs to converge. This significant improvement in convergence speed indicates that the optimal model for VQA is close to that for NLVR.

Model Size
To investigate the influence of model size, we empirically evaluated CMR on NLVR² with various Transformer sizes, which account for most of the model's parameters. All other details are kept the same as in Section 4.2. The textual Transformer remains at 12 layers because it is the pre-trained BERT. Our model contains 285M parameters, of which around 230M belong to the pre-trained BERT and the Transformers. Table 3 shows the results. Increasing the number of layers in the visual Transformer and the cross-modality Transformer tends to improve accuracy; however, performance plateaus beyond five layers. We therefore use five layers for the visual Transformer and the cross-modality Transformer in the other experiments.

Ablation Studies
To better understand the influence of each part of CMR, we perform ablation studies. Table 4 shows the performance of four variations on NLVR².
Effect of Single-Modality Transformers. We remove both the textual and visual single-modality Transformers and instead map the raw input to the d-dimensional space with a linear transformation. Note that the raw input of the textual modality is the WordPiece (Wu et al., 2016) embeddings, segment embeddings, and position embeddings of each word, while that of the visual modality is the 2048-dimensional dense representation of each ROI extracted by Faster R-CNN. Removing the single-modality Transformers decreases accuracy by 9.0%; they play a critical role in producing strong contextualized representations for each modality.
Effect of Cross-Modality Transformer. We remove the cross-modality Transformer and use the single-modality representations as entity representations. As shown in Table 4, the model degrades dramatically, with accuracy decreasing by 16.2%. This large gap demonstrates the crucial contribution of the cross-modality Transformer to aligning the representation spaces of the input modalities.
Effect of Entity Relevance. We remove the entity relevance representation Φ_{D^μ,D^ν} from the final feature Φ. As shown in Table 4, test accuracy drops by 5.4%, a significant performance difference among Transformer-based models (Li et al., 2019b; Lu et al., 2019; Tan and Bansal, 2019).
Figure 3: The entity affinity matrix between the textual (rows) and visual (columns) modalities for the sentence "The bird on the branch is looking to left". Darker color indicates a higher relevance score. The ROI with the maximum relevance score for each word is shown paired with the word.
Figure 4: The relation ranking scores of two example sentences. Darker color indicates a higher ranking score.
To highlight the significance of entity relevance, we visualize an example affinity matrix in Figure 3. The two major entities, "bird" and "branch", are matched perfectly. More interestingly, the three ROIs matching the phrase "looking to left" capture an indicator (the beak), a direction (left), and the semantics of the whole phrase.
Effect of Relational Relevance. We remove the relational relevance representation Φ_{K^μ,K^ν} from the final feature Φ. A 2.5% decrease in test accuracy is observed in Table 4. We argue that, by modeling relational relevance, CMR captures high-order relations that are not captured by entity relevance. We present two examples of textual relation ranking scores in Figure 4. The learned ranking scores highlight the important pairs, for example "gold"–"top" and "looking"–"left", which describe the important relations in the textual modality.

Conclusion
In this paper, we propose a novel cross-modality relevance (CMR) framework for language and vision reasoning. In particular, we argue for the significance of relevance between the components of the two modalities for reasoning, including both entity relevance and relational relevance. We propose an end-to-end cross-modality relevance framework tailored for language and vision reasoning and evaluate it on the NLVR and VQA tasks. Our approach exceeds the state-of-the-art on the NLVR² and VQA v2.0 datasets. Moreover, the model trained on NLVR² boosts training on the VQA v2.0 dataset. The experiments and empirical analysis demonstrate CMR's capability of modeling relational relevance for reasoning and, consequently, its better generalizability to unobserved data. This result indicates the significance of relevance patterns. Our proposed architectural component for capturing relevance patterns can be used independently of the full CMR architecture and is potentially applicable to other multi-modal tasks.