Cascaded Mutual Modulation for Visual Reasoning

Visual reasoning is a special visual question answering problem that is multi-step and compositional by nature, and that also requires intensive text-vision interactions. We propose CMM: Cascaded Mutual Modulation, a novel end-to-end visual reasoning model. CMM includes a multi-step comprehension process for both the question and the image. In each step, we use a Feature-wise Linear Modulation (FiLM) technique to enable the textual and visual pipelines to mutually control each other. Experiments show that CMM significantly outperforms most related models and reaches state-of-the-art results on two visual reasoning benchmarks, CLEVR and NLVR, covering both synthetic and natural languages. Ablation studies confirm the effectiveness of CMM in comprehending natural-language logic under the guidance of images. Our code is available at https://github.com/FlamingHorizon/CMM-VR.


Introduction
It is a challenging task in artificial intelligence to perform reasoning over both textual and visual inputs. The visual reasoning task is designed for research in this field. It is a special visual question answering (VQA) (Antol et al., 2015) problem, requiring a model to infer the relations between entities in both the image and the text, and to generate a correct textual answer to the question. Unlike other VQA tasks, questions in visual reasoning often contain extensive logical phenomena, and refer to multiple entities, specific attributes, and complex relations. Visual reasoning datasets such as CLEVR (Johnson et al., 2017a) and NLVR (Suhr et al., 2017) are built on unbiased, synthetic images, with either complex synthetic questions or natural-language descriptions, facilitating in-depth analyses of the reasoning ability itself.

Figure 1: Connections and differences between previous "program-generating" works and our model: other models generate/control multi-step image-comprehension processes with a single question representation, while we put more attention on language logics and let multi-modal information modulate each other in each step. The question and image are a visual-reasoning example from the CLEVR dataset.
Most previous visual reasoning models focus on using the question to guide the multi-step computation on visual features (which can be viewed as an image-comprehension "program"). Neural Module Networks (NMN) (Andreas et al., 2016a,b) and Program Generator + Execution Engine (PG+EE) (Johnson et al., 2017b) learn to combine specific image-processing modules, guided by question semantics. Feature-modulating methods like FiLM (De Vries et al., 2017; Perez et al., 2018) control the image-comprehension process using modulation parameters generated from the question, allowing models to be trained end-to-end.

However, the image-comprehension program in visual reasoning tasks can be extremely long and sophisticated. Using a single question representation to generate or control the whole image-comprehension process raises difficulties in learning. Moreover, since information comes from multiple modalities, it is not intuitive to assume that one (language) is the "program generator" and the other (image) is the "executor". One way to avoid this assumption is to perform multiple steps of reasoning, with each modality acting as generator and executor alternately in each step. For these two reasons, we propose Cascaded Mutual Modulation (CMM, Figure 1), a novel visual reasoning model that solves the problem that previous "program-generating" models lack a method to use visual features to guide multi-step reasoning on language logics. CMM reaches state-of-the-art results on two benchmarks: CLEVR (complex synthetic questions) and NLVR (natural language).

Perez et al. (2018) proposed FiLM as an end-to-end feature-modulating method. The original ResBlock+GRU+FiLM structure uses a single question representation and conditions all image-modulation parameters on it, without sufficiently handling multi-step language logics. In contrast, we modulate both image and language features alternately in each step, and condition the modulation parameters on the representations from the previous step.
We design an image-guided language attention pipeline and use it in combination with FiLM in our CMM framework, and significantly outperform the original structure.

Related Work
Other widely-cited works on CLEVR/NLVR include Stacked Attention Networks (SAN) (Yang et al., 2016), NMN (Andreas et al., 2016b), N2NMN (Hu et al., 2017), PG+EE (Johnson et al., 2017b), and Relation Networks (RN) (Santoro et al., 2017). The recent CAN model (Hudson and Manning, 2018) also uses multiple question representations and performs strongly on CLEVR. However, its representations are not modulated by the visual part as in our model.
In other VQA tasks, DAN (Nam et al., 2017) is the only multi-step dual framework related to ours. For comparison, at every time step DAN computes textual and visual attention in parallel from the same key vector, whereas we perform textual attention and visual modulation (instead of attention) in a cascaded manner.

Model
We review and extend FiLM in Sections 3.1-3.2, and introduce the CMM model in Sections 3.3-3.4.


Visual Modulation

Perez et al. (2018) proposed Feature-wise Linear Modulation (FiLM), an affine transformation on the intermediate outputs of a neural network (v stands for visual):

\hat{F}^v_{i,c} = \gamma_{i,c} \cdot F^v_{i,c} + \beta_{i,c},     (1)

where F^v_{i,c} is the c-th feature map (C in total) generated by a Convolutional Neural Network (CNN) in the i-th image-comprehension step. The modulation parameters \gamma_{i,c} and \beta_{i,c} can be conditioned on any other part of the network (in their work, the single question representation q). If the output tensor V_i of a CNN block is of size C \times H \times W, then F^v_{i,c} is a single slice of size 1 \times H \times W; H and W are the height and width of each feature map.
Unlike Perez et al. (2018), in each step i we compute a new question vector q_i. The modulation parameters \gamma_i and \beta_i (C \times 1 vectors, \gamma_i = [\gamma_{i,1}, ..., \gamma_{i,C}], etc.) are conditioned on the previous question vector q_{i-1} instead of a single q:

\gamma_i = \mathrm{MLP}^{\gamma}_i(q_{i-1}), \quad \beta_i = \mathrm{MLP}^{\beta}_i(q_{i-1}),     (2)

where MLP stands for fully connected layers with linear activations. The weights and biases are not shared among the steps.
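As a concrete illustration, the step-wise visual modulation can be sketched in PyTorch as follows. This is a minimal sketch under our own naming, not the released implementation; in particular, we assume single linear layers for the modulation-parameter MLPs:

```python
import torch
import torch.nn as nn

class VisualFiLM(nn.Module):
    """One step of visual FiLM: gamma_i, beta_i are produced from the
    previous question vector q_{i-1} and applied per feature map."""

    def __init__(self, q_dim, num_maps):
        super().__init__()
        # Linear (unshared per step) projections from q_{i-1} to C parameters each.
        self.to_gamma = nn.Linear(q_dim, num_maps)
        self.to_beta = nn.Linear(q_dim, num_maps)

    def forward(self, feats, q_prev):
        # feats: (B, C, H, W) visual features; q_prev: (B, q_dim).
        gamma = self.to_gamma(q_prev)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.to_beta(q_prev)[:, :, None, None]    # (B, C, 1, 1)
        # Broadcast the affine transform over every spatial position.
        return gamma * feats + beta
```

One such module would be instantiated per step, matching the unshared weights described above.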

Language Modulation
In each step i, we also apply FiLM to modulate every language "feature map". If the full question representation is a D \times T matrix, a question "feature map" f^l_{i,d} is defined as a 1 \times T vector gathering the T features along a single hidden dimension. D is the hidden-state dimension of the language encoder, and T is a fixed maximum length of the word sequences. The modulation is

\hat{f}^l_{i,d} = \gamma^l_{i,d} \cdot f^l_{i,d} + \beta^l_{i,d},     (3)

where l stands for language. The concatenated modulation parameters \gamma^l_i and \beta^l_i (D \times 1) are conditioned on the visual features V_i computed in the same step:

[\gamma^l_i ; \beta^l_i] = g_m(V_i),     (4)

where g_m (Section 3.4) is an interaction function that converts the 3-d visual features into language modulation parameters. The weights in g_m are shared among all N steps.
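Viewed as code, the language-side modulation is simply a per-dimension scale and shift of the D × T question features. A minimal sketch (the function name is ours), assuming gamma_l and beta_l have been produced by g_m(V_i):

```python
import torch

def language_film(h, gamma_l, beta_l):
    """Modulate language features with FiLM along the hidden dimension.

    h:       (B, D, T) question features; each row is a 1 x T "feature map".
    gamma_l: (B, D) scale parameters from g_m(V_i).
    beta_l:  (B, D) shift parameters from g_m(V_i).
    """
    # Broadcast each (gamma, beta) pair over the T time positions of its row.
    return gamma_l[:, :, None] * h + beta_l[:, :, None]
```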

Cascaded Mutual Modulation
The whole pipeline of our model is built up from multiple steps. In each step i (N in total), the previous question vector q_{i-1} and the visual features V_{i-1} are taken as input; q_i and V_i are computed as output. Preprocessed questions and images are encoded by the language and visual encoders to form q_0 and V_0.
In each step, we cascade a FiLM-ed ResBlock with a modulated textual attention. We feed V_{i-1} into the ResBlock, modulated by parameters from q_{i-1}, to compute V_i, and then control the textual attention process with modulation parameters from V_i to compute the new question vector q_i (Figure 2, middle).
Each ResBlock contains a 1 × 1 convolution, a 3 × 3 convolution, and a batch-normalization (Ioffe and Szegedy, 2015) layer before FiLM modulation, followed by a residual connection (Figure 2, right; we keep the same ResBlock structure as Perez et al. (2018)). To be consistent with (Johnson et al., 2017b; Perez et al., 2018), we concatenate the input visual features V_{i-1} of each ResBlock i with two "coordinate feature maps" scaled from -1 to 1, to enrich the representation of spatial relations. All CNNs in our model use ReLU activation functions; batch normalization is applied before ReLU.
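A sketch of one FiLM-ed ResBlock with coordinate maps, assuming the layout described above (coordinates concatenated to the input, 1 × 1 conv, 3 × 3 conv, BN before FiLM, residual connection). Details such as exactly where the residual branches off are our assumption, not taken from the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def coord_maps(h, w):
    # Two feature maps holding y/x coordinates, each scaled to [-1, 1].
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1).expand(1, 1, h, w)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w).expand(1, 1, h, w)
    return torch.cat([ys, xs], dim=1)  # (1, 2, h, w)

class FiLMedResBlock(nn.Module):
    def __init__(self, in_ch, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch + 2, ch, 1)      # 1x1 conv after coord concat
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)  # 3x3 conv, padded to keep H x W
        self.bn = nn.BatchNorm2d(ch, affine=False)    # BN before FiLM modulation

    def forward(self, v, gamma, beta):
        # v: (B, C_in, H, W); gamma, beta: (B, ch) from the question vector.
        b, _, h, w = v.shape
        x = torch.cat([v, coord_maps(h, w).expand(b, -1, -1, -1)], dim=1)
        x = F.relu(self.conv1(x))
        residual = x
        x = self.bn(self.conv2(x))
        # FiLM, then ReLU, then the residual connection.
        x = F.relu(gamma[:, :, None, None] * x + beta[:, :, None, None])
        return x + residual
```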
After the ResBlock pipeline, we apply language modulation to the full language features {h_1, ..., h_T} (D \times T) conditioned on V_i, and rewrite them along the time dimension, yielding

e_{i,t} = \gamma^l_i \odot h_t + \beta^l_i,     (5)

then compute visual-guided attention weights

\alpha_{i,t} = \mathrm{softmax}_t(W^{att}_i e_{i,t} + b^{att}_i),     (6)

and take a weighted summation over time:

q_i = \sum_{t=1}^{T} \alpha_{i,t} h_t.     (7)

In equation (6), W^{att}_i \in R^{1 \times D} and b^{att}_i \in R^{1 \times 1} are network weights and bias; h_t is the t-th language state vector (D \times 1), computed by a bi-directional GRU (Chung et al., 2014) from the word embeddings {w_1, ..., w_T}. In each step i, the language pipeline does not re-compute h_t, but re-modulates it as e_{i,t} instead (Figure 2, top).
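The modulate-score-sum procedure above can be sketched as follows. This is an illustrative sketch: we represent W^att_i as a length-D vector and assume the weighted sum runs over the original states h_t, with the modulated features e_{i,t} used only for scoring, as the text describes:

```python
import torch

def visual_guided_attention(e, h, w_att, b_att):
    """Compute the new question vector q_i from modulated language features.

    e:     (B, D, T) visually modulated language features e_{i,t}.
    h:     (B, D, T) original bi-GRU states h_t (not re-computed per step).
    w_att: (D,) attention weights W^att_i; b_att: scalar bias b^att_i.
    """
    scores = torch.einsum('d,bdt->bt', w_att, e) + b_att  # W^att_i e_{i,t} + b^att_i
    alpha = torch.softmax(scores, dim=1)                  # normalize over time (eq. 6)
    return torch.einsum('bt,bdt->bd', alpha, h)           # weighted sum of h_t
```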

Feature Projections
We use a function g_p to project the last visual features V_N into a final representation:

u_{final} = g_p(V_N),     (8)

where g_p consists of a convolution with K 1 \times 1 kernels, a batch normalization afterwards, followed by global max pooling over all pixels (K = 512).
We also need a module g_m (equation (4)) to compute the language modulations from V_i, since V_i is a 3-d feature tensor (not a weighted-sum vector as in traditional visual attention). We choose g_m to have the same structure as g_p, except that K equals the total number of modulation parameters in each step. This design is critical (Section 4.3).
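Since g_p and g_m share one structure, both can be sketched with a single small module (assumed PyTorch; only the output width K differs between the two uses):

```python
import torch
import torch.nn as nn

class ConvProject(nn.Module):
    """Shared structure of g_p / g_m: 1x1 conv -> batch norm -> global max pool.

    For g_p, k = 512; for g_m, k = total number of language modulation
    parameters per step (2 * D for gamma and beta concatenated).
    """

    def __init__(self, in_ch, k):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, k, 1)  # K kernels of size 1x1
        self.bn = nn.BatchNorm2d(k)

    def forward(self, v):
        # v: (B, C, H, W) -> (B, K) via max over all spatial positions.
        x = self.bn(self.conv(v))
        return x.amax(dim=(2, 3))
```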
We use a fully connected layer with 1024 ReLU hidden units as our answer generator. It takes u_{final} as input, and predicts the most probable answer in the answer vocabulary.

Experiments
We are the first to achieve top results on both datasets (CLEVR and NLVR) with a single structure. See the Appendix for more ablation and visualization results.

CLEVR
CLEVR (Johnson et al., 2017a) is a commonly used visual reasoning benchmark containing 700,000 training samples and 150,000 each for validation and test. Questions in CLEVR cover several typical elements of reasoning: counting, comparing, querying attributes, etc. Many well-designed VQA models have failed on CLEVR, revealing the difficulty of handling the multi-step and compositional nature of logical questions. On the CLEVR dataset, we embed the question words into a 200-dim continuous space, and use a bi-directional GRU with 512 hidden units to generate 1024-dim question representations. Questions are padded with a NULL token to a maximum length T = 46. As the first-step question vector in CMM, q_0 can be any RNN hidden state in the set {h_1, ..., h_T} (Section 3.3); we choose the one at the end of the unpadded question.
In each ResBlock, the number of feature maps C is set to 128. Images are pre-processed with a ResNet101 network pre-trained on ImageNet (Russakovsky et al., 2015) to extract 1024 × 14 × 14 visual features (this is also common practice on CLEVR). We use a trainable one-layer CNN with 128 kernels (3 × 3) to encode the extracted features into V_0 (128 × 14 × 14). Convolutional paddings are used to keep the feature map size at 14 × 14 throughout the visual pipeline. We train the model with an Adam (Kingma and Ba, 2014) optimizer using a learning rate of 2.5e-4 and a batch size of 64 for about 90 epochs, then switch to SGD with the same learning rate and 0.9 momentum, fine-tuning for another 20 epochs. The SGD phase generally brings around 0.3 points of gain to CMM on CLEVR.
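The two-phase optimization schedule above can be sketched with a hypothetical helper (the switch point between phases is chosen by the experimenter; only the hyper-parameters are from the text):

```python
import torch

def make_optimizer(params, phase):
    """Return the optimizer for the current training phase:
    Adam at lr 2.5e-4 first, then SGD at the same lr with 0.9 momentum."""
    if phase == "adam":
        return torch.optim.Adam(params, lr=2.5e-4)
    return torch.optim.SGD(params, lr=2.5e-4, momentum=0.9)
```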
We achieve 98.6% accuracy with a single (4-step) model, significantly better than FiLM and other related work, and only slightly lower than CAN; however, CAN needs at least 8 model blocks to exceed 98% (and 12 for its best result). We achieve a state-of-the-art accuracy of 99.0% with an ensemble of 4/5/6-step CMM models. Table 1 shows test accuracies on all types of questions. The main improvements over program-generating models come from "Counting" and "Compare Numbers", indicating that the CMM framework significantly enhances language (especially numeric) reasoning without a sophisticated memory design like CAN's.

NLVR
NLVR (Suhr et al., 2017) is a visual reasoning dataset proposed by researchers in the NLP field. NLVR has 74,460 samples for training, 5,940 for validation, and 5,934 for public test. Each sample contains a human-posed natural-language description of an image with 3 sub-images, and requires a true/false response.
We use different preprocessing methods on NLVR. Before training, we reshape NLVR images into 14 × 56 raw pixels and use them directly as the visual input V_0. For the language part, we correct some obvious typos among the rare words (frequency < 5) in the training set, and pad the sentences to a maximum length of 26. Unlike on CLEVR, an LSTM works better than a GRU on these real-world sentences. For training, we use Adam with a learning rate of 3.5e-4 and a batch size of 128 for about 200 epochs, without SGD fine-tuning.
Our model (3-step, 69.9%) outperforms all previously proposed models on both the validation and public test sets, showing that CMM is also suitable for real-world language (Table 2).

Ablation Studies
We list the CMM ablation results in Table 3. Ablations on CLEVR show that CMM is robust to the number of steps but sensitive to the structure of g_m, because g_m is the key to multi-modal interaction; the design in Section 3.4 is currently the best choice we have found. CMM performs over 7 points higher than FiLM on NLVR in a setting with the same hyper-parameters and ResBlocks, showing the importance of handling language logics (see also the difficult CLEVR subtasks in Table 1).

Table 3: Ablation studies on CLEVR/NLVR. g_m-CNN means using a 2-layer CNN with 3 × 3 kernels, followed by concatenation and an MLP, as g_m. BN means batch normalization in g_m. NS means not sharing weights. "FiLM-hyp" uses all the same hyper-parameters as the 3-step CMM (both use a GRU as the question encoder).

A Case Study
We select an image-question pair from the validation set of CLEVR for visualization. In Table 4, we visualize the multi-step attention weights on the question words, and the distribution of argmax positions in the global max-pooling layer of g_p (equivalent to the last visual "attention map", although there is no explicit visual attention in our image-comprehension pipeline). On the bottom right is the original image, and on the top right is the distribution of argmax positions in the global max pooling, multiplied by the original image.

Our model attends to the phrases "same shape as" and "brown object" in the first two reasoning steps. These phrases are meaningful because "same shape as" is the core logic in the question, and "brown object" is the key entity for generating the correct answer. In the last step, the model attends to the phrase "is there". This implicitly classifies the question into the question type "exist", and directs the answer generator to answer "no" or "yes". The visual map, guided by question-based modulation parameters, concentrates on the green and brown objects correctly.

(According to NLVR rules, we will run on the unreleased test set (Test-U) in the near future.)
The result shows that visual features can guide the comprehension of question logics through textual modulation. On the other hand, question-based modulation parameters enable the ResBlocks to filter out irrelevant objects.

Conclusion
We propose CMM, a novel visual reasoning model that cascades visual and textual modulation in each step. CMM reaches state-of-the-art results on visual reasoning benchmarks with both synthetic and real-world languages.