Structure Learning for Neural Module Networks

Neural Module Networks, originally proposed for the task of visual question answering, are a class of neural network architectures that involve human-specified neural modules, each designed for a specific form of reasoning. In current formulations of such networks, only the parameters of the neural modules and/or the order of their execution are learned. In this work, we further expand this approach and also learn the underlying internal structure of the modules in terms of the ordering and combination of simple, elementary arithmetic operators. We utilize a minimal amount of prior knowledge from the human-specified neural modules, in the form of the different input types and the arithmetic operators used in these modules. Our results show that one is indeed able to simultaneously learn both the internal module structure and the module sequencing without extra supervisory signals for module execution sequencing. With this approach, we report performance comparable to models using hand-designed modules. In addition, we analyze the sensitivity of the learned modules w.r.t. the arithmetic operations and infer the analytical expressions of the learned modules.


Introduction
Designing general-purpose reasoning modules is one of the central challenges in artificial intelligence. Neural Module Networks [2] were introduced as a general-purpose visual reasoning architecture and have been shown to work well for the task of visual question answering [3,19,27,26]. They use dynamically composable modules which are assembled into a layout based on a syntactic parse of the question. The modules take images or attention maps as input and return attention maps or labels as output. In [9], the layout prediction is relaxed by learning a layout policy with a sequence-to-sequence RNN. This layout policy is jointly trained along with the parameters of the modules. The model proposed in [8] avoids the use of reinforcement learning to train the layout predictor, using soft program execution to learn both the layout and the module parameters jointly.
A fundamental limitation of these previous modular approaches to visual reasoning is that the modules need to be hand-specified. This might not be feasible when one has limited knowledge of the kinds of questions, or of the associated visual reasoning required to solve the task. In this work, we present an approach to learn the module structure, along with the parameters of the modules, in an end-to-end differentiable training setting. Our proposed model, the Learnable Neural Module Network (LNMN), learns the structure of the modules, their parameters, and the way to compose the modules from just the regular task loss. Our results show that we can learn the structure of the modules automatically and still perform comparably to hand-specified modules. We want to highlight that our goal in this paper is not to beat the performance of the hand-specified modules, since they are specifically engineered for the task. Instead, our goal is to explore the possibility of designing general-purpose reasoning modules in an entirely data-driven fashion.

Background
In this section, we describe the working of the Stack-NMN model [8], which our proposed LNMN model uses as its base. Stack-NMN is an end-to-end differentiable model for the tasks of Visual Question Answering and Referential Expression Grounding [28]. It addresses a major drawback of prior visual reasoning models in the literature: it implements compositional reasoning without the need for supervisory signals specifying the layout at training time. It consists of several hand-specified modules (namely Find, Transform, And, Or, Filter, Scene, Answer, Compare and NoOp) which are parameterized, differentiable, and implement common routines needed in visual reasoning, and it learns to compose them without strong supervision. The implementation details of these modules are given in Appendix A.2 (see Table 8). The different sub-components of the Stack-NMN model are described below.

Module Layout Controller
The structure of the controller is similar to the one proposed in [10]. The controller first encodes the question using a bi-directional LSTM [7]. Let [h_1, h_2, ..., h_S] denote the outputs of the Bi-LSTM at each time-step of the input sequence of question words, and let q denote the concatenation of the final hidden states of the forward and backward passes of the Bi-LSTM; q can be considered an encoding of the entire question. The controller executes the modules iteratively for T time-steps. At each time-step t, the updated query representation u is obtained as

u = W_2 [W_1 q + b_1 ; c_{t-1}] + b_2

where W_1, W_2, b_1 and b_2 are controller parameters, and c_{t-1} is the textual parameter from the previous time-step. The controller has two outputs, viz. the textual parameter at step t (denoted by c_t) and the attention over the modules (denoted by the vector w^{(t)}). The controller first predicts an attention cv_{t,s} on each word of the question and then uses this attention to take a weighted average over the outputs of the Bi-LSTM:

cv_{t,s} = softmax_s(W_3 (u ⊙ h_s)),    c_t = Σ_{s=1}^{S} cv_{t,s} · h_s

where W_3 ∈ R^{1×d} is another controller parameter. The attention over the modules w^{(t)} is obtained by feeding the query representation at each time-step to a Multi-layer Perceptron (MLP).
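The controller step above can be sketched in a few lines of numpy. This is only an illustration: the dimensions, the random stand-ins for the learned parameters, and the single-layer stand-in for the module-attention MLP are all hypothetical, and the Bi-LSTM encoding is assumed to be given.

```python
import numpy as np

rng = np.random.default_rng(0)
d, S, M = 8, 5, 4   # hidden size, number of question words, number of modules

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical controller parameters (learned in the real model).
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
b1 = np.zeros(d)
W2 = rng.normal(size=(d, 2 * d)) / np.sqrt(2 * d)
b2 = np.zeros(d)
W3 = rng.normal(size=(1, d))
W_mlp = rng.normal(size=(M, d))   # stand-in for the module-attention MLP

def controller_step(q, c_prev, h):
    """One controller time-step: update the query representation, attend
    over question words to produce the textual parameter c_t, and attend
    over the modules to produce w_t."""
    u = W2 @ np.concatenate([W1 @ q + b1, c_prev]) + b2  # updated query
    cv = softmax((u * h) @ W3.ravel())                   # word attention
    c_t = cv @ h                                         # textual parameter
    w_t = softmax(W_mlp @ u)                             # module attention
    return u, c_t, w_t

q = rng.normal(size=d)           # question encoding
h = rng.normal(size=(S, d))      # Bi-LSTM outputs per word
u, c_t, w_t = controller_step(q, np.zeros(d), h)
```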

Operation of Memory Stack for storing attention maps
In order to answer a visual reasoning question, the model needs to execute modules in a tree-structured layout. To facilitate this compositional behavior, a differentiable memory pool is used to store and retrieve intermediate attention maps. A memory stack (of length L) stores H × W dimensional attention maps, where H and W are the height and width of the image feature maps respectively. Depending on the number of attention maps a module requires as input, it pops them from the stack and later pushes its result back onto the stack. The model performs soft module execution by executing all modules at each time-step; the updated stack and stack pointer are obtained as a weighted average over modules, using the weights w^{(t)} predicted by the module layout controller:

A^{(t)} = Σ_m w_m^{(t)} · A_m^{(t)},    p^{(t)} = Σ_m w_m^{(t)} · p_m^{(t)}

Here, A_m^{(t)} and p_m^{(t)} denote the stack and stack pointer respectively after executing module m at time-step t on the averaged stack and pointer (A^{(t-1)}, p^{(t-1)}) from the previous time-step. The working of the module layout controller and its interfacing with the memory stack is illustrated in Algorithm 1. The implementation details of the stack operations are shown in the Appendix (see Algorithm 3).
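A minimal numpy sketch of such a differentiable stack may help: a soft pointer distribution selects where to read and write, and the per-module results are averaged with the controller weights. The pointer-shift and boundary handling below are simplifying assumptions; the actual implementation follows Algorithm 3 in the appendix.

```python
import numpy as np

L_stack, D = 4, 6    # stack length and flattened attention-map size (H*W)

def shift(p, k):
    """Shift the soft stack pointer by k positions (zero-padded)."""
    out = np.zeros_like(p)
    if k > 0:
        out[k:] = p[:-k]
    elif k < 0:
        out[:k] = p[-k:]
    else:
        out[:] = p
    return out

def push(A, p, a):
    """Soft push: move the pointer up and blend map `a` into the new top."""
    p_new = shift(p, 1)
    A_new = A * (1 - p_new[:, None]) + a[None, :] * p_new[:, None]
    return A_new, p_new

def pop(A, p):
    """Soft pop: read the map under the pointer, then move the pointer down."""
    return p @ A, shift(p, -1)

A = np.zeros((L_stack, D))
p = np.eye(L_stack)[0]              # hard pointer at the bottom of the stack
a = np.arange(D, dtype=float)
A1, p1 = push(A, p, a)
a_read, p0 = pop(A1, p1)            # recovers `a` for a hard pointer

# Soft module execution: average per-module stacks/pointers with weights w_t
w_t = np.array([0.7, 0.3])
A_t = np.tensordot(w_t, np.stack([A1, A]), axes=1)
p_t = w_t @ np.stack([p1, p])
```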

Final Classifier
At each time-step of module execution, the weighted average of the outputs of the Answer modules is called the memory features:

f^{(t)} = Σ_m w_m^{(t)} · y_m^{(t)}

where y_m^{(t)} denotes the output of Answer module m at time-step t. The memory features are given as one of the inputs to the Answer modules at the next time-step. The memory features at the final time-step are concatenated with the question representation and fed to an MLP to obtain the logits.
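This classifier path can be sketched as follows; the sizes are hypothetical and a single linear layer stands in for the MLP.

```python
import numpy as np

rng = np.random.default_rng(1)
M_ans, d_mem, d_q, n_ans = 2, 6, 8, 10   # hypothetical sizes

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# y[m]: output of Answer module m at this time-step; w_t: controller weights
y = rng.normal(size=(M_ans, d_mem))
w_t = softmax(rng.normal(size=M_ans))
f_t = w_t @ y                                  # memory features f^(t)

# Final time-step: concatenate with the question encoding q and classify
q = rng.normal(size=d_q)
W_cls = rng.normal(size=(n_ans, d_mem + d_q))  # stand-in for the MLP
logits = W_cls @ np.concatenate([f_t, q])
```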

Learnable Neural Module Networks
In this section, we introduce Learnable Neural Module Networks (LNMNs) for visual reasoning, which extends Stack-NMN. However, the modules in LNMN are not hand-specified. Rather, they are generic modules as specified below.

Structure of the Generic Module
The cell (see Figure 1) denotes a generic module, which we suppose can span all the required modules for a visual reasoning task. Each cell contains a certain number of nodes. The function of a node (denoted by O) is to perform a weighted sum of the outputs of different arithmetic operations applied to the input feature maps x_1 and x_2. Let ᾱ = σ(α) denote the output of the softmax function applied to the vector α, such that Σ_i ᾱ_i = 1. The output of a node is

O(x_1, x_2) = Σ_i ᾱ_i · op_i(x_1, x_2)

where op_i ranges over the operations min, max, + (sum), ⊙ (product), choose_1 and choose_2. All of the above operations (min, max, +, ⊙) are element-wise operations. The last two non-standard functions are defined as choose_1(x_1, x_2) = x_1 and choose_2(x_1, x_2) = x_2. We consider two broad kinds of modules: (i) Attention modules, which output an attention map, and (ii) Answer modules, which output memory features to be stored in the memory. Within each of these two categories, there is a finer categorization:
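The node computation can be written directly as a softmax-weighted sum over the six elementary operations; a minimal numpy sketch (shapes illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# The six elementary operations of a node (all element-wise).
OPS = [
    np.minimum,              # min
    np.maximum,              # max
    lambda a, b: a + b,      # sum
    lambda a, b: a * b,      # product
    lambda a, b: a,          # choose_1
    lambda a, b: b,          # choose_2
]

def node(x1, x2, alpha):
    """O(x1, x2) = sum_i softmax(alpha)_i * op_i(x1, x2)."""
    a_bar = softmax(alpha)
    return sum(w * op(x1, x2) for w, op in zip(a_bar, OPS))

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 0.5])
alpha = np.array([0.0, 0.0, 50.0, 0.0, 0.0, 0.0])  # strongly peaked on `sum`
out = node(x1, x2, alpha)                          # ~ x1 + x2
```

When α is strongly peaked (as the learned models turn out to be), the node degenerates to a single elementary operator.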

Generic Module with 3 inputs
This module type receives 3 inputs (image features, a textual parameter, and a single attention map) and produces a single output. The first node receives as input the image features (I) and the attention map (popped from the memory stack). The second node receives as input the textual parameter passed through a linear layer (W_1 c_txt), along with the output of the first node.

Generic Module with 4 inputs
This module type receives 4 inputs (image features, a textual parameter, and two attention maps) and produces a single output. The first node receives as input the two attention maps, each of which is popped from the memory stack. The second node receives as input the image features along with the output of the first node. The third node receives as input the textual parameter passed through a linear layer, along with the output of the second node.
For the Attention modules, the output of the final node is converted into a single-channel attention map using a 1 × 1 convolutional layer. For the Answer modules, the output of the final node is summed over the spatial dimensions, and the resulting feature vector is concatenated with the memory features of the previous time-step and the textual parameter features, then fed to a linear layer to output the memory features. Schematic diagrams of the Attention module with four inputs and of the Answer modules are given in Appendix A.1 (see Figures 5, 6, 7).
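Putting the pieces together, a 3-input Attention module can be sketched as two chained nodes followed by a 1 × 1 convolution (here reduced to a per-pixel dot product). All shapes and parameters below are hypothetical stand-ins for the learned ones.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W_sp, C, d_txt = 4, 4, 6, 8     # hypothetical spatial/channel/text sizes

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

OPS = [np.minimum, np.maximum, lambda a, b: a + b,
       lambda a, b: a * b, lambda a, b: a, lambda a, b: b]

def node(x1, x2, alpha):
    a_bar = softmax(alpha)
    return sum(w * op(x1, x2) for w, op in zip(a_bar, OPS))

I = rng.normal(size=(H, W_sp, C))                             # image features
att = softmax(rng.normal(size=H * W_sp)).reshape(H, W_sp, 1)  # popped map
c_txt = rng.normal(size=d_txt)                                # textual parameter
W1 = rng.normal(size=(C, d_txt))                              # linear layer on c_txt
conv1x1 = rng.normal(size=C)                # 1x1 conv to a single channel

alpha1, alpha2 = rng.normal(size=6), rng.normal(size=6)
out1 = node(I, att, alpha1)                            # node 1: image + attention
out2 = node((W1 @ c_txt)[None, None, :], out1, alpha2) # node 2: text + node-1 output
att_out = out2 @ conv1x1                               # (H, W) attention map
```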

Overall structure
The structure of our end-to-end model extends Stack-NMN in that we specify each module in terms of the generic module (defined in Section 3.1). We experiment with three model ablations in terms of the number of modules of each type being used; see Table 3 for details. We train the module structure parameters (denoted by α = {α_{m,k}} for the k-th node of module m) and the weight parameters (W) by taking alternating gradient descent steps in the architecture and weight spaces respectively. For a given epoch, the gradient step in weight space is performed on each training batch, and the gradient step in architecture space is performed on a batch randomly sampled from the validation set.
while not converged do
    1. Update weights W by descending ∇_W L_train(W, α)
    2. Update architecture α by descending ∇_α L_val(W, α)
Algorithm 2: Training algorithm. Here α = {α_{m,k}} for the k-th node of module m, and W denotes the collection of weight parameters of the modules and all other non-module parameters.
This is done to ensure that we find an architecture for the modules that has a low validation loss on the updated weight parameters. The approach is inspired by the technique used in [18] to learn monolithic architectures such as CNNs and RNNs in terms of basic building blocks (or cells). Algorithm 2 illustrates the training algorithm. Here, L_train(W, α) and L_val(W, α) denote the training loss and validation loss on the combination of parameters (W, α) respectively. For the gradient step on the training batch, we add an additional loss term that initially maximizes the entropy of w^{(t)}, and we gradually anneal the regularization coefficient (λ_w) to the opposite sign, which minimizes the entropy of w^{(t)} towards the end of training. The value of λ_w varies linearly from 1.0 to 0.0 over the first 20 epochs and then decreases steeply to −1.0 over the next 10 epochs. The trend of λ_w is shown in the Appendix (see Figure 4). For the gradient steps in architecture space, we add an additional ℓ2 loss term [11] to encourage sparsity of the module structure parameters (α) after the softmax activation. One NoOp module is included by default in all ablations.
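The alternating scheme of Algorithm 2 can be illustrated with scalar stand-ins for W and α. The quadratic losses below are toy examples (not the actual task losses): L_train = (W − α)² pulls the weights toward the current architecture, while L_val = (α − 3)² plays the role of the validation objective driving the architecture.

```python
# Toy scalar stand-ins for W (weights) and alpha (architecture parameters).
def grad_W_train(W, alpha):
    # d/dW of L_train(W, alpha) = (W - alpha)^2
    return 2.0 * (W - alpha)

def grad_alpha_val(W, alpha):
    # d/dalpha of L_val(W, alpha) = (alpha - 3)^2
    return 2.0 * (alpha - 3.0)

W, alpha, lr = 0.0, 5.0, 0.1
for _ in range(200):                            # "while not converged"
    W -= lr * grad_W_train(W, alpha)            # step 1: weights on L_train
    alpha -= lr * grad_alpha_val(W, alpha)      # step 2: architecture on L_val
```

Both variables converge (here to 3.0): the architecture follows the validation objective, and the weights track the architecture, mirroring the intent of the alternating updates.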

Experiments
We train our model on the CLEVR visual reasoning task. CLEVR [13] is a synthetic dataset for visual reasoning containing around 700K examples, and it has become the standard benchmark for testing visual reasoning models. It contains questions that test visual reasoning abilities such as counting, comparing, and logical reasoning over 3D shapes like cubes, spheres, and cylinders of varied colors. A typical question-image pair from this dataset is given in the Appendix (see Figure 3). The results on the CLEVR test set are reported in Table 1, and some ablations of the model are shown in Table 2. We use the pre-trained CLEVR model to fine-tune on the CLEVR-Humans dataset, a dataset of challenging human-posed questions based on a much larger vocabulary over the same CLEVR images; the corresponding results are shown in the last column of Table 1. In addition, we experiment on VQA v1 [3] and VQA v2 [6], which are VQA datasets containing natural images. The results for VQA v1 and VQA v2 are shown in Table 4. The detailed accuracy for each question sub-type on the VQA datasets is given in Appendix A.4 (see Tables 9 and 10). We use Adam [15] as the optimizer for the weight parameters, with a learning rate of 1e−4, (β_1, β_2) = (0.9, 0.999) and no weight decay. For the module structure parameters, we use the same optimizer with a learning rate of 3e−4, (β_1, β_2) = (0.5, 0.999) and a weight decay of 1e−3. The value of λ_op is set to 1.0.

Results
The comparison of CLEVR overall accuracy shows that our model (LNMN (9 modules)) suffers only a slight dip (1.53%) compared to the Stack-NMN model. We also experiment with other variants of our model in which we increase the number of Answer modules (LNMN (11 modules)) and/or Attention modules (LNMN (14 modules)). The ablations in Table 2 show that a naive concatenation of all inputs to a module (or cell) results in poor performance (around 47%). Thus, the structure we propose for fusing the inputs plays a key role in model performance. When we replace the ᾱ vector for each node by a one-hot vector during inference, the drop in accuracy is only 1.79%, which shows that the learned distribution over operation weights peaks at a specific operation, which is desirable.

Measuring the sensitivity of modules
We use an attribution technique called Integrated Gradients [31] to study the impact of the module structure parameters on the model's predictions. Note that attributions are defined relative to an uninformative input called the baseline; we use a vector of all zeros as the baseline for each α_{m,k}. Table 5 shows the results of this experiment.
The module structure parameters (α parameters) of the Answer modules have attributions to the final probability around 1-2 orders of magnitude higher than those of the rest of the modules. The higher influence of the Answer modules can be explained by the fact that they receive the memory features from the previous time-step, and the classifier receives the memory features of the final time-step. The job of the Attention modules is to utilize intermediate attention maps to produce new feature maps, which are used as input by the Answer modules.

Figure 2: Learned structure parameters of LNMN (11 modules). For each module, each row shows the ᾱ = σ(α) parameters of the corresponding node.
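For reference, Integrated Gradients itself is straightforward to implement. The sketch below uses a Riemann (midpoint) approximation of the path integral on a toy function with a hand-coded gradient; in the actual experiment, the answer probability is attributed to the α parameters with a zero baseline, and the gradients come from backpropagation.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    """IG_i = (x_i - x'_i) * integral_0^1 df/dx_i (x' + a(x - x')) da,
    approximated with a midpoint Riemann sum over `steps` points."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy example: f(x) = x0^2 + 3*x1 (gradient coded by hand).
f = lambda x: x[0] ** 2 + 3.0 * x[1]
grad_f = lambda x: np.array([2.0 * x[0], 3.0])
x = np.array([2.0, 1.0])
baseline = np.zeros(2)
attr = integrated_gradients(grad_f, x, baseline)
```

A useful sanity check is the completeness property: the attributions sum to f(x) − f(baseline).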

Visualization of module network parameters
In order to better interpret the individual contributions of each arithmetic operator to the modules, we plot them as color-maps for each type of module. The resulting visualizations are shown in Figure 2 for LNMN (11 modules). It is clear from the figure that the operation weights (the ᾱ parameters) are approximately one-hot for each node. This is necessary in order to learn modules that act as compositions of elementary operators on input feature maps rather than as a mixture of operations at each node. The corresponding visualizations for LNMN (9 modules) and LNMN (14 modules) are given in Figure 8 and Figure 9 respectively (both in Appendix A.3). The analytical expressions of the modules learned by LNMN (11 modules) are shown in Table 6. The diversity of the modules, as seen in their equations, indicates that distinct modules emerge from training.

Measuring the role of individual arithmetic operators
Each module (a.k.a. cell) contains nodes that involve six elementary arithmetic operations (min, max, sum, product, choose_1 and choose_2). We zero out the contribution of one of the arithmetic operations to the node output, for all nodes in all modules, and observe the degradation in CLEVR validation accuracy. The results of this study are shown in Table 7. The trend in overall accuracy shows that removing the max and product operators results in the largest drop in overall accuracy (∼50%), while removing operators such as min, sum and choose_1 results in a minimal drop.
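This ablation can be sketched by masking one operator's contribution inside a node. The shapes and the hand-set α (peaked on the product operator) are illustrative; in the real experiment the learned α are used and the mask is applied across all nodes of all modules.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

OP_NAMES = ["min", "max", "sum", "product", "choose_1", "choose_2"]
OPS = [np.minimum, np.maximum, lambda a, b: a + b,
       lambda a, b: a * b, lambda a, b: a, lambda a, b: b]

def node(x1, x2, alpha, ablate=None):
    """Node output with the contribution of operator `ablate` zeroed out."""
    a_bar = softmax(alpha)
    out = np.zeros_like(x1, dtype=float)
    for name, w, op in zip(OP_NAMES, a_bar, OPS):
        if name != ablate:
            out = out + w * op(x1, x2)
    return out

x1 = np.array([2.0, 3.0])
x2 = np.array([4.0, 5.0])
alpha = np.array([0.0, 0.0, 0.0, 50.0, 0.0, 0.0])  # peaked on `product`
full = node(x1, x2, alpha)                 # ~ x1 * x2
no_prod = node(x1, x2, alpha, "product")   # collapses toward zero
no_min = node(x1, x2, alpha, "min")        # nearly unchanged
```

Ablating the operator that a node relies on destroys its output, while ablating an unused operator has almost no effect, matching the intuition behind Table 7.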

Related Work
Neural Architecture Search: Neural Architecture Search (NAS) is a technique to automatically learn the structure and connectivity of neural networks rather than training human-designed architectures. In [33], a recurrent neural network (RNN) based controller is used to predict the hyper-parameters of a CNN, such as the number of filters, stride, and kernel size. They use REINFORCE [32] to train the controller, with validation-set accuracy as the reward signal. As an alternative to reinforcement learning, evolutionary algorithms [30] have been used to perform architecture search in [25,21,17,24]. Recently, [18] proposed a differentiable approach to architecture search and reported success in discovering high-performance architectures for both image classification and language modeling. [16] proposes an EM-style algorithm to learn black-box modules and their layout for image recognition and language modeling tasks.

Visual Reasoning Models: Among the end-to-end models for visual reasoning, FiLM [23] uses Conditional Batch Normalization (CBN) [4,5] to modulate the channels of input convolutional features in a residual block. [10] obtains features by iteratively applying a Memory-Attention-Control (MAC) cell that learns to retrieve information from the image and aggregate the results into a recurrent memory. [29] constructs the feature representation by taking into account the relational interactions between objects in the image. With regard to modular approaches, [2] proposes to compose neural network modules (with shared parameters) for each input question based on the layout predicted by a syntactic parse of the question. [1] extends this approach to question answering in a database domain. In [9], the layout prediction is relaxed by learning a layout policy with a sequence-to-sequence RNN; this layout policy is jointly trained along with the parameters of the modules.
In [14], the modules are convolutional residual blocks; the program generator is learned separately and then fine-tuned along with the modules. TbD-net [20] builds upon End-to-End Module Networks [9] but makes an important change: the proposed modules explicitly utilize the attention maps passed as inputs instead of learning whether or not to use them. This makes the modules more interpretable, since each performs a specific function.
Visual Question Answering: Visual question answering requires a learning model to answer sophisticated queries about visual inputs. Significant progress has been made in this direction to design neural networks that can answer queries about images. This can be attributed to the availability of relevant datasets which capture real-life images like DAQUAR [19], COCO-QA [26] and most recently VQA (v1 [3] and v2 [6]). The most common approaches [27,22] to this problem include construction of a joint embedding of question and image and treating it as a classification problem over the most frequent set of answers. Recent works [12,13] have shown that the neural networks tend to exploit biases in the datasets without learning how to reason.

Conclusion
We have presented a differentiable approach to automatically learn the modules needed in a visual reasoning task. With this approach, we obtain results comparable to an analogous model in which the modules are hand-specified for a particular visual reasoning task. In addition, we present an extensive analysis of the degree to which each module influences the prediction of the model, the effect of each arithmetic operation on overall accuracy, and the analytical expressions of the learned modules. In the future, we would like to benchmark this generic learnable neural module network on various other visual reasoning and visual question answering tasks.

Module      Inputs      Output      Implementation
Find        (none)      attention   a_out = conv_2(conv_1(x) ⊙ W c)
Transform   a           attention   a_out = conv_2(conv_1(x) ⊙ W_1 Σ(a ⊙ x) ⊙ W_2 c)
And         a_1, a_2    attention   a_out = minimum(a_1, a_2)
Or          a_1, a_2    attention   a_out = maximum(a_1, a_2)
Filter      a           attention   a_out = And(a, Find()), i.e. reusing Find and And
Scene       (none)      attention   a_out = conv_1(x)
Answer      a           answer      y = W_1^T (W_2 Σ(a ⊙ x) ⊙ W_3 c)
Compare     a_1, a_2    answer      y = W_1^T (W_2 Σ(a_1 ⊙ x) ⊙ W_3 Σ(a_2 ⊙ x) ⊙ W_4 c)
NoOp        (none)      (none)      (does nothing)
Table 8: Neural modules used in [8]. The modules take image attention maps as inputs and output either a new image attention a_out or a score vector y over all possible answers (⊙ is element-wise multiplication; Σ is sum over spatial dimensions).

Table 10: Test accuracy on VQA v2 [6].