History for Visual Dialog: Do we really need it?

Visual Dialog involves “understanding” the dialog history (what has been discussed previously) and the current question (what is asked), in addition to grounding information in the image, to generate the correct response. In this paper, we show that co-attention models which explicitly encode dialog history outperform models that don’t, achieving state-of-the-art performance (72% NDCG on val set). However, we also expose shortcomings of the crowd-sourcing dataset collection procedure by showing that dialog history is indeed only required for a small amount of the data, and that the current evaluation metric encourages generic replies. To that end, we propose a challenging subset (VisDialConv) of the VisDial val set and provide a benchmark NDCG of 63%.


Introduction
Recently, there has been an increased interest in visual dialog, i.e. dialog-based interaction grounded in visual information (Chattopadhyay et al., 2017; De Vries et al., 2017; Seo et al., 2017; Guo et al., 2018; Shekhar et al., 2018; Kottur et al., 2019; Haber et al., 2019). One of the most popular test beds is the Visual Dialog Challenge (VisDial) (Das et al., 2017), which involves an agent answering questions related to an image by selecting the answer from a list of candidate options. According to the authors, nearly all interactions (98%) contain dialog phenomena, such as co-reference, that can only be resolved using dialog history, which makes this a distinct task from previous Visual Question Answering (VQA) challenges, e.g. (Antol et al., 2015). For example, in order to answer the question "About how many?" in Figure 1, we have to infer from what was previously said that the conversation is about the skiers.
In the original paper, Das et al. (2017) find that models which structurally encode dialog history, such as Memory Networks (Bordes et al., 2016) or Hierarchical Recurrent Encoders (Serban et al., 2017), improve performance. However, "naive" history modelling (in this case an encoder with late fusion/concatenation of current question, image and history encodings) might actually hurt performance. Massiceti et al. (2018) take this even further, claiming that VisDial can be modelled without taking history or even visual information into account. Das et al. (2019) rebutted this by showing that both features are still needed to achieve state-of-the-art (SOTA) results, provided an appropriate evaluation procedure is used.
In this paper, we show that competitive results on VisDial can indeed be achieved by replicating the top performing model for VQA (Yu et al., 2019b), effectively treating visual dialog as multiple rounds of question-answering, without taking history into account. However, we also show that these results can be significantly improved by encoding dialog history, as well as by fine-tuning on a more meaningful retrieval metric. Finally, we show that more sophisticated dialog encodings outperform naive fusion on a subset of the data which contains "true" dialog phenomena according to crowd-workers. In contrast to previous work on the VisDial dataset, e.g. (Kottur et al., 2018; Agarwal and Goyal, 2018; Gan et al., 2019; Guo et al., 2019; Kang et al., 2019), we are the first to conduct a principled study of dialog history encodings. Our contributions can thus be summarized as follows:
• We present SOTA results on the VisDial dataset using transformer-based Modular Co-Attention (MCA) networks. We further show that models encoding dialog history outperform VQA models on this dataset.
• We show that curriculum fine-tuning (Bengio et al., 2009) on annotations of semantically equivalent answers further improves results.
• We experiment with different dialog history encodings and show that early fusion, i.e. dense interaction with visual information (either via grounding or guided attention), works better for cases where conversational historical context is required.
• We release a crowd-sourced subset containing verified dialog phenomena and provide benchmark results for future research.

Visual Dialog Models
In this section, we extend Modular Co-Attention Networks, which won the VQA challenge 2019 (Yu et al., 2019b), and adapt them to visual dialog. Different from previous co-attention networks (Kim et al., 2018; Nguyen and Okatani, 2018), MCA networks use guided attention to model dense relations between the question and image regions for better visual grounding. In the following, we explore MCA networks with different input encodings, following a '[model]-[input]' convention to refer to our MCA model variants; see Figure 3 for an overview. Whenever unspecified, images are represented as a bag of bottom-up features, i.e. object-level representations (see Section 3).

Modular Co-Attention networks
The MCA module with multi-modal fusion, as depicted in Figure 2, is common to all our architectures. Inspired by transformers (Vaswani et al., 2017), the MCA network (Yu et al., 2019b) is a modular composition of two basic attention units: self-attention and guided attention. These are arranged in an encoder-decoder composition in the MCA module (Figure 2), which performed best for VQA (Yu et al., 2019b).

Self-Attention and Guided-Attention
The Self-Attention (SA) unit in transformers (Vaswani et al., 2017) is composed of a multi-head attention layer followed by a feed-forward layer. When applied to vision, the SA unit can be viewed as selecting the most relevant object-level image features for the downstream task. Specifically, the scaled dot-product attention takes as input query, key and value (usually the same modality's embedded representations) and outputs a self-attended vector: Att(Q, K, V) = softmax(QK^T / √d) V (Eq. 1). Multi-head attention provides multiple representation spaces to capture different linguistic/grounding phenomena, which are otherwise lost by averaging with a single head.
The Guided-Attention (GA) unit conditions the attention on a different sequence: the key and value come from one modality, while the query comes from the other, similar to the decoder architecture in Transformers (Vaswani et al., 2017). Similar to Eq. 1, the GA unit outputs features f_i = Att(X, Y, Y), where X ∈ R^{m×d_x} comes from one modality and Y ∈ R^{n×d_y} from the other. Residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) are applied to the output of both the attention and feed-forward layers in the SA and GA units, following (Vaswani et al., 2017; Yu et al., 2019b).
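To make the two units concrete, below is a minimal PyTorch sketch of how they can be implemented; this is our own illustration (module names, default dimensions and dropout values are assumptions), not the authors' released code.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """One SA unit: multi-head self-attention + feed-forward,
    each followed by a residual connection and layer normalisation."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])   # query = key = value = x
        return self.norm2(x + self.ff(x))

class GuidedAttention(nn.Module):
    """One GA unit: queries come from modality X, keys/values from modality Y (f = Att(X, Y, Y))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, y):
        x = self.norm1(x + self.attn(x, y, y)[0])   # x attends over the other modality y
        return self.norm2(x + self.ff(x))
```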

Modular Co-Attention Module
The following description of the MCA module is based on the question and the image, but can be extended analogously to model the interaction between the question and the history. First, the input (i.e. the question) is passed through a stack of L multi-head self-attention layers in order to obtain self-aware representations before acting as a conditioning signal for the other modality (visual features or contextual history), similar to the auto-encoding procedure of Transformers. The final representation X^L is then used as input to the GA units to model cross-modal dependencies and learn the final conditioned representation Y^L.
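Reusing the SelfAttention and GuidedAttention sketches above, the encoder-decoder arrangement of the MCA module could look as follows. This is a sketch of our reading of Yu et al. (2019b), not their implementation.

```python
class MCAModule(nn.Module):
    """Encoder-decoder Modular Co-Attention: L self-attention layers over X (e.g. the question),
    then L (self-attention + guided-attention) blocks over Y (e.g. image regions),
    where each guided-attention layer is conditioned on the final representation X^L."""
    def __init__(self, n_layers=6, d_model=512, n_heads=8):
        super().__init__()
        self.enc = nn.ModuleList([SelfAttention(d_model, n_heads) for _ in range(n_layers)])
        self.dec_sa = nn.ModuleList([SelfAttention(d_model, n_heads) for _ in range(n_layers)])
        self.dec_ga = nn.ModuleList([GuidedAttention(d_model, n_heads) for _ in range(n_layers)])

    def forward(self, x, y):
        for sa in self.enc:                 # self-aware representation X^L
            x = sa(x)
        for sa, ga in zip(self.dec_sa, self.dec_ga):
            y = ga(sa(y), x)                # Y attends to itself, then is guided by X^L
        return x, y                         # X^L and the conditioned representation Y^L
```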

Multi-modal fusion
The learned representations X^L ∈ R^{m×d} and Y^L ∈ R^{n×d} contain the contextualized and conditioned representations over the word and image regions, respectively. We apply attention reduction (Yu et al., 2019b) with a multi-layer perceptron (MLP) to X^L (and analogously to Y^L):

α_x = softmax(MLP(X^L)),   x̃ = Σ_{i=1}^{m} α_{x,i} x_i^L,

where α_x are the attention weights (the same process yields α_y and ỹ). We then obtain the final multi-modal fused representation z as

z = LayerNorm(W_x^⊤ x̃ + W_y^⊤ ỹ),

where W_x, W_y ∈ R^{d×d_z} are linear projection matrices (dimensions are kept the same for simplicity).
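A sketch of the attention reduction and fusion step, mirroring the equations above (our own code; the 2-layer MLP shape follows the description in Appendix A).

```python
class AttentionReduce(nn.Module):
    """Attention reduction: an MLP scores each position, softmax weights pool the sequence."""
    def __init__(self, d_model=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Dropout(0.2), nn.Linear(d_model, 1))

    def forward(self, x):                            # x: (batch, seq, d)
        alpha = torch.softmax(self.mlp(x), dim=1)    # attention weights over the sequence
        return (alpha * x).sum(dim=1)                # pooled vector: (batch, d)

class MultiModalFusion(nn.Module):
    """Fuse the reduced question and image vectors: z = LayerNorm(W_x x̃ + W_y ỹ)."""
    def __init__(self, d_model=512, d_z=512):
        super().__init__()
        self.reduce_x, self.reduce_y = AttentionReduce(d_model), AttentionReduce(d_model)
        self.proj_x, self.proj_y = nn.Linear(d_model, d_z), nn.Linear(d_model, d_z)
        self.norm = nn.LayerNorm(d_z)

    def forward(self, x_l, y_l):
        return self.norm(self.proj_x(self.reduce_x(x_l)) + self.proj_y(self.reduce_y(y_l)))
```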
We call this model MCA with Image component only (MCA-I), since it only encodes the question and image features and therefore treats each question in Visual Dialog as an independent instance of VQA, without conditioning on the historical context of the interaction.

Variants with Dialog History
In the following, we extend the above framework to model dialog history. We experiment with late/shallow fusion of history and image (MCA-I-H), as well as with modelling dense interaction between conversational history and the image representation (i.e. MCA-I-VGH and MCA-I-HGuidedQ).

History guided Question (MCA-I-HGuidedQ):
The network in Figure 3a is designed to model co-reference resolution, which can be considered the primary task in VisDial (Kottur et al., 2018). We first enrich the question embedding by conditioning it on the historical context using guided attention in the MCA module. We then use this enriched (co-reference resolved) question to model the visual interaction as described in Section 2.1.
Visually grounded history with image representation (MCA-I-VGH): Instead of considering conversational history and the visual context as two different modalities, we now ground the history with the image first, see Figure 3b. This is similar in spirit to maintaining a pool of visual attention maps (Seo et al., 2017), where we argue that different questions in the conversation attend to different parts of the image. Specifically, we let the history attend to object-level image features using the MCA module to obtain a visually grounded contextual history. We then embed the question to pool the relevant grounded history using another MCA module. In parallel, the question embedding is also used to ground the current visual context. At the final step, the respective current image and historical components are fused together and passed through a linear layer before decoding. Note that this model is generic enough to potentially handle multiple images in a conversation and could thus be extended to tasks such as conversational image editing, which is one of the target applications of visual dialog (Kim et al., 2017; Manuvinakurike et al., 2018a,b; Lin et al., 2018; El-Nouby et al., 2018).
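A schematic sketch of this variant is given below, reusing the modules sketched earlier. The exact wiring of the three MCA modules (which modality conditions which) is our reading of the description and Figure 3b, not the authors' implementation.

```python
class MCA_I_VGH(nn.Module):
    """Sketch of MCA-I-VGH: history is first grounded on the image, then the question
    pools the grounded history and, in parallel, grounds the current visual context."""
    def __init__(self, n_layers=6, d_model=512, n_heads=8):
        super().__init__()
        self.ground_history = MCAModule(n_layers, d_model, n_heads)  # history attends to image regions
        self.pool_history = MCAModule(n_layers, d_model, n_heads)    # question pools grounded history
        self.ground_image = MCAModule(n_layers, d_model, n_heads)    # question grounds the image
        self.fuse_hist = MultiModalFusion(d_model)
        self.fuse_img = MultiModalFusion(d_model)
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, question, history, image):
        _, vg_hist = self.ground_history(image, history)   # visually grounded contextual history
        q_h, h_l = self.pool_history(question, vg_hist)    # question-conditioned history component
        q_v, v_l = self.ground_image(question, image)      # question-conditioned visual component
        fused = torch.cat([self.fuse_img(q_v, v_l), self.fuse_hist(q_h, h_l)], dim=-1)
        return self.out(fused)                             # fused encoding passed to the decoder
```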

Two-stream Image and History component (MCA-I-H):
Figure 3c shows the model, which maintains two streams of modular co-attention networks: one for the visual modality and the other for the conversational history. We follow the same architecture as MCA-I for the visual component and duplicate the structure for handling conversational history. At the final step, we concatenate both embeddings and pass them through a linear layer.
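For comparison, a sketch of this simpler two-stream, late-fusion variant (again our own illustration, reusing the modules sketched above).

```python
class MCA_I_H(nn.Module):
    """Sketch of two-stream MCA-I-H: one MCA stream grounds the question in the image,
    a duplicate stream attends over the dialog history; the two fused vectors are
    concatenated and projected (late/shallow fusion)."""
    def __init__(self, n_layers=6, d_model=512, n_heads=8):
        super().__init__()
        self.visual_stream = MCAModule(n_layers, d_model, n_heads)
        self.history_stream = MCAModule(n_layers, d_model, n_heads)
        self.fuse_v = MultiModalFusion(d_model)
        self.fuse_h = MultiModalFusion(d_model)
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, question, image, history):
        q_v, v_l = self.visual_stream(question, image)     # question conditions image features
        q_h, h_l = self.history_stream(question, history)  # question conditions history turns
        z = torch.cat([self.fuse_v(q_v, v_l), self.fuse_h(q_h, h_l)], dim=-1)
        return self.out(z)                                 # fused encoding passed to the decoder
```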

Decoder and loss function
For all the models described above, we use a discriminative decoder which computes the similarity between the fused encoding and the RNN-encoded answer representations; the resulting scores are passed through a softmax layer to obtain a probability distribution p over the candidate answers. We train using cross entropy over the ground-truth answer,

L = − Σ_{n=1}^{N} y_n log p_n,

where N denotes the number of candidate answers (set to 100 for this task) and y_n is the label, which is 0 or 1 during training, or the relevance score of the option during fine-tuning (casting the task as multi-label classification).
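A minimal sketch of the discriminative decoder and the two loss variants. The dot-product similarity and the normalisation of the relevance targets during fine-tuning are our assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def discriminative_scores(fused, answer_encodings):
    """Similarity between the fused dialog encoding (batch, d) and each of the
    N RNN-encoded candidate answers (batch, N, d): one score per candidate."""
    return torch.einsum('bd,bnd->bn', fused, answer_encodings)

def sparse_loss(scores, gt_index):
    """Sparse-annotation phase: standard cross entropy against the single GT answer."""
    return F.cross_entropy(scores, gt_index)

def dense_loss(scores, relevance):
    """Fine-tuning phase (our reading): multi-label cross entropy where the targets are
    the crowd-sourced relevance scores, here normalised to sum to one per instance."""
    target = relevance / relevance.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return -(target * F.log_softmax(scores, dim=1)).sum(dim=1).mean()
```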
We use the Adam optimizer (Kingma and Ba, 2015) both for training and fine-tuning. More training details can be found in Appendix A.
Task Description

Dataset
We use VisDial v1.0 for our experiments and evaluation (following the guidelines on the dataset page, we report results only on v1.0 instead of v0.9; VisDial v1.0 has been used consistently for the Visual Dialog Challenge 2018 and 2019). The dataset contains 123K/2K/8K dialogs for the train/val/test sets, respectively. Each dialog is crowd-sourced on a different image and consists of 10 rounds of dialog turns, totalling approx. 1.3M turns. Each question is paired with a list of 100 automatically generated candidate answers, which the model has to rank. To account for the fact that there can be more than one semantically correct answer (e.g. "Nope", "No", "None", "Cannot be seen"), "dense annotations" have been provided for 2k/2k turns of the train/val data, i.e. a crowd-sourced relevance score between 0 and 1 (1 being totally relevant) for all 100 options.
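For illustration, a single dialog instance can be thought of as follows. This is a sketch with field names of our own choosing, not the official JSON schema of the dataset.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DialogTurn:
    """One of the 10 QA rounds in a VisDial dialog."""
    question: str
    answer: str                                 # ground-truth answer
    answer_options: List[str]                   # 100 candidate answers the model has to rank
    gt_index: int                               # index of the GT answer in answer_options
    relevance: Optional[List[float]] = None     # dense annotations in [0, 1], where available

@dataclass
class VisDialInstance:
    image_id: int
    caption: str
    turns: List[DialogTurn]                     # 10 rounds per image
```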

Evaluation protocol
As the Visual Dialog task has been posed as a ranking problem, standard information retrieval (IR) metrics are used for evaluation: Recall@{1,5,10} to measure performance in the top N results (higher is better), mean reciprocal rank (MRR) of the Ground-Truth (GT) answer (higher is better), and mean rank of the GT answer (lower is better). Normalized Discounted Cumulative Gain (NDCG) is another measure of ranking quality, which is commonly used when there is more than one correct answer (provided with their relevance scores).
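For reference, a sketch of how NDCG can be computed in this setting, where K is taken to be the number of candidates with non-zero relevance. This is our re-implementation of the common formulation; the official evaluation script may differ in details.

```python
import numpy as np

def ndcg(pred_scores, relevance):
    """NDCG over the model's top-K candidates, K = number of candidates with relevance > 0."""
    pred_scores = np.asarray(pred_scores, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    k = int((relevance > 0).sum())
    if k == 0:
        return 0.0
    order = np.argsort(-pred_scores)[:k]                        # model's top-K candidates
    discounts = 1.0 / np.log2(np.arange(2, k + 2))              # 1/log2(rank + 1)
    dcg = (relevance[order] * discounts).sum()
    ideal = (np.sort(relevance)[::-1][:k] * discounts).sum()    # best possible ordering
    return dcg / ideal
```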

Training details
Sparse Annotation Phase: We first train on sparse annotations, i.e. only the single provided ground-truth answer, which is available for the whole training set. Here the model learns to select only one relevant answer.
Curriculum Fine-tuning Phase: Dense annotations, i.e. crowd-sourced relevance weights, are provided for 0.16% of the training set, which we use to fine-tune the model to select multiple semantically equivalent answers. This acts like a curriculum learning setup (Elman, 1993; Bengio et al., 2009), where selecting one answer using sparse annotations is the easier task and fine-tuning on dense annotations the more difficult one. When reporting results on the test set in Table 2, we use the leaderboard scores published online, which contain further unpublished enhancements based on ensembling (MReaL-BDAI).

Results
In the following, we report results on the VisDial v1.0 val set (Table 1), as well as on the test-std set (Table 2). Compared to MCA-I, which treats the task as multiple rounds of VQA, encoding history improves results, but only significantly so for MCA-I-VGH in the sparse annotation phase. After fine-tuning, MCA-I-VGH and MCA-I-H perform equally. MCA-I-H implements a late/shallow fusion of history and image. Architectures which model dense interaction between the conversational history and the image representations (i.e. MCA-I-VGH, MCA-I-HGuidedQ) perform comparably; only MCA-HConcQ performs significantly worse. Note that MCA-I also outperforms the baselines and the current SOTA by a substantial margin (both in the sparse annotation phase and the curriculum fine-tuning phase), while, counter-intuitively, there is no significant boost from adding conversational history. This is surprising, considering that according to Das et al. (2017), 38% of questions contain a pronoun, which would suggest that these questions require dialog history in order to be "understood/grounded" by the model.
Furthermore, curriculum fine-tuning significantly improves performance with an average improvement of 11.7 NDCG points, but worsens performance in terms of the other metrics, which only consider a single ground truth (GT) answer.

Error Analysis
In the following, we perform a detailed error analysis, investigating the benefits of dialog history encoding and the observed discrepancy between the NDCG results and the other retrieval-based metrics.

Dialog History
We performed an ablation study in which we did not include the caption as part of the historical context and compared with the results in Table 1. The performance dropped from (NDCG 72.2, MRR 42.3) to (NDCG 71.6, MRR 40.7) for our best performing MCA-I-H model after fine-tuning. Since the crowd-sourced conversation was based on the caption, the reduced performance was expected.
In order to further verify the role of dialog history, we conduct a crowd-sourcing study to understand which questions require dialog history in order to be understood by humans. We first test our history-encoding models on a subset (76 dialogs) of the recently released VisPro dataset (Yu et al., 2019a), which focuses on the task of Visual Pronoun Resolution. Note, however, that VisPro also contains non-referential pleonastic pronouns, i.e. pronouns used as "dummy subjects", e.g. when talking about the weather ("Is it sunny?"). We thus create a new crowd-sourced dataset, which we call VisDialConv. This is a subset of the VisDial val set consisting of 97 dialogs, where the crowd-workers identified single turns (with dense annotations) requiring historical information. In particular, we asked crowd-workers whether they could provide an answer to a question given an image, without showing them the dialog history, and to select one of the categories in Table 4 (see further details in Appendix B).
In order to get reliable results, we recruited 3 crowd-workers per image-question pair and only kept instances where at least 2 people agreed. Note that we only had to discard 14.5% of the original 1035 image-question pairs, leaving us with 885 examples.

Table 3: Automatic evaluation on the subsets of the VisPro and VisDialConv datasets. We find that the history-based MCA models significantly outperform the MCA-I model. On VisDialConv, MCA-I-VGH still outperforms all other models in the sparse annotation phase, while MCA-I-HGuidedQ performs best after fine-tuning.

The results in Table 4 show that only 11% of questions required actual dialog historical context according to the crowd-workers. Most of the time (67% of cases), crowd-workers said they could answer the question correctly without requiring history. The results in Table 3 are on the subset of 97 questions which the crowd-workers identified as requiring history.
They show that the history-encoding models (MCA-I-HGuidedQ / MCA-I-HConcQ / MCA-I-H / MCA-I-VGH) significantly outperform MCA-I, suggesting that this data cannot be modelled as multiple rounds of VQA. It can also be seen that all models with dense (early) interaction of the historical context outperform the one with late interaction (MCA-I-H) in terms of NDCG. Models with dense interactions appear to be more reliable in choosing other correct relevant answers because of the dialog context.
Our best performing model on VisDialConv is MCA-I-HGuidedQ, which achieves an NDCG value of 62.9 after curriculum fine-tuning. However, on the VisPro subset, we observe that MCA-I-H still outperforms the other models. Interestingly, on this set, MCA-I also outperforms the other history-encoding models (except for MCA-I-H).
In sum, our analysis shows that only a small subset of the VisDial dataset contains questions which require dialog history, and that for those, models which encode history lead to better results. We posit that this is due to the fact that questions with pleonastic pronouns, such as "Is it sunny/daytime/day...", are the most frequent, according to our detailed analysis of the dialog phenomena in Appendix C.

Table 5: Relevance scores (dense annotations) provided for 2k/2k train/val QA turns. We find that 20% of the ground-truth answers were marked as irrelevant (score 0) or only partially relevant (score 0.5) by the human annotators for the train set. This can be attributed to human errors made while collecting the original data as well as when crowd-sourcing the dense annotations.

Dense Annotations for NDCG
Here, we investigate the discrepancy between the NDCG results and the other retrieval-based metrics. First, we find that the annotation scales differ: while there is a 3-way annotation on the train set, the val set defines 6 possible relevance classes, see Table 5. This affects the evaluation results of our model, which we cannot control for. Next, a manual inspection reveals that the relevance weight annotations contain substantial noise: we find that ground-truth answers were marked as irrelevant for about 20% of the train and 10% of the val set. Thus, our models seem to get "confused" by fine-tuning on this data. We therefore manually corrected the relevance of these GT answers only (in the dense annotations of the train set, but not in the val set); see Appendix D for further details. The results in Table 1 (for MCA-I-H-GT) show that the model fine-tuned on the corrected data still achieves a comparable NDCG result, but substantially improves on the stricter (single answer) metrics, which confirms our hypothesis.

Discussion and Related Work
Our results suggest that the VisDial dataset contains only very limited examples which require dialog history. Other visual dialog tasks, such as GuessWhich? (Chattopadhyay et al., 2017) and GuessWhat?! (De Vries et al., 2017), take place in a goal-oriented setting, which, according to Schlangen (2019), will lead to data containing more natural dialog phenomena. However, there is very limited evidence that dialog history indeed matters for these tasks (Yang et al., 2019). As such, we see data collection that captures visual dialog phenomena as an open problem.
Nevertheless, our results also show that encoding dialog history still leads to improved results. This is in contrast with earlier findings that a) "naive" history encoding will harm performance (Das et al., 2017) and b) that VisDial can be modelled without history (Massiceti et al., 2018). At the same time, we note that at least 13.3% of answers are non-committal ("I cannot tell", "Not sure", "I can't tell").
Furthermore, we find that our model learns to provide generic answers by taking advantage of the NDCG evaluation metric. Learning generic answers is a well-known problem for open-domain dialog systems, e.g. (Li et al., 2016). While the dialog community approaches these phenomena by, e.g., learning better models of coherence (Xu et al., 2018), we believe that evaluation metrics also need to be improved for this task, as widely discussed for other generation tasks, e.g. (Liu et al., 2016; Novikova et al., 2017; Reiter, 2018). As a first step, BERTScore (Zhang et al., 2019) could be explored to measure ground-truth similarity, replacing the noisy NDCG annotations of semantic equivalence.

Conclusion and Future Work
In sum, this paper shows that we can achieve SOTA performance on the VisDial task by using transformer-based models with Guided-Attention (Yu et al., 2019b), and that by encoding dialog history and fine-tuning we can improve results even further.
Of course, we expect pre-trained visual BERT models, e.g. ViLBERT (Lu et al., 2019), LXMERT (Tan and Bansal, 2019), UNITER (Chen et al., 2019), to show even further improvements on this task. However, we also show the limitations of this shared task in terms of dialog phenomena and evaluation metrics. We thus argue that progress needs to be carefully measured by posing the right task in terms of dataset and evaluation procedure.
Appendix A: Training Details

We follow (2018) for image feature extraction and use a static 36 as the number of object proposals in our experiments (though our model can handle a dynamic number of proposals).
We experimentally determined learning rates of 0.0005 for training the MCA models and 0.0001 for fine-tuning. During training, the learning rate is reduced to 1/10 of its value after epochs 7 and 10 (out of 12 epochs in total); during fine-tuning, it is reduced to 1/5 after 2 epochs.
We use PyTorch's LambdaLR scheduler during training and ReduceLROnPlateau for the fine-tuning procedure. Dropout of 0.2 is used for regularization, and we perform early stopping, saving the best model by tracking the NDCG value on the val set. Layer normalisation (Ba et al., 2016) is used for stable training, following (Vaswani et al., 2017; Yu et al., 2019b). The attention reduction consists of a 2-layer MLP (fc(d)-ReLU-Dropout(0.2)-fc(1)).
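Putting the above hyper-parameters together, the optimizer and schedulers might be configured roughly as follows. This is a sketch; the exact epoch thresholds and the ReduceLROnPlateau patience are our reading of the description, not values taken from the released code.

```python
import torch

def build_optim(model, fine_tuning=False):
    """Adam with lr 5e-4 (training) or 1e-4 (fine-tuning); during training the LR is cut
    to 1/10 after epochs 7 and 10, during fine-tuning ReduceLROnPlateau cuts it to 1/5
    while tracking NDCG on the val set."""
    lr = 1e-4 if fine_tuning else 5e-4
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    if fine_tuning:
        # mode='max' because the tracked quantity (NDCG) should increase
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='max', factor=0.2, patience=2)
    else:
        # multiplier of the base LR per epoch: 1.0 -> 0.1 (after epoch 7) -> 0.01 (after epoch 10)
        sched = torch.optim.lr_scheduler.LambdaLR(
            opt, lambda epoch: 0.01 if epoch >= 10 else (0.1 if epoch >= 7 else 1.0))
    return opt, sched
```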
We also experimented with different contextual representations, including BERT (Devlin et al., 2019); however, we did not observe any improvement, similar to the observation by Tan and Bansal (2019).
For the results on the validation set, only the training split is used. To report results on the test-std set, both the training and val sets are used for training. For curriculum fine-tuning we use a multi-class cross-entropy loss weighted by the relevance scores. All our MCA modules have 6 layers and 8 heads, which we determined via a hyperparameter search. Table 7 shows more details.

Figure 1:
Visual Dialog task according to Das et al. (2017), posed as a ranking problem, where for the current question (blue) the agent ranks a list of 100 candidate answers (yellow). Relevance weights for each candidate were collected via crowd-sourcing. The previous dialog history (red), together with the caption (green), forms the contextual information for the current turn.

Figure 3:
All models incorporating dialog history described in Section 2.2: MCA-I-HConcQ is a naive approach of concatenating the raw dialog history to the question while keeping the rest of the architecture the same as MCA-I. MCA-H, on the other hand, treats this task as purely conversational (not visual) dialog, applying the MCA module to the history instead of the image.

RvA: We reproduce the results of Niu et al. (2019)'s Recursive Visual Attention model (RvA), which won the 2019 VisDial challenge. Their model browses the dialog history and updates the visual attention recursively until the model has sufficient confidence to perform visual co-reference resolution. We use their single model's open-source implementation and apply our fine-tuning procedure; results on the val set are reported in Table 1.

Figure 4:
Top-5 ranked predictions (relevance in parentheses) of MCA-I-H and MCA-I-VGH after both the sparse annotation and curriculum fine-tuning phases. R_GT denotes the rank of the Ground Truth (GT) answer predicted by the model. We also calculate the NDCG of the rankings for the current question turn. N_Rel denotes the number of candidate answer options (out of 100) with non-zero relevance (dense annotations). Here ♣ and ♦ represent predictions after the sparse annotation and curriculum fine-tuning phases, respectively.

Table 4:
Results of the crowd-sourcing study to understand whether humans require dialog history to answer the question. 'VQA turns' indicate that humans could potentially answer correctly without having access to the previous conversation, while 'History required' are the cases identified as requiring dialog context. We also identified cases requiring world knowledge/common sense, guessing, and questions not relevant to the image.

Table 6:
Mapping of the human annotation categories to the actual text shown to the user. For example, the 'History required' category corresponds to the options: "I want to know what was discussed before to answer confidently."; "Cannot answer with just the question and image."; "Need more information (context) from previous conversation."