Video Question Answering with Phrases via Semantic Roles

Video Question Answering (VidQA) evaluation metrics have been limited to a single-word answer or selecting a phrase from a fixed set of phrases. These metrics limit the VidQA models’ application scenario. In this work, we leverage semantic roles derived from video descriptions to mask out certain phrases, to introduce VidQAP which poses VidQA as a fill-in-the-phrase task. To enable evaluation of answer phrases, we compute the relative improvement of the predicted answer compared to an empty string. To reduce the influence of language bias in VidQA datasets, we retrieve a video having a different answer for the same question. To facilitate research, we construct ActivityNet-SRL-QA and Charades-SRL-QA and benchmark them by extending three vision-language models. We perform extensive analysis and ablative studies to guide future work. Code and data are public.


Introduction
Given a video, Video Question Answering (VidQA) requires a model to provide an answer to a video related question. However, existing works treat VidQA as an N-way (N ∼1k) classification task across a fixed set of phrases. Models trained under such formulations are strictly restricted in their recall rate, generalize poorly, and have severe limitations for end-user applications.
In this work, we introduce Video Question Answering with Phrases (VidQAP) which treats VidQA as a fill-in-the-phrase task. Instead of a question, the input to VidQAP consists of a query expression with a query-token. Then, given a video, VidQAP requires replacing query-token with a sequence of generated words. To generate a query, we leverage video descriptions and assign semantic roles to each phrase in these descriptions. Replacing a particular semantic-role with a query token produces a query-answer pair. We illustrate this in Figure 1 (details in Section 3.1). While free-form answer generation is highly desirable, evaluating them is non-trivial due to two main challenges. First, existing language generation metrics like BLEU (Papineni et al., 2002) or BERTScore (Zhang* et al., 2020) operate on sentences rather than phrases. When applied to short phrases, in the absence of context, even close matches like "A person" and "The man" would be falsely rejected due to no n-gram overlap or poor contextual embeddings. Second, natural language questions often have strong language priors making it difficult to ascertain if the model retrieved information from the video.
To propose a reasonable evaluation metric, we observe that a predicted answer phrase can be substituted back into the query expression to form a complete sentence. With this key insight, we propose relative scoring: using the description as the reference sentence, we compute a metric twice, once with the query-token replaced by the predicted answer phrase and once with it replaced by an empty string. The model's performance is measured by the relative improvement of the predicted answer over the empty string. In particular, substituting the answer phrase into the query expression allows computing the contextual embeddings required by BERTScore.
To mitigate the language-bias issue, we emulate the procedure proposed by (Goyal et al., 2017) where for a given question, another image (or video in our case) is retrieved which has a different answer for the same question. To retrieve such a video, we use a contrastive sampling method (Sadhu et al., 2020) over the dataset by comparing only the lemmatized nouns and verbs within the semantic roles (SRLs). We then propose contrastive scoring to combine the scores of the two answer phrases obtained from the contrastive samples (details on evaluation in Section 3.2).
To investigate VidQAP, we extend three vision-language models, namely Bottom-Up-Top-Down (Anderson et al., 2018), VOGNet (Sadhu et al., 2020) and a Multi-Modal Transformer, by replacing their classification heads with a Transformer (Vaswani et al., 2017) based language decoder. To facilitate research on VidQAP, we construct two datasets, ActivityNet-SRL-QA (ASRL-QA) and Charades-SRL-QA, and provide a thorough analysis of extended models to serve as a benchmark for future research (details on model framework in Section 3.3 and dataset creation in Section 4.1).
Our experiments validate the merits of moving away from N-way classification, and further show that even among sequence generation models there exists a large disparity in performance across semantic-roles (i.e. queries for some roles can be answered very easily compared to other roles). Moreover, certain roles hardly benefit from vision-language models, suggesting room for improvement. Finally, we investigate the effects of relative scoring and contrastive scoring for VidQAP with respect to BertScore.
Our contributions in this work are two-fold: (i) we introduce VidQAP and propose a systematic evaluation protocol to leverage state-of-art language generation metrics and reduce language bias; (ii) we provide extensive analysis and contribute a benchmark on two datasets evaluated using three vision-language models. Our code and dataset are publicly available.

Related Works
Question Answering in Images: has received extensive attention in part due to its end-user applicability. Key to its success has been the availability of large-scale curated datasets like VQA v2.0 (Goyal et al., 2017) for visual question answering and GQA (Hudson and Manning, 2019) for relational reasoning. To address the strong language priors, the datasets are balanced by retrieving images which, given the same question, lead to a different answer. However, these procedures cannot be extended to VidQA since crowd-sourcing to retrieve videos is expensive and there exist no scene-graph annotations for videos. In this work, we perform the retrieval using lemmatized nouns and verbs of the semantic role labels obtained from video descriptions to balance the dataset.
Question Answering in Videos: has garnered less attention compared to ImageQA. A major bottleneck is that there is no principled approach to curating a VidQA dataset which reflects the diversity observed in ImageQA datasets. For instance, naively crowd-sourcing video datasets leads to questions about color or number, which are the same as in ImageQA datasets and don't reflect any spatio-temporal structure. To address this issue, TGIF-QA (Jang et al., 2017) and ActivityNet-QA (Yu et al., 2019) use a question-template to enforce questions requiring spatio-temporal reasoning but forgo the question diversity. An orthogonal approach is to combine VidQA with movie scripts (Tapaswi et al., 2016) or subtitles. However, this severely restricts the domain of videos. Moreover, recent works have noted that language-only baselines often outperform vision-language baselines (Jasani et al., 2019; Zellers et al., 2019). A separate line of related research has focused on scene-aware dialogue (Alamri et al., 2019). Instead of a single annotator providing both questions and answers, the annotation procedure follows a two-player game setup with one player asking a question and the other player answering, with the roles switching after each turn. However, the evaluation method utilizes recall metrics which require the set of phrases to be known a priori. As a result, it doesn't strictly measure the performance of free-form generation but rather how well the ground-truth answer is ranked given a competing set of phrases, which is analogous to multiple-choice questions.
Automatic Question Generation: Due to the above limitations, the dominant approach to creating large-scale VidQA datasets has been automatic question generation from existing video descriptions, which can be easily crowd-sourced. Our proposed formulation of using SRLs to generate query-expressions falls in this category. Prior works include VideoQA (Zeng et al., 2017), MSR-VTT-QA and MSVD-QA (Xu et al., 2017), which use a rule-based question generator (Heilman and Smith, 2009) to convert descriptions to questions, and Movie-Fill-in-the-Blanks (Maharaj et al., 2017), which masks out at most one word, which could be a noun, adjective or verb, in a sentence. In comparison, our method poses VidQAP as fill-in-the-blanks but with phrases, explicitly asks questions about actions, and the answer phrases are not constrained to a fixed set. As a result of this increased space of phrases, methods on existing datasets cannot be directly applied to VidQAP. To enable further research, we contribute two datasets, ASRL-QA and Charades-SRL-QA. In Table 1 we compare these with existing VidQA datasets.
SRL in Vision: has been explored in the context of human object interaction (Gupta and Malik, 2015), situation recognition (Yatskar et al., 2016), and multi-media extraction (Li et al., 2020). Most related to ours is the usage of SRLs for grounding (Silberer and Pinkal, 2018) in images and videos (Sadhu et al., 2020). Our work builds on (Sadhu et al., 2020) in using SRLs on video descriptions, however, our focus is not on grounding. Instead, we use SRLs primarily as a query generation tool and use the argument as a question directive.

Design Considerations for VidQAP
The VidQAP task is conceptually simple: given a video and a query expression with a query-token, a model should output an answer phrase that best replaces the query-token. This leads to three main design considerations: (i) how to generate a query-expression from existing resources (Section 3.1); (ii) how to evaluate the answer phrases returned by a model (Section 3.2); (iii) what modeling framework choices enable VidQAP (Section 3.3).

Using SRLs to Generate Queries for VidQAP
We first briefly describe semantic-role labels (SRLs). Then we detail how SRLs are used to create VidQAP queries.
Query Generation Using SRLs: Semantic Role Labels (SRLs) provide a high-level label to entities extracted from a sentence in the form of who (ARG0), did what (V) to whom (ARG1) (Strubell et al., 2018). Other roles such as to whom / using what (ARG2) and where (LOC) are also common. As a pre-processing step, we assign SRLs to video descriptions using a state-of-art SRL labeler (Shi and Lin, 2019). A particular description could consist of multiple verbs, in which case we consider each verb and its associated SRLs independently. For a particular semantic-role, we substitute the corresponding phrase with a query token to generate the query expression. The replaced phrase is the corresponding answer. Using this method we are able to generate multiple queries from a single description. An added merit of using SRLs is that query phrases are centered around "verb-phrases", which are highly relevant to the video content.
Generating queries using every SRL is not beneficial as some SRLs are more concerned with phrasing of the language rather than the video. For instance, in the phrase "Players are running around on the field", if we mask out the word "around" (DIR), it can be answered without looking at the video. To address the above issue, we confine our description phrases to a fixed set of semantic-roles namely: ARG0, ARG1, V, ARG2, ARGM-LOC. Only those phrases which belong to the above set of SRLs may appear in the query-expression or as an answer phrase. We further remove phrases which have only two arguments as these are too ambiguous to fill. Figure 2 illustrates these steps.
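The query-generation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `generate_queries`, the `(role, phrase)` input format, and the token spelling `<Q-ROLE>` are hypothetical, and the SRL spans are assumed to have already been produced by an external labeler.

```python
from typing import Dict, List, Tuple

# Roles kept for querying, as listed in the text.
ALLOWED_ROLES = ("ARG0", "V", "ARG1", "ARG2", "ARGM-LOC")


def generate_queries(
    srl_spans: List[Tuple[str, str]], min_roles: int = 3
) -> List[Dict[str, str]]:
    """Build query/answer pairs from the SRL spans of a single verb.

    srl_spans is an ordered list of (role, phrase) pairs (hypothetical format).
    """
    # Drop roles outside the allowed set, e.g. a direction such as "around".
    kept = [(role, phrase) for role, phrase in srl_spans if role in ALLOWED_ROLES]
    # With fewer than min_roles phrases the query is too ambiguous to fill.
    if len(kept) < min_roles:
        return []
    queries = []
    for i, (role, phrase) in enumerate(kept):
        # Replace exactly one phrase with its query token; keep the others.
        tokens = [f"<Q-{r}>" if j == i else p for j, (r, p) in enumerate(kept)]
        queries.append({"role": role, "query": " ".join(tokens), "answer": phrase})
    return queries


spans = [("ARG0", "A person"), ("V", "moves"), ("ARG1", "exercise equipment")]
for q in generate_queries(spans):
    print(q["query"], "->", q["answer"])
```

For the three-role description in the example, one query is produced per role, so a single description yields several query-answer pairs.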
While using a slot for each SRL could potentially limit the vocabulary used in each slot (for instance, the vocabulary set for <Q−ARG1> could be limited to a small number of objects), empirically we don't find this to be the case (see Appendix A.3 for detailed statistics). As a result, VidQAP is no simpler than the VidQA task.
We also remark that generating queries need not be strictly limited to masking out a single SRL and one could easily mask multiple SRLs in the same description. However, we find two problems: first, for many cases, the output of masking multiple SRLs becomes exceedingly similar to the video description task; second, using contrastive scoring (described in Section 3.2) for multiple SRLs becomes considerably more involved. As a result, in this work, we focus on using a single SRL and keep the generalization to include multiple SRL queries for future work.

[Figure 3: Illustration of the relative metric.
Query Expression: A person <Q-V> exercise equipment.
Reference (Ground Truth): A person moves exercise equipment.
Hypothesis (Prediction): A person lifts exercise equipment.
Baseline (Empty String): A person exercise equipment.
"moves" is the ground-truth answer and "lifts" is a model's prediction. The relative metric measures the improvement from using the model's prediction as compared to an empty string.]

[Figure: Contrastive samples sharing one query expression.
Query Expression: A person holding <Q-ARG1> in their hands
Answer: a dog
Answer: a hair dryer]

Evaluating Answer Phrases
A key challenge in VidQAP is the lack of any standard protocol to evaluate free-form generated phrases. A simple way is to adopt metrics like BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), and CIDEr, which are already used for captioning in images and videos. However, these metrics suffer from limited generalization: BLEU, ROUGE, and CIDEr require exact n-gram matches. While this is fine for captioning, where longer phrases average out errors, answer phrases are typically much shorter than a complete sentence. This leads to many near-correct answers receiving very low scores. This issue is resolved to a certain extent for captioning by learned metrics like BERTScore (Zhang* et al., 2020), which utilize contextual embeddings obtained from large pretrained models like BERT (Devlin et al., 2019) and RoBERTa. However, answer phrases are usually short and don't provide meaningful contextual embeddings. In the extreme case when the answer is a single word, for instance when the query is about a verb, these embeddings turn out to be very noisy, leading to a large number of false positives.
Relative Scoring: To enable usage of contextual embeddings, we propose evaluating the generated answer phrase inside the query expression and measuring its improvement over an empty string, with the ground-truth phrase as reference. We denote the input query expression as $Q$, the ground-truth answer as $A_{gt}$, and the predicted answer as $A_{pred}$. Let $Q(X)$ denote $Q$ with the query token replaced by $X$. Then for a given metric $B$, we compute the relative metric $B_r$ as (see Figure 3 for illustration)

$$\text{Ref} = Q(A_{gt}), \quad \text{Hyp} = Q(A_{pred}), \quad \text{Base} = Q(\text{``''})$$
$$B_r(A_{gt}, A_{pred}) = \frac{B(\text{Ref}, \text{Hyp}) - B(\text{Ref}, \text{Base})}{B(\text{Ref}, \text{Ref}) - B(\text{Ref}, \text{Base})} \tag{1}$$

Note that $B(\text{Ref}, \text{Ref}) = 1$ for BLEU, METEOR, ROUGE, BERTScore but not for CIDEr. The empty-string baseline in Eqn 1 could be replaced with predictions from any model trained for this task. In this work, we restrict ourselves to the empty-string baseline due to two desirable properties: its computational simplicity and its being agnostic to models and datasets.
We further observe that Eqn 1 is very similar to the re-scaling proposed in BERTScore. However, in BERTScore, re-scaling aims at making the score more readable and doesn't change the relative ranking of the hypotheses. In our case, Eqn 1 plays two roles: first, it allows computing the contextual embeddings because the answers are now embedded inside a complete sentence; second, while the ranking is not affected for a particular query, the score would be different across queries and hence affect the overall relative metric.
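To make relative scoring concrete, the following Python sketch computes a relative score for the Figure 3 example. It assumes Eqn 1 takes the form (B(Ref, Hyp) − B(Ref, Base)) / (B(Ref, Ref) − B(Ref, Base)), which matches the verbal description but should be treated as an assumption. The metric `unigram_f1` is a toy stand-in chosen so the example runs without external libraries; it is not BLEU or BERTScore, and all function names are hypothetical.

```python
from collections import Counter


def unigram_f1(reference: str, hypothesis: str) -> float:
    """Toy sentence-level metric B: F1 over lower-cased unigram overlap."""
    ref = Counter(reference.lower().split())
    hyp = Counter(hypothesis.lower().split())
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def fill(query: str, token: str, phrase: str) -> str:
    """Replace the query token with a phrase and normalise whitespace."""
    return " ".join(query.replace(token, phrase).split())


def relative_score(metric, query, token, answer_gt, answer_pred) -> float:
    """Improvement of the prediction over an empty string, scaled by the reference."""
    ref = fill(query, token, answer_gt)
    hyp = fill(query, token, answer_pred)
    base = fill(query, token, "")
    b_base = metric(ref, base)
    denom = metric(ref, ref) - b_base
    if denom == 0:
        return 0.0  # degenerate case: the empty string already matches the reference
    return (metric(ref, hyp) - b_base) / denom


query = "A person <Q-V> exercise equipment."
print(relative_score(unigram_f1, query, "<Q-V>", "moves", "moves"))  # exact match
print(relative_score(unigram_f1, query, "<Q-V>", "moves", "lifts"))  # near miss
```

With this exact-match toy metric the near miss "lifts" scores below zero, i.e. worse than the empty string, which illustrates why n-gram metrics reject near-correct answers and why an embedding-based metric is preferable as B.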
Contrastive Scoring: Visual Question Answering suffers from heavy language priors, and as a result, it is often difficult to determine whether the image or video played a role in a correct prediction. For images, Goyal et al. (2017) resolved this by balancing the dataset: they crowd-sourced the task of collecting an image that has a different answer for the same question. However, such a crowd-sourcing method is difficult to extend to videos since searching for videos requires a much longer time. This is further complicated by accepting answer phrases rather than single words.
We simulate the balancing process using the contrastive sampling method used in (Sadhu et al., 2020). Specifically, for a given video-query-answer tuple $(V_1, Q_1, A_1)$ we retrieve another video-query-answer tuple $(V_2, Q_2, A_2)$ which shares the same semantic role structure as well as the same lemmatized nouns and verbs for the question, but has a different lemmatized noun for the answer. At test time, the model evaluates each question separately, but the evaluation function requires both answers to be correct. Since our answers comprise phrases, the notion of correctness is not absolute (unlike, say, an accuracy metric). Thus, we set a threshold below which the answer is deemed incorrect.
Mathematically, let $S_i = B_r(A_{gt}^{i}, A_{pred}^{i})$ be the relative score for sample $i$, and let sample $j$ be a contrastive example for sample $i$. Then the contrastive score ($CS_i$) for sample $i$ at a threshold $T_{CS}$ would be

$$CS_i = \max\left(S_i \cdot \mathbb{1}[S_j > T_{CS}],\ 0\right) \tag{2}$$

Here $\mathbb{1}[\cdot]$ is the indicator variable which is 1 if the expression within brackets is true, otherwise 0. The max operator ensures the scores don't become negative. For our experiments, we use $T_{CS}=0$, which requires that the answer for the contrastive sample be better than an empty string.
We further use the contrastive samples to compute a consistency metric. For sample $i$, the consistency $Cons_i$ for a threshold $T_{cons}$ is given by

$$Cons_i = \mathbb{1}\left[(S_i - T_{cons}) \cdot (S_j - T_{cons}) > 0\right] \tag{3}$$

As such, Consistency requires the model to be either correct or incorrect for both the original and the contrastive sample.
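The contrastive score and consistency can be sketched as below. The exact forms used here, CS_i = max(S_i · 1[S_j > T_CS], 0) and Cons_i = 1[(S_i − T_cons)(S_j − T_cons) > 0], are assumptions derived from the verbal description (an indicator on the contrastive sample, a max that clips at zero, and agreement of both samples relative to a threshold); the function names are hypothetical.

```python
from typing import List, Tuple


def contrastive_score(s_i: float, s_j: float, t_cs: float = 0.0) -> float:
    """Score for sample i, counted only if contrastive sample j clears t_cs."""
    indicator = 1.0 if s_j > t_cs else 0.0
    # The max keeps the score from becoming negative.
    return max(s_i * indicator, 0.0)


def consistency(s_i: float, s_j: float, t_cons: float = 0.1) -> int:
    """1 if both samples fall on the same side of t_cons, otherwise 0."""
    return int((s_i - t_cons) * (s_j - t_cons) > 0)


def dataset_scores(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Average contrastive score and consistency over (S_i, S_j) pairs."""
    cs = [contrastive_score(s_i, s_j) for s_i, s_j in pairs]
    cons = [consistency(s_i, s_j) for s_i, s_j in pairs]
    return sum(cs) / len(cs), sum(cons) / len(cons)


# Relative scores (S_i, S_j) for three hypothetical sample pairs.
print(dataset_scores([(0.6, 0.3), (0.6, -0.2), (-0.4, 0.5)]))
```

Because negative relative scores are clipped to zero, the averaged contrastive score can exceed the averaged relative score when many predictions are worse than the empty string.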
Combined Metric at a Glance: Given metric B, for a given sample i and contrastive sample j:
1. Compute relative metric (Eqn 1) for i, j
2. Compute contrastive score (Eqn 2)
3. Optionally compute Consistency (Eqn 3)
We use the prefix "R-", such as in R-B, to denote that both relative scoring and contrastive scoring are being computed. We report Consistency for BertScore with $T_{cons}=0.1$. We note that, by construction, the relative scoring (Eqn 1) is positively correlated with human judgment: the closer the hypothesis is to the reference, the higher the score. The contrastive scoring is a metric used to prevent the model from guessing the correct answer by exploiting language biases, so that it instead uses the video to give a suitable prediction. Since humans don't have the ability to exploit such biases, it is difficult to relate contrastive scoring to human evaluation.

Model Framework
Models for VidQAP require a language encoder to encode the question, a visual encoder to extract video features, a multi-modal module to jointly learn over vision-language space and a decoder to generate a sequence of words.
Inputs include the query expression $\{w_i\}_{i=1}^{L}$ ($L$ is the number of words), video segment features for $F_1$ frames and optionally $k$ RCNN features for $F_2$ frames. In either case, frames are sampled uniformly from the video segment time-span. While the models differ in their encoding scheme, our language decoder model (Transformer based) used to generate the output answer phrase is kept the same across all models with the QAP suffix.
Lang-QAP: is a language-only (video-blind) model using only the query input. It uses a Transformer based encoder to encode the query into $\hat{q} \in \mathbb{R}^{L \times d}$. The decoder subsequently uses the last layer output of the encoder (Figure 5(a)).
BUTD-QAP: Bottom-Up-Top-Down (Anderson et al., 2018) is a popular approach for image question answering as well as captioning. It first computes attention between the question and the RCNN visual features to generate an attended visual feature, which is then used with the question to produce an output answer. Here, we replace the RCNN features with the segment features ($v \in \mathbb{R}^{F_1 \times d}$). We can also include RCNN features by projecting them to the same dimension as the segment features and then concatenating them along the frame-axis ($v \in \mathbb{R}^{(F_1 + F_2 \cdot k) \times d}$). For language features, we use the [CLS] token representation from the last layer of the language encoder used in Lang-QAP.
The output computed from the language and visual features ($m \in \mathbb{R}^{d}$) is passed to the decoder (Figure 5(b)).
VOG-QAP: VOGNet (Sadhu et al., 2020) has been proposed for grounding objects in videos given a natural language query. Following the architecture, we first derive phrase encodings, each of which corresponds to a single SRL, i.e. $q \in \mathbb{R}^{S \times d}$ ($S$ is the number of semantic roles). These phrase features are concatenated with the visual features (same as those used in BUTD-QAP, i.e. $v$) to get multi-modal features $m[l, i] = [v_i \,\|\, q_l]$, which are then reshaped to get $m \in \mathbb{R}^{S \cdot F \times d}$. These multi-modal features are subsequently passed to the decoder to generate the output sequence (Figure 5(c)).
MTX-QAP: Recently, transformer models pretrained on large-scale paired image-text data have become popular. Even in the absence of pretraining, such architectures can achieve competitive performance (Lu et al., 2019). In the context of videos, ActBert (Zhu and Yang, 2020) has been proposed. We create a similar architecture to ActBert but we replace their proposed Tangled-Transformer with a vanilla Transformer. Specifically, we jointly encode the language and visual features in a single transformer and feed the output to the decoder (Figure 5(d)).
LangCL and MTxCL: Apart from QAP models, we also consider their phrase classification counterparts where the decoder is replaced with an N-way classifier (a two-layered MLP in our case) across a fixed set of phrases. For our experiments, we used N=1k phrases for LangCL and N∈{1k, 10k} for MTxCL.

Experiments
We briefly discuss the dataset creation process (Section 4.1), followed by experimental setup (Section 4.2). We then summarize our results (Section 4.3) and discuss key-findings. We provide implementation details, qualitative visualizations of our dataset, metrics and trained models in the appendix.
There are three key steps to create QA datasets from descriptions: (i) assign semantic-roles to the descriptions, (ii) perform co-reference resolution so that the questions are self-contained, (iii) obtain lemmatized nouns and verbs to perform contrastive sampling. For semantic-role labeling, we use (Shi and Lin, 2019). For co-reference resolution, we use the co-reference resolution model provided by the allennlp library (Gardner et al., 2017).
Since Charades primarily involves videos with a single person, we discard questions involving ARG0. We limit ourselves to a single description per video to avoid repetitive questions. We re-use the same train split for both datasets. For ASRL-QA, the test set of ActivityNet is not public, and Charades only has a test set but no official validation set. Thus, we split the existing validation set by video names and create the validation and test sets. For both validation and test splits, we remove those questions for which no contrastive sample was found, as this indicates data-biases.

Experimental Setup
Dataset Statistics: ASRL-QA has 35.7k videos and 162k queries split into train, validation and test sets with 30.3k, 2.7k, 2.7k videos and 147k, 7.5k, 7.5k queries. We observe that the validation and test sets are proportionately smaller compared to their respective train sets. This is because only queries with a corresponding contrastive sample are included, while no such filtering is done for the train set (∼95k queries in the train set have a contrastive pair). Charades-SRL-QA contains 9.4k videos and 71.7k queries split across train, validation and test sets with 7.7k, 0.8k, 0.8k videos and 59.3k, 6.1k, 6.2k queries. Despite its smaller size, the size of the validation and test sets of Charades-SRL-QA is comparable to ASRL-QA, as Charades is curated with the goal of diversifying subject, verb, object tuples. Supplementary material provides further details on the dataset statistics and visualizations.
Footnote 4 (co-reference resolution model): https://demo.allennlp.org/coreference-resolution
Evaluation Metrics: As discussed in Section 3.2, we report the combined metric (i.e. metrics prefixed with "R-") for the commonly used generation metrics: BLEU, METEOR, ROUGE, CIDEr and BertScore (implementations from (Chen et al., 2015;Zhang* et al., 2020)). For BLEU, we report the sentence level BLEU-2. All reported results are test set results using the model which performs best on validation set.

Results and Discussions
Table 2 compares performance of the proposed VidQAP models with N-way classification baselines (denoted with suffix "CL") on ASRL-QA and Charades-SRL-QA.
Comparing Metrics: It is evident that compared to other metrics, R-BertScore shows a higher relative improvement. This is because BertScore allows soft-matches by utilizing contextual embeddings obtained from a pre-trained BERT (Devlin et al., 2019) or RoBERTa model.
Comparison Across Datasets: We find that performance on both datasets follows very similar trends across all metrics. Charades-SRL-QA has slightly higher scores compared to ASRL-QA, likely because it has less data variation (Charades is mostly confined to indoor videos), suggesting findings on either dataset would transfer.
Comparison within N-way Classification: We notice that when a fixed set of 1k phrases is used, classification models show very limited performance. Allowing 10k phrases gives a significant improvement in performance on Charades-SRL-QA (12 points on R-BS); however, this doesn't translate to ASRL-QA. This is because ASRL-QA contains many more probable phrases than Charades-SRL-QA (29K compared to 8K) in their respective training sets. We also notice that increasing the size of the phrase vocabulary coincides with decreasing consistency.
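The recall ceiling of an N-way classifier can be illustrated with a small Python sketch: a classifier restricted to the N most frequent training phrases can never produce a test answer outside that set. The function name `topn_coverage` and the toy phrase lists are hypothetical and are not taken from the datasets.

```python
from collections import Counter
from typing import List


def topn_coverage(train_answers: List[str], test_answers: List[str], n: int) -> float:
    """Fraction of test answers that appear among the n most frequent train phrases.

    This is an upper bound on the exact-match recall of an n-way classifier.
    """
    vocab = {phrase for phrase, _ in Counter(train_answers).most_common(n)}
    if not test_answers:
        return 0.0
    hits = sum(1 for answer in test_answers if answer in vocab)
    return hits / len(test_answers)


train = ["a man"] * 3 + ["a dog"] * 2 + ["a ball"]
test = ["a man", "a dog", "a hair dryer", "a ball"]
for n in (1, 2, 3):
    print(n, topn_coverage(train, test, n))
```

In the toy data the phrase "a hair dryer" never occurs in training, so coverage stays below 1 regardless of N, whereas a decoder that composes phrases word by word is not bound by this ceiling.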
Comparing Free-form Answer Generation (QAP) with N-way Classification (CL): We investigate the advantages of using a decoder network to generate phrases compared to an N-way classification over a fixed set of phrases (denoted with the suffix "CL" and the number of phrases used in parentheses). Table 2 shows that both Lang-QAP and MTX-QAP outperform their classification counterparts, namely Lang-CL and MTX-CL, on both datasets. This implies that free-form generation is not limited to simply generating the most frequently appearing phrases in the training set, thereby showing its effectiveness.
Comparison Across Models: We find that multi-modal models outperform the language-only baseline. However, the improvement over the language baseline is small. The reason for the small gap is elucidated in Table 3, where we report R-BertScore for every considered SRL. We find a large disparity in performance depending on the SRL. Most strikingly, multi-modal models perform worse than the language-only model on ARG0 and V. For ARG0, the strong performance of Lang-QAP arises because most of the time the agent who causes an action is a human. Therefore answer phrases consisting simply of "A man" or "A woman" or "A person" lead to reasonable performance. This additionally suggests that grounding "who" is performing the action remains non-trivial.
The more surprising result is the strong performance of Lang-QAP on V, which is consistent across both datasets despite using contrastive sampling. There are two likely causes. First, the distinction between verbs is not as strict as between object nouns, i.e. even similar verbs are classified as separate verbs, diminishing the returns of contrastive sampling. For instance, "jumping" and "hopping" have different lemmas and are thus considered distinct verbs, but R-BS would treat them as similar even if the specific action would be classified as "jumping" rather than "hopping". Second, SRLs such as ARG1 confine the set of possible verbs. For instance, if the object is "glass", only a limited set of verbs such as "drink" and "hold" are probable.
On the remaining arguments namely ARG1, ARG2, and LOC, multi-modal models show a steady improvement over language-only baseline ranging from 1−10%. However, the performance in absolute terms remains very low. As such, our proposed task VidQAP remains extremely challenging for current multi-modal models.
Evaluation Metric Scores: In Table 4 we record the BertScore computation in three parts: directly computing over the answer phrases, performing relative scoring, finally performing contrastive scoring with different thresholds.
We observe that for V, naive computation leads to absurdly high scores. This is because verbs consist of a single word which means the embeddings are not contextual. This is remedied by relative scoring and is further controlled by combining with contrastive sampling.
Further note that relative scoring operates differently based on the SRLs. For instance, it increases the score for ARG0 and ARG1 where the answers more often paraphrased the ground-truth questions while for ARG2 and LOC, it decreases the score due to incorrect matches. While contrastive scoring is aimed at reducing language-only bias and as such should always reduce the relative score, we observe increased score in ARG2 for both Lang-QAP and MTX-QAP. This is caused by the max function which restricts the lower-limit to be 0.
Effect of Region Boxes: As noted earlier, the visual features can also include region features extracted from an object detector like FasterRCNN (Ren et al., 2015). In Table 5 we record the effect of including regional features. In particular, we use the GT5 setting used in (Sadhu et al., 2020) where 5 region proposals are used from 10 frames uniformly sampled from the video segment. Interestingly, MTX-QAP under-performs both BUTD-QAP and VOG-QAP on ARG0. A possible reason is that the transformer is unable to effectively reason over both language and vision over such a large range of inputs.

Conclusion
In this work, we introduce Video Question Answering with Phrases (VidQAP), where we pose VidQA as a fill-in-the-phrase task. Given a video and query expression, a model needs to compose a sequence of words to answer. We then propose a method to leverage semantic roles from video descriptions to generate query expressions and outline a robust evaluation protocol. This involves computing the relative improvement of the predicted answer compared to an empty string, followed by a contrastive sampling stage which reduces language-only biases. We then contribute two datasets, ASRL-QA and Charades-SRL-QA, to facilitate further research on VidQAP and benchmark them with three vision-language models extended for our proposed task.

Ethics Statement
In this work, we propose an extension to the existing video question answering framework to include free-form answers and suggest how to evaluate such a task.
Direct Application (Positive): A direct application of our task would be to enrich existing descriptions obtained from video captioning models which could lead to better video retrieval results. For instance, one could query about what tool to use in order to cut a piece of cardboard by querying "A person cutting a piece of cardboard <Q-ARG2>".
Direct Application (Negative): Caution must be taken in directly applying models trained on descriptions without properly balancing the data-distributions, as it is possible that hidden data-biases are amplified. As an example, ASRL-QA has many videos involving men throwing shot puts. As a result, a model could learn this biased correlation and whenever queried "who" (<Q-ARG0> throws a shot put) it would always produce the answer "man" even if the video clearly shows a "woman".
Broader Societal Impacts (Positive): Question answering is an excellent tool for diagnosing a model's understanding due to its high interactivity. Our proposed formulation takes this a step forward with answer phrases and can in-turn facilitate human-computer interactions. Our proposed model can be extended to down-stream tasks such as retrieving a video or retrieving a part of the video given a question or query.
Broader Societal Impacts (Negative): Since our method is agnostic to the end user case, it can be re-purposed to extract out sensitive information and be a threat to privacy.
• ARGM-LOC or simply LOC denotes the place or location where the verb takes place. For instance, in "A person is cutting a vegetable on a plate", "on a plate" is the LOC.
2. Query-Generation:
• For each verb-role set within a description (each description can have multiple verbs), consider the role set ARG0, ARG1, V, ARG2, LOC for ASRL-QA and ARG1, V, ARG2, LOC for Charades-SRL-QA.
• If there are at least 3 verb-roles for the given verb, for each SRL replace it with a query token (with <Q−{R}> where R is the role). This forms one query. Repeat for all SRLs in the considered set.
• The minimum of 3 verb-roles is required to avoid ambiguity in the query. Limiting the argument role-set helps in generating queries less likely to have strong language-priors (though as seen in qualitative examples, some priors are still present).
• After the queries are generated, create the sets of lemmatized verbs and nouns for each query, and store the video segment ids in a dictionary. This is similar to the process used in (Sadhu et al., 2020), with the difference that we additionally have query-tokens.
• For each query, use the dictionary to sample a set of video segment ids that share the same semantic role structure but have a different answer for the query-token. These are used for matching when computing the contrastive score on the validation and test sets.
3. Creating Train/Test Splits:
• Keep the training set for each dataset the same.
• For validation and testing, we split the dataset based on the video ids (half of the video ids are assigned to validation and half to testing). The queries are then split according to the video ids.
• Note that contrastive sampling is done before the validation/test split, so validation and test ids are each used when computing the other's contrastive score. This is similar to the setting used in (Sadhu et al., 2020), as the total number of videos available for validation and testing is insufficient for contrastive sampling.
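The query-generation and contrastive-lookup steps above can be sketched as follows. This is a minimal illustration, not the released code: the function names, the dictionary layout, and the choice to key the lookup on the exact query string (rather than on lemmatized verb and noun sets) are all assumptions made for brevity. It also assumes that the 3-role minimum is counted over the considered role set and that phrases of other roles stay in the query text unchanged.

```python
from collections import defaultdict

# Role sets considered per dataset, as listed in the steps above.
ROLE_SETS = {
    "ASRL-QA": {"ARG0", "ARG1", "V", "ARG2", "LOC"},
    "Charades-SRL-QA": {"ARG1", "V", "ARG2", "LOC"},
}
MIN_ROLES = 3  # minimum number of verb-roles needed to form a query


def generate_queries(srl, dataset):
    """srl: ordered list of (role, phrase) pairs for one verb (hypothetical format)."""
    allowed = ROLE_SETS[dataset]
    considered = [i for i, (role, _) in enumerate(srl) if role in allowed]
    if len(considered) < MIN_ROLES:
        return []
    queries = []
    for i in considered:
        role, answer = srl[i]
        # Replace only the i-th phrase with the query token <Q-ROLE>.
        words = [f"<Q-{role}>" if j == i else phrase
                 for j, (_, phrase) in enumerate(srl)]
        queries.append({"query": " ".join(words), "role": role, "answer": answer})
    return queries


def build_index(items):
    """items: iterable of (segment_id, query_dict); groups segments by query text."""
    index = defaultdict(list)
    for segment_id, q in items:
        index[q["query"]].append((segment_id, q["answer"]))
    return index


def contrastive_candidates(index, segment_id, q):
    """Segments with the same query but a different answer for the query token."""
    return [s for s, a in index[q["query"]]
            if s != segment_id and a != q["answer"]]
```

For the description "A person cutting a vegetable on a plate" with roles ARG0, V, ARG1 and LOC, this sketch produces four queries on ASRL-QA, one of which is "A person cutting a vegetable <Q-LOC>" with answer "on a plate".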

A.3 Dataset Statistics
Dataset statistics can be found in Table 1. Lemma distributions are visualized in Figure 1. Overall, we find a slightly skewed distribution of argument roles across the datasets. For instance, ARG0 and ARG1 are much more frequent than ARG2 and LOC. Also, since every SRL needs to have a verb (V), the distribution of the videos is the same as the overall distribution. As shown in Table 1, the vocabularies in both the train and validation/test sets for each argument role (slot) are reasonably large compared to the total vocabulary (e.g., 60% for ARG1) and not too limited. This result is consistent across both datasets.

B Implementation Details
We first report the implementation details for the metrics (Section B.1). Then, we describe the model implementation (Section B.2).
ROUGE: we use ROUGE-L, which is based on the longest common subsequence.
CIDEr: we use the CIDEr-D implementation, which includes idf weighting.
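To make the ROUGE-L computation concrete, the following is a minimal sketch of an LCS-based F-score over whitespace tokens. It is not the implementation used for the reported numbers: the tokenization, lowercasing, and the weighting `beta=1.2` (a value commonly used in caption-evaluation toolkits) are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-score between two phrases (beta is an assumed default)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

Under this sketch an empty prediction scores 0, and phrases with no shared token, such as "The man" and "A person", also score 0.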

B.2 Model Implementation
We report all model implementation details. General Settings: Our code is implemented using PyTorch (Paszke et al., 2019). For the Transformer, we use the implementation provided in FairSeq (Ott et al., 2019). The vocabulary consists of 5k words for ASRL-QA and 3k words for Charades-SRL-QA. The segment features are of dimension 3072 and 512 for ASRL-QA and Charades-SRL-QA respectively, obtained from TSN and S3D (Krishna et al., 2016).
For all cases, we report the output dimension of MLP. Unless otherwise stated, MLP is followed by ReLU activation.
Decoder: The decoder takes an input of size T × 512 (where T refers to the length of the input embedding). Note that for Lang-QAP, T is the same as the sequence length of the query; for BUTD-QAP, T = 1; for VOG-QAP, T is the number of SRLs × the number of segment features; and for MTX-QAP, T is the sequence length of the query + the number of segment features. To generate output sequences, we use standard beam search with a beam size of 2 and a temperature of 1.0.
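As a worked example of the decoder input lengths listed above, the small helper below returns T for each model. The function name and argument names are hypothetical and only restate the arithmetic in the text.

```python
def decoder_input_length(model, query_len, num_srls, num_segments):
    """Return T, the number of 512-d vectors fed to the decoder."""
    if model == "Lang-QAP":
        return query_len                    # one vector per query token
    if model == "BUTD-QAP":
        return 1                            # single fused vector
    if model == "VOG-QAP":
        return num_srls * num_segments      # every SRL paired with every segment
    if model == "MTX-QAP":
        return query_len + num_segments     # tokens and segments in one sequence
    raise ValueError(f"unknown model: {model}")
```

For instance, with a 12-token query, 4 SRLs and 10 segment features, T is 12, 1, 40 and 22 for Lang-QAP, BUTD-QAP, VOG-QAP and MTX-QAP respectively.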
Encoder: Encoder differs based on the specific model. All encoders are transformer based using 8 attention heads and 3 layers unless otherwise mentioned.
Lang-QAP: The language encoder uses 3 encoding layers, with 8 attention heads each. The embedding layer uses a dimension of 512.
BUTD-QAP: We use the same language query and prepend a [CLS] token. The embedding of the [CLS] token serves as the language embedding, and is passed through an MLP of dimension 512. The language encoder is the same as in Lang-QAP. The segment features are passed through an MLP of dimension 512. If proposal features are used, they are passed through a separate MLP of dimension 512. The language embedding (also of dimension 512) is used to compute attention scores with the visual features, and finally to obtain an attended visual feature. This attended visual feature is concatenated with the language embedding along the last axis, and then passed to the decoder.
8 https://github.com/antoine77340/S3D_HowTo100M
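The attention step described for BUTD-QAP can be illustrated with a small dependency-free sketch. The text does not specify the scoring function, so the dot-product scoring with a softmax used here is an assumption, as are the function name and the use of plain Python lists in place of tensors.

```python
import math


def attend_and_fuse(lang, visual):
    """lang: d-dim language embedding; visual: list of d-dim segment features.

    Returns (attention weights, concatenation of lang and attended visual feature).
    """
    # Assumed scoring: dot product between the language embedding and each segment.
    scores = [sum(l * v for l, v in zip(lang, feat)) for feat in visual]
    # Numerically stable softmax over segments.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the segment features.
    attended = [sum(w * feat[k] for w, feat in zip(weights, visual))
                for k in range(len(lang))]
    # Concatenate along the last axis: the result has dimension 2d.
    return weights, lang + attended
```

With 512-dimensional inputs, the fused vector in this sketch has dimension 1024 before it is passed on.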
VOG-QAP: We use the same language encoder, but further use the SRL phrase start and end points for the phrase encoder. The phrase encoder gathers the language embeddings at these start and end points, concatenates them (dimension 512 + 512 = 1024), and applies an MLP of dimension 512. This gives a phrase encoder output of size number of SRLs × 512. The phrase-encoded query is then concatenated with all the segment features and passed through an MLP. Finally, a multi-modal transformer encoder is applied over the phrase-encoded input, and its output is passed to the language decoder.
MTX-QAP: We collate all the language tokens (passed through embedding layer) as well as segment features passed through MLP, to get all features of dimension 512. A transformer based encoder is applied on these features, and the output is passed to the decoder.
Training: We train using the standard cross-entropy loss. The decoder is trained using teacher forcing. All models are trained for 10 epochs with a batch size of 32. On a Titan X, each epoch on ASRL-QA takes around 30-40 minutes. Our training infrastructure included an 8-GPU Titan X machine.
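Teacher forcing with a cross-entropy objective can be sketched as below. This is an illustration only: the begin- and end-of-sequence ids, the function names, and the representation of model outputs as per-step probability lists are assumptions, and the actual training code relies on the PyTorch and FairSeq implementations.

```python
import math


def teacher_forcing_pairs(target_ids, bos_id, eos_id):
    """Shift the ground-truth answer to build decoder inputs and labels."""
    decoder_input = [bos_id] + target_ids   # decoder sees the gold prefix
    labels = target_ids + [eos_id]          # and predicts the next gold token
    return decoder_input, labels


def cross_entropy(step_probs, labels):
    """Mean negative log-likelihood of the labels.

    step_probs[t][v] is the predicted probability of vocabulary id v at step t.
    """
    nll = [-math.log(probs[y]) for probs, y in zip(step_probs, labels)]
    return sum(nll) / len(nll)
```

At each step the decoder is conditioned on the ground-truth previous tokens rather than on its own predictions, which is what distinguishes training from beam-search decoding at inference time.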

C Visualization
We visualize the model outputs on ASRL-QA in Figure 2. For each case, we show the considered input in the first row, and the contrastive sample in the second row. Each row contains 5 frames uniformly sampled from the video segment to be representative of the content observed by the model. For every query, we show the ground-truth answer and the outputs from Lang-QAP, BUTD-QAP, VOG-QAP and MTX-QAP.
Overall, we often find Lang-QAP suggesting very probable answers, but as expected they are not grounded in the video. As a result, it performs poorly on either the original sample or the contrastive sample.