Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes such as movements, manipulations, and transformations of the objects in the scene are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset, "Video-to-Commonsense (V2C)", that contains $\sim9k$ videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally, we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.


Introduction
When humans watch videos they can typically understand and reason about various aspects of the scene beyond the visible objects and actions. This involves understanding that some objects are active agents that not only perform actions and manipulate objects, but are motivated by intentions, have pre-conditions, and that their actions have an effect on the world and on their own mental states. For instance, in analyzing the video clip in Figure 1, humans employ various capabilities such as perception, reasoning, inference, and speculation, not only to come up with a description of the observable sequence of events, but also to reason about latent aspects such as the intention of the group of runners "to win the medal", the effect of being "congratulated at the finish line", and the attribute "athletic".
The above example also illustrates that recognition of objects, actions, and events is often not enough; understanding causal relationships, social interactions, and commonsense aspects behind them provides context and a more semantic interpretation of the video (Gupta et al., 2009). A model that can provide such detailed interpretations facilitates answering inferential questions, such as "Will the player get angry later?". However, existing visual understanding systems are unable to perform such tasks that require speculative reasoning. A critical missing element in complex video understanding is the capability of performing commonsense inference, especially with a generative model. Existing efforts seek to find textual explanations or intentions of human activities as a classification task (Vondrick et al., 2016) or a vision-to-text alignment problem (Zhu et al., 2015). In this paper we propose the Video to Commonsense (V2C) framework to generate visually grounded commonsense descriptions about the underlying event in the video, enriching the factual description provided by a caption. Under this framework a system is expected to generate captions as well as three types of commonsense descriptions (intention, effect, attribute) directly from an input video. The V2C model can also be used as a building block for downstream tasks such as video question answering for questions requiring commonsense. Inspired by Bosselut et al. (2019), our model, the "V2C-Transformer", utilizes: (1) a video encoder to extract global representations of the video, (2) a transformer decoder that generates captions and commonsense descriptions, and (3) a cross-modal self-attention module that exploits joint visual-textual embeddings.

Figure 1: An example from our framework. Caption: "Group of runners get prepared to run a race." Commonsense-enriched caption: "In order to win a medal, a group of runners get prepared to run a race. As a result they are congratulated at the finish line. They are athletic." Commonsense question answering: "What happens next to the runners?" with answers such as "are congratulated at the finish line" and "become tired".

We curate the V2C dataset for training and benchmarking models on this task. We adopt the MSR-VTT video description dataset (Xu et al., 2016) as a source of videos and captions. We first utilize the ATOMIC machine commonsense dataset (Sap et al., 2018) to get a list of candidate commonsense texts (intentions, effects, and attributes), and rank these using a BERT-based (Devlin et al., 2019) model. Since these candidates are retrieved without using the video and may not be accurate, we instruct humans to watch the videos and select, remove, or rewrite the texts retrieved from ATOMIC. The text retrieved from ATOMIC helps our human annotators understand the format of the desired annotations, and also gives them a list of suggestions. The human component in our annotation procedure makes our data visually grounded and relevant, linguistically diverse, and natural.
We additionally explore the use of our V2C-Transformer architecture for an open-ended video question answering task, where the questions are about commonsense aspects of the video. For this, we create a QA addendum of the V2C dataset called V2C-QA. By asking questions about the latent aspects in the video, our models are able to enrich caption generation with three specific types of commonsense knowledge.
Our contributions are summarized below:
1. We formulate the "V2C" task for enriching video captioning by generating descriptions of commonsense aspects.
2. We curate a video dataset annotated with captions and commonsense descriptions.
3. We present our V2C-Transformer architecture that generates relevant commonsense descriptions, and serves as a strong baseline.
4. We pose V2C as a video question answering task and show that it can assist commonsense caption generation.

Video to Commonsense (V2C)
Problem Formulation: Consider a video $V$ consisting of $N_v$ frames, described by a sentence $S$. Our Video-to-Commonsense (V2C) framework can be used for generating commonsense descriptions $C$ under two settings. In the first setting (V2C-Completion), we use ground-truth captions to guide commonsense-enriched caption generation; this task can be viewed as providing supplementary explanations to the caption. In the second setting (V2C-Generation), we first learn to generate captions from videos, $g(V)$, and then use them to generate commonsense descriptions:

$C = f(V, S)$ (V2C-Completion),  $C = f(V, g(V))$ (V2C-Generation),  (1)

where $f$ denotes the commonsense generation model.
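To make the two settings concrete, here is a trivial plain-Python reading of Eq. (1); the function names are purely illustrative and not part of any released code:

```python
def v2c_completion(video, caption, f):
    """V2C-Completion: commonsense C = f(V, S) from the video and the ground-truth caption."""
    return f(video, caption)

def v2c_generation(video, f, g):
    """V2C-Generation: first generate a caption g(V), then use it in place of the ground truth."""
    return f(video, g(video))
```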

V2C-Transformer
The proposed Video2Commonsense Transformer is a cross-modal model that generates captions and commonsense-enriched descriptions from videos. Our approach (Figure 2) adopts the encoder-decoder design: a video encoder extracts global representations of the input video, and a transformer decoder produces relevant commonsense knowledge along with captions. Decoder: The video encoding is used as input to two decoder networks that use a transformer language model (Radford et al., 2018) to generate a caption and a commonsense description, using an inference mechanism similar to Bosselut et al. (2019). Our model is a two-stage process that first predicts the current events directly from videos, and then produces the corresponding commonsense captions. During training, the caption decoder D_CAP takes the video encoding (v) and the ground-truth caption (s) as input to generate the caption encoding (ŝ), while the commonsense decoder D_CMS uses the concatenation of the video and caption encodings to obtain the commonsense description (c), as shown in Figure 2 (b). This arrangement enables the attention module in the commonsense decoder to attend to both the video and the caption context.
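The two-stage arrangement can be sketched roughly as follows. This is a minimal illustration using PyTorch's built-in transformer modules; the dimensions, module names, and exact wiring are assumptions for exposition, not the authors' released implementation:

```python
import torch
import torch.nn as nn

def causal_mask(sz, device):
    # Upper-triangular mask so each position cannot attend to future words.
    return torch.triu(torch.full((sz, sz), float("-inf"), device=device), diagonal=1)

class V2CTransformerSketch(nn.Module):
    """Sketch: caption decoder D_CAP attends to the video encoding; commonsense
    decoder D_CMS attends to the concatenated video and caption encodings."""
    def __init__(self, vocab_cap=27603, vocab_cms=24010, d_model=1024, n_heads=8, n_blocks=6):
        super().__init__()
        self.video_proj = nn.Linear(2048, d_model)        # per-frame ResNet features -> d_model
        self.cap_embed = nn.Embedding(vocab_cap, d_model)
        self.cms_embed = nn.Embedding(vocab_cms, d_model)
        self.cap_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_blocks)  # D_CAP
        self.cms_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_blocks)  # D_CMS
        self.cap_head = nn.Linear(d_model, vocab_cap)
        self.cms_head = nn.Linear(d_model, vocab_cms)

    def forward(self, video_feats, caption_ids, cms_ids):
        v = self.video_proj(video_feats)                                      # (B, N_v, d)
        s_hat = self.cap_decoder(self.cap_embed(caption_ids), memory=v,
                                 tgt_mask=causal_mask(caption_ids.size(1), v.device))
        memory = torch.cat([v, s_hat], dim=1)                                 # video + caption context
        c_hat = self.cms_decoder(self.cms_embed(cms_ids), memory=memory,
                                 tgt_mask=causal_mask(cms_ids.size(1), v.device))
        return self.cap_head(s_hat), self.cms_head(c_hat)
```

At inference time, in the V2C-Generation setting, the encoding of the generated caption would take the place of the ground-truth caption encoding.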
Transformer Decoder: The decoder is composed of a stack of transformer blocks (dashed area in Figure 2 (c)), whose main component is a self-attention architecture. It takes as input the summation of the word embeddings and the positional encoding, offset by one position through masked multi-head attention, which prevents future words from being seen. In our model, we deploy two such stacked decoder architectures, one for caption decoding and one for commonsense knowledge decoding. The Transformer Block consists of consecutive operations: a multi-head attention module (denoted H_M-ATT), a two-layer feed-forward network (H_FFN), a layer normalization operation, and a residual connection.
Multi-head Attention module: To enable our transformer decoder to generate commonsense descriptions using both the visual and textual content, we modify the multi-head attention module (the basic unit in recent transformer-based language generation models (Radford et al., 2018, 2019)) into a cross-modal module. H_M-ATT takes as input the embeddings of a key (K), value (V), and query (Q). In the transformer block, the key and value are the video encoding (caption decoder) or the concatenation of the video and caption encodings (commonsense decoder), while the query is the output of the previous transformer block. In the masked multi-head attention module, K, V, and Q are identical vectors of the input embedding. For a self-attention block with $h$ heads,

$\mathrm{H}_{\text{M-ATT}}(K, V, Q) = W\,[x_1; x_2; \ldots; x_h],$

where $x_i$ is computed by the scaled dot-product attention operation, for head index $i$, key dimension $d_k$, and transformation parameters $W_i$:

$x_i = \mathrm{softmax}\!\left(\frac{(Q W_i^{Q})(K W_i^{K})^{\top}}{\sqrt{d_k}}\right) V W_i^{V}.$

Figure 3: The overall three-step pipeline (retrieval from ATOMIC, BERT re-ranking, and human labeling) to construct our V2C dataset.
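For illustration, the cross-modal multi-head attention described above can be written roughly as follows; this is a sketch in which tensor shapes and parameter names are assumptions, not the exact implementation:

```python
import math
import torch
import torch.nn as nn

class CrossModalMultiHeadAttention(nn.Module):
    """H_M-ATT: queries come from the previous transformer block; keys/values come
    from the video encoding (caption decoder) or video+caption encodings (commonsense decoder)."""
    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # the matrix W applied after concatenating heads

    def forward(self, query, key, value):
        B = query.size(0)
        # Project and split into h heads: (B, h, seq_len, d_k).
        q = self.W_q(query).view(B, -1, self.h, self.d_k).transpose(1, 2)
        k = self.W_k(key).view(B, -1, self.h, self.d_k).transpose(1, 2)
        v = self.W_v(value).view(B, -1, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention per head: x_i.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        x = scores.softmax(dim=-1) @ v
        # Concatenate the heads and apply the output transformation.
        x = x.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.W_o(x)
```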

The V2C Dataset
For the V2C task we need video clips annotated with commonsense descriptions about the agents in the video, as shown in Figure 1. While there are video captioning datasets such as MSR-VTT (Xu et al., 2016), the captions in these datasets describe only the observable objects in the scene, and do not describe latent and commonsense aspects. We are the first to curate such a dataset, with annotations describing the intention of the agent to perform an action, the effect of the action, and the attributes of the agent given the action. Moreover, many different commonsense descriptions can be plausible for the same video, making it inappropriate to just evaluate caption generation using BLEU scores.

MSR-VTT
ATOMIC (Sap et al., 2018) is an atlas of everyday commonsense knowledge that contains 880k triplets about causes and effects of human activities, organized as if-then relations and annotated by crowd-sourced workers. This data can be categorized based on causal relations, giving us the categories "cause", "effect", and "attribute", e.g., "if X wants to relax, then he will play a video game."

Querying from ATOMIC and Re-ranking
Since the inferential knowledge in ATOMIC only covers human activities, we first retain only those captions in MSR-VTT that describe human activities. We then select the three queries from ATOMIC most similar to the caption, and extract the commonsense descriptions corresponding to these queries. In order to select a more reasonable subset of commonsense descriptions, we train a ranking model. We use the BERT (Devlin et al., 2019) architecture for the ranking model, trained on the ATOMIC dataset for a binary classification task, to predict the relevance of a candidate commonsense description with respect to the event. We select the top three relevant intentions, effects, and attributes for each caption. This gives us a preliminary set of 9 commonsense annotations per video directly from the ATOMIC dataset, relevant to the caption, albeit with noise and with annotations that are not relevant to the video.
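As a rough illustration, such relevance scoring can be set up as binary sequence-pair classification with a standard BERT classifier. This is a hedged sketch of how the ranker could be applied at inference time: the checkpoint name and scoring convention are assumptions, and the classification head would first need to be fine-tuned on ATOMIC pairs as described above.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def rank_candidates(caption, candidates, top_k=3):
    """Score (caption, candidate) pairs and keep the top-k most relevant candidates."""
    inputs = tokenizer([caption] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                  # (num_candidates, 2)
    relevance = logits.softmax(dim=-1)[:, 1]             # probability of the "relevant" class
    order = relevance.argsort(descending=True)[:top_k]
    return [candidates[int(i)] for i in order]

# Example: rank candidate intentions retrieved from ATOMIC for one caption.
print(rank_candidates("a group of runners get prepared to run a race",
                      ["to win the medal", "to eat dinner", "to be healthy"]))
```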

Detailed Human Annotation
Since we do not use the video to retrieve commonsense descriptions from ATOMIC, we employ human workers to annotate our dataset. We recruit two sets of human workers to watch the video, read the caption, and select/annotate the relevant commonsense descriptions for each video. The first set is Amazon Mechanical Turk (AMT) workers who select relevant descriptions. The second set is skilled human annotators, screened from a set of university students proficient in English, who are asked to provide annotations in their own words, and to remove or edit irrelevant annotations provided by ATOMIC and AMT workers. This makes our annotations not only grounded in the video, but also more descriptive, linguistically diverse, and of higher quality (see Figure 3). The descriptions from ATOMIC, although not relevant to the video in some cases, give our workers an idea about the format of annotations desired. The skilled humans reported that 95% of the captions were relevant, and 65% of the ATOMIC descriptions were useful in understanding the annotation task. Through this procedure, we obtain 6819 videos for training and 2906 videos for testing, with a total of 121,651 captions (∼12 captions/video), each caption accompanied by 5 commonsense knowledge annotations (V2C-Raw set). In our experiments, we use video captioning techniques to conduct the V2C-Completion task on the V2C-Raw set. In addition, we instruct human annotators to select and rewrite one raw phrase into complete sentences that complement the captions. In total we have 3 complete sentences per video for intention/effect/attribute respectively, yielding a subset that allows our model to generate complete story-like sentences (V2C-Clean set). Table 1 shows examples from the newly compiled dataset. We conduct rigorous human evaluation to evaluate the quality of our V2C dataset ("Gold Annotations" in Table 3). Details about the dataset creation process and quality control mechanisms can be found in the Appendix.

Table 2: Evaluation of the V2C-Completion task using CIDEr, BLEU, Perplexity, ROUGE, and METEOR metrics. We use only BLEU-1 to evaluate attribute generation since the average length of the ground truth is just less than 2.

Experiments
In this section we describe the loss function used for training our model, additional details about video pre-processing, hyper-parameters, and baseline models, and the metrics used for evaluation.
Loss Function: The decoder parameters $\Theta$ are trained to maximize the log-likelihood over the training set:

$\mathcal{L}(\Theta) = \sum_{t=1}^{N_S} \log P(y_t \mid y_{<t}, V; \Theta) + \sum_{t=1}^{N_C} \log P(y_t \mid y_{<t}, V, S; \Theta),$

where $y_t$ denotes the one-hot vector probability of each word at time $t$, and $N_S$, $N_C$ denote the lengths of the caption and the commonsense description respectively.
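In code, this objective amounts to summed per-token cross-entropy terms over the caption and the commonsense tokens. A minimal sketch, assuming logits of shape (batch, time, vocab) and a padding index of 0:

```python
import torch.nn.functional as F

def v2c_loss(cap_logits, cap_targets, cms_logits, cms_targets, pad_id=0):
    """Negative log-likelihood over caption tokens (length N_S) plus
    commonsense tokens (length N_C); padding positions are ignored."""
    cap_nll = F.cross_entropy(cap_logits.transpose(1, 2), cap_targets,
                              ignore_index=pad_id, reduction="sum")
    cms_nll = F.cross_entropy(cms_logits.transpose(1, 2), cms_targets,
                              ignore_index=pad_id, reduction="sum")
    return cap_nll + cms_nll   # minimizing this sum maximizes the log-likelihood
```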
Setting: To obtain video representations, we uniformly sample 40 frames from each video and extract features using a ResNet (He et al., 2016) pre-trained on the ImageNet ILSVRC12 dataset (Deng et al., 2009), taking the 2048-d output of the last layer. We use one-hot (1-of-N) encodings of the text input and pass them through an embedding layer to produce a 1028-d hidden vector. We use independent vocabularies for captioning and commonsense generation with sizes 27,603 and 24,010 respectively.

Hyperparameters: Our decoder is a lightweight transformer decoder consisting of 6 transformer blocks with 8 attention heads each. We use the Adam optimizer with 5000 warm-up steps and a learning rate initialized at 1e-4 (a configuration sketch is given below), and a dropout probability of 0.1 after the residual layer. Our model is trained on a machine with a single NVIDIA 1080-Ti GPU.

Table 3: Human evaluation scores for V2C. Captions are an input for the V2C-Completion task, and generated for the V2C-Generation task. The best model is given in bold, while the overall best is underlined.

Metrics: We evaluate generations with BLEU, METEOR, CIDEr, ROUGE (Lin, 2004), and the perplexity score of the generation on its corpus. We further conduct human evaluations using AMT workers, who are asked to identify whether the generated commonsense justifiably completes the events (V2C-Completion). We follow the setup in (Sap et al., 2018) and randomly sample 100 videos from the test set and collect 10 generations for each. To guarantee the objectiveness of the human evaluations, we hire 5 workers for each sample, yielding 30k ratings in total for each model.
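For reference, the stated optimization settings could be wired up along the following lines; this is a toy sketch, and the exact schedule after warm-up is an assumption not specified in the text:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                      # stand-in for the V2C-Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
warmup_steps = 5000

# Linear warm-up of the learning rate over the first 5000 steps (assumed schedule).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

dropout = nn.Dropout(p=0.1)                    # dropout applied after the residual layer

for step in range(10):                         # toy loop; real training iterates over the V2C data
    loss = model(torch.randn(4, 10)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```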

Results
Natural Language Generation Metrics: We show the evaluation of the commonsense completion task in Table 2. We would like to highlight that our transformer model is lightweight, with only half of the parameters of GPT and without any pre-training. We also evaluate our model on the task of caption generation with human evaluations, and compare it with the gold annotations. Our gold annotation for ground-truth captions (sourced from the MSR-VTT dataset) points to the fact that a small percentage of captions from MSR-VTT are not relevant to the video, and this is amended by our human workers.
For the V2C-Completion task, our V2C-Transformer model is substantially better (by 7.73%) than the LSTM-based model of Gao et al. (2017), and shows a consistent lead on each dimension. Thus, when the ground-truth caption is given, our model is able to generate much more relevant commonsense descriptions, consolidating its ability for commonsense generation. For the task of V2C-Generation, the difference between human scores for the LSTM model and the V2C-Transformer is reduced, but our V2C-Transformer still outperforms on average by 2.98%. This may be attributed to the fact that the LSTM-based model is slightly better at generating captions.

Qualitative samples of commonsense-enriched captions (intention, effect, attribute), e.g.: "Because she wants to express themselves, the woman is singing a song and playing piano, she will enjoy playing piano. The woman is artistic." and "To know how to play soccer, a man is playing a soccer game, and he will cautiously dribble the ball. The man is enthused."

Generating Textual Stories with Commonsense
In order to generate story-like textual descriptions that complement the factual captions, we additionally train our model to exploit our diverse complete-sentence annotations. Specifically, instead of producing the commonsense knowledge given the videos and captions, we finetune our pre-trained V2C-Transformer model to predict the human-rewritten texts, and generate complete story-like captions. Since we do not have enough annotations per sample to compute a fair BLEU score for comparisons, we showcase some sample generated descriptions for qualitative analysis (see Figure 4). We observe that the V2C-Transformer is able to produce complete stories with simple yet logically consistent storylines that complement both the visual content and the factual descriptions. We believe that collecting a set of story-like sentences will further enrich our models, and allow us to generate much more contextual and creative descriptions.

V2C-QA
Another way of generating commonsense descriptions about the video is by asking pointed questions. Consider the example in Figure 1, where we ask the question "What happens next to the runners?" about the effect of the action "prepare" performed by the agents "group of runners" observed in the video. We propose V2C-QA, an open-ended commonsense video question-answering task, where we ask questions about the intents, effects, and attributes of the agents in the video. Dataset: We use the caption and commonsense annotations in the V2C dataset to create question-answer pairs for each video. We first extract the action and subject from the caption using SpaCy linguistic features (Honnibal and Johnson, 2015). For each intention, attribute, and effect for a video, we use template-based generation to get 7 types of questions, yielding 21 questions per sample, including negative questions as in Gokhale et al. (2020). In total, we have 1,250 training videos and 250 test videos, and a total of 37k questions. We have a set of 5,555 unique answers for our questions. Each question can have multiple possible true answers, as shown in the example in Figure 5. The V2C-QA task asks questions that require commonsense reasoning about internal mental states, motivations, and latent aspects of agents in the video, as opposed to conventional video-QA questions about visible objects and actions.
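To make the template idea concrete, here is a toy sketch that turns an extracted subject/action pair and one commonsense annotation into a question-answer pair. The three templates below are invented for illustration only; the actual 21 templates are listed in Table 6.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

TEMPLATES = {
    "intention": "Why does {subject} {action}?",
    "effect":    "What happens next to {subject}?",
    "attribute": "How can {subject} be described?",
}

def make_qa(caption, commonsense, cms_type):
    """Build one (question, answer) pair from a caption and a commonsense annotation."""
    doc = nlp(caption)
    subject = next((t.text for t in doc if t.dep_ in ("nsubj", "nsubjpass")), "the agent")
    action = next((t.lemma_ for t in doc if t.pos_ == "VERB"), "act")
    question = TEMPLATES[cms_type].format(subject=subject, action=action)
    return question, commonsense

print(make_qa("A group of runners get prepared to run a race",
              "are congratulated at the finish line", "effect"))
```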

Models:
We utilize our V2C-Encoder followed by an open-ended answering module. We jointly predict the type of the question and combine it with the V2C encoding using a feed-forward network. For textual features, we use embeddings from BERT-base (Devlin et al., 2019). Our models are trained on the open-ended QA task, set up as a multi-label classification task similar to VQA (Antol et al., 2015), with an answering module design inspired by LXMERT (Tan and Bansal, 2019). Our loss function includes the classification loss for answering, the attention loss for question-type, and a label-ranking loss.
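A rough sketch of such an open-ended answering head over a fixed answer vocabulary is given below. The dimensions, fusion design, and loss terms shown are illustrative assumptions rather than the exact architecture; in particular, the paper's label-ranking term is omitted from this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class V2CQAHead(nn.Module):
    """Sketch of an open-ended answering module over a fixed answer vocabulary."""
    def __init__(self, d_model=768, n_answers=5555, n_qtypes=3):
        super().__init__()
        self.qtype_head = nn.Linear(d_model, n_qtypes)      # intention / effect / attribute
        self.fuse = nn.Sequential(nn.Linear(d_model + n_qtypes, d_model), nn.ReLU())
        self.answer_head = nn.Linear(d_model, n_answers)    # multi-label over unique answers

    def forward(self, encoding):
        qtype_logits = self.qtype_head(encoding)
        fused = self.fuse(torch.cat([encoding, qtype_logits], dim=-1))
        return self.answer_head(fused), qtype_logits

def qa_loss(ans_logits, ans_multi_hot, qtype_logits, qtype_labels):
    # Answer classification (multi-label) plus question-type supervision.
    ans = F.binary_cross_entropy_with_logits(ans_logits, ans_multi_hot)
    qtype = F.cross_entropy(qtype_logits, qtype_labels)
    return ans + qtype
```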
Results: MSR-VTT QA (Xu et al., 2017) serves as a good baseline since it is trained on a conventional video-QA task on the MSR-VTT videos, and only takes the video and the query as input, unlike recent video understanding models (Lei et al., 2018) that take additional supervision such as subtitles. However, this model is trained for a multiple-choice QA scheme, so we modify it with our open-ended answering module. We compare against our models in which the encoder is pretrained on the V2C caption generation task and then finetuned on the V2C-QA task. We also train models with ground-truth factual captions as input. Our results are shown in Table 4, where we evaluate the prediction of the top-k (k = 1, 3, 5) answers, and report precision and recall.
Our encoder pre-trained on the V2C task outperforms all other models. Attribute-related questions are easier to answer, while the models struggle the most for questions about intention. Captions help in questions about effects. The overall text-only baseline shows an insignificant bias between the question and answer-options.

Outlook
A video typically contains one or many objects (sometimes performing actions) in different backgrounds, scenes, or situations. Some objects may be "passive" such as trees or buildings, while some objects may be "active" such as people performing actions like walking, singing, and driving. This paper is focused on describing such active agents in terms of their intentions, effects of their actions, and attributes that characterize these agents. We distinguish V2C from the traditional video captioning task. Video captions describe observable objects, background, and actions, while commonsense descriptions in our task seek to describe the unobservable intentions of the agent (pre-conditions or mental conditions), effects of the action (that happen in the future), and attributes that characterize the agent. Thus commonsense generation goes beyond the visible. Ours is the first attempt at developing a generative video-based commonsense model. We anticipate that our framework can be utilized for many applications in video understanding, comprehension, human-robot interaction, and learning commonsense in a multimodal setting.

Conclusion
In this paper, we explore a novel and challenging task: generating video descriptions with rich commonsense descriptions that complement the factual captions. We expand an existing video captioning dataset for the V2C task through automated retrieval from a textual commonsense corpus followed by human labeling, and present a novel V2C-Transformer model to serve as a strong baseline for the V2C task. Our evaluation verifies the effectiveness of our method, while also indicating scope for further study, enhancement, and extensions in the future. Our experiments using the V2C-Transformer as a component for the V2C-QA task show that the model has transfer learning capabilities that can be applied to other vision-and-language tasks, such as question answering, that require commonsense reasoning.

A Dataset Creation

Our dataset creation methodology is a three-step procedure as shown in Figure 9. In the first step, we use the caption to query ATOMIC (Sap et al., 2018) and retrieve the top-3 intentions, effects, and attributes. These are re-ranked by a BERT-based model in the second step. The final step involves humans in the annotation process. We ask human annotators to select the most relevant descriptions, and to provide additional descriptions in their own words. The annotators also convert a subset of our dataset into complete sentence descriptions.

A.1 Querying from ATOMIC
For every video-caption pair in the MSR-VTT dataset, we select the 3 most similar events from ATOMIC. These are then used to retrieve textual descriptions of three types (intentions, effects, and attributes) from ATOMIC.

A.2 BERT Ranking Model
We implement a Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2019) as a ranking model to rank and retrieve the top-3 most plausible commonsense aspects to complement the ground-truth caption. This is done by treating the ranking task as a binarized next sentence prediction (NSP) task, for which we have 115,312 pairs for training/testing. We evaluate our model using the accuracy of the prediction on the test set of ATOMIC, which is 30% of the entire set. BERT achieves 86.21% accuracy on the NSP task on average, as shown in Table 5. In addition, we also conduct human evaluations to measure the overall quality of the expanded V2C dataset (see "Gold Annotations" in Table 3 of the main paper).

A.3 Human Labeling
By querying ATOMIC and re-ranking with BERT, we obtain commonsense descriptions that are relevant to the caption. However, we want to make sure that these descriptions are also relevant to the video. Thus we utilize human workers from Amazon Mechanical Turk (AMT) to select the most relevant commonsense descriptions. Our annotation interface is shown in Figure 10. We ask the annotators to select descriptions that are most relevant to the video and to the caption, and also encourage them to add their own commonsense descriptions. This makes our dataset more natural and human-like. It also allows us to remove noisy annotations that may be produced by text-only ATOMIC querying. We show additional samples from our V2C dataset in Figure 11, a word cloud in Figure 7, and word frequencies in Figure 8.

A.4 Benefits of the Three-Step Pipeline
Since our videos are annotated with captions, we use the captions to retrieve commonsense descriptions from ATOMIC. The ATOMIC dataset has comprehensive annotations for human activities, actions, and events, and as such covers most of the events in MSR-VTT. Thus, using these two datasets together is a natural step for creating our V2C dataset. This purely caption-based retrieval unfortunately does not incorporate the latent aspects of the video, but only those of the caption. Moreover, since the video is not used for retrieval, the commonsense annotations may be out of context. Thus, we bring in human annotators to watch the video, read the caption, and then use the set of descriptions from ATOMIC to select the relevant ones and discard the irrelevant or out-of-context descriptions. The human annotators then provide annotations about intention, effect, and attribute in their own words. The ATOMIC-retrieved descriptions help the human annotators get an idea about the task and also get a glimpse of the format of the desired annotations. This significantly reduces the noise in human annotations.
To guarantee and measure the overall quality of our V2C dataset, we have conducted human evaluations on the V2C annotations. Our results show that 86.29% of the video-caption-commonsense triples are labeled as reasonable samples (see "Gold Annotations" in Table 3 of the main paper), verifying the quality of our dataset.

B V2C-QA Dataset
For the V2C question answering task, we repurpose our V2C dataset and convert it to a question-answering dataset. We choose a subset of 1,500 videos: 1,250 for training and 250 for testing, following the same train-test split as MSR-VTT. We use SpaCy linguistic features (Honnibal and Montani, 2017) along with the LemmInflect library (https://github.com/bjascob/LemmInflect) and template-based generation to convert the captions, intentions, effects, and attributes from V2C into questions and ground-truth answers. Our templates are linguistically diverse, natural, and grammatically sound. We have 21 types of templates, with each template having numerous possibilities for combinations of the slots in the template. Thus we get 21 types of questions (7 each for intention, effect, and attribute), as shown in Table 6. Since our task is open-ended question-answering, our questions are annotated with all possible correct answers for that question. To get answers for the "negative" questions shown in Table 6, we use an adversarial matching strategy similar to (Zellers et al., 2019), using RoBERTa (Liu et al., 2019) similarity. We will release our V2C-QA question and answer generation code publicly.

Figure 9: The data creation flow for V2C. We use the retrieved videos and captions from MSR-VTT and use the BERT re-ranking module to obtain a list of top-3 intentions (I), effects (E), and attributes (A). These are then further improved by human labeling. A subset of annotations is also converted to full sentences by human annotators.

Figure 10: Our human labeling interface. We ask human workers to select relevant commonsense descriptions as well as provide additional texts in their own words.

C Qualitative Generation Results
We show additional V2C-Completion samples generated by our V2C-Transformer model in Table 7.

D Human Evaluation
Human evaluation is an important part of verifying the performance of our model and the quality of the V2C dataset. In this section we describe our setup for human evaluation of the captions and commonsense descriptions in our dataset, as well as those generated by our models.

D.1 Amazon Mechanical Turk Interface
We conduct our human evaluations by crowdsourcing ratings from workers on Amazon Mechanical Turk (AMT). We perform these human evaluations on the same test set used for our automated metrics. Figures 12 and 13 show screenshots of the rating task as seen by the workers. The workers are given explicit instructions about the rating task and, depending on the task, are asked to rate the commonsense descriptions and the caption.
For the V2C-Completion task, the workers are provided with the video and the ground-truth caption, and are asked to rate only the generated commonsense (intention, effect, or attribute) on a scale of 1 to 5. The workers are asked to provide this rating on the basis of whether the generated text is relevant to the video, i.e., whether the caption/commonsense can plausibly complete the given event.
For the V2C-Generation task, the workers are asked to rate the caption as well as the commonsense texts with respect to the video. The workers are also asked to conduct identical tasks for the gold (ground-truth annotations) in our new V2C dataset.

D.2 Scheme for Validity
Our ratings are measured on a scale of 1 to 5. Annotations which receive a score greater than 3 are considered "valid", so as to be consistent with the binary ratings used by (Bosselut et al., 2019) for their experiments. We then compute average validity scores for each commonsense aspect: intention, attribute and effect.
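For instance, the validity rate is simply the fraction of ratings above the threshold of 3; a trivial sketch:

```python
def validity(ratings, threshold=3):
    """Fraction of ratings strictly greater than the threshold (here 3 out of 5)."""
    return sum(r > threshold for r in ratings) / len(ratings)

print(validity([5, 4, 2, 3, 5]))   # 0.6
```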

D.3 Statistics of Human Evaluations
In order to further analyze the human evaluations of our generated outputs, we use three metrics: the standard deviation of the ratings, the inter-rater agreement score (IRAS), and a smooth version of IRAS. The standard deviation is calculated per sample based on the evaluations provided by multiple workers on that sample. We do so to evaluate how consistent our AMT workers are and how much they deviate from or agree with each other. We use three different metrics so as to analyze our data and generations through multiple lenses, to be certain that the outputs and annotations are high-quality.

D.3.1 Inter-Rater Agreement Score
The Inter-Rater Agreement Score is computed as the average percentage of raters for each sample that agree with the majority opinion. Let $m$ be the size of the test set, and $n$ be the number of ratings per sample. Let $R_j = \{r_1, \ldots, r_n\}$ be the set of ratings for test sample $j$. Then the mode $r_{\text{mode}}$ is defined as the most frequently occurring (majority) rating in the set of ratings $R_j$, i.e., $r_{\text{mode}} = \mathrm{MODE}(R_j)$. The Inter-Rater Agreement Score is the average percentage of raters that agree with the majority opinion $r_{\text{mode}}$:

$\mathrm{IRAS} = \frac{100\%}{n \times m} \sum_{j=1}^{m} \sum_{i=1}^{n} \mathbb{I}\left(r_i = r_{\text{mode}}\right),$

where $\mathbb{I}$ is the indicator function.

D.3.2 Smooth Inter-Rater Agreement Score
While IRAS is a good metric for measuring rater agreement on our dataset, it suffers from a flaw: irrespective of how close the ratings are, the indicator function returns 0 for the rating pair (1, 5) as well as (4, 5), even though 4 and 5 are close to each other while 1 and 5 are at opposite ends of the scale. To avoid this, we replace the indicator function with a smooth exponential term. The smooth inter-rater agreement score is given by:

$\mathrm{IRAS}_{\text{smooth}} = \frac{1}{n \times m} \sum_{j=1}^{m} \sum_{i=1}^{n} \left(\frac{1}{2}\right)^{|r_i - r_{\text{mode}}|}.$

Table 8 shows our analysis in terms of the three metrics described above. Our V2C-Transformer architecture consistently outperforms the baseline model ATTENCDEC (Gao et al., 2017) on all three metrics for each type of commonsense. This means that raters are more consistent with their ratings (in terms of deviation or agreement) for commonsense descriptions generated by our model.
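Both agreement scores can be computed directly from the per-sample rating lists; a small sketch, assuming the ratings are given as a list of per-sample lists:

```python
from collections import Counter

def iras(ratings, smooth=False):
    """ratings: list of per-sample rating lists, e.g. [[4, 5, 4], [1, 2, 1]]."""
    total, count = 0.0, 0
    for sample in ratings:
        r_mode = Counter(sample).most_common(1)[0][0]   # majority rating for this sample
        for r in sample:
            total += 0.5 ** abs(r - r_mode) if smooth else float(r == r_mode)
            count += 1
    return total / count

print(iras([[4, 5, 4], [1, 2, 1]]))                 # plain IRAS as a fraction (x100 for a percentage)
print(iras([[4, 5, 4], [1, 2, 1]], smooth=True))    # smooth IRAS
```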

Qualitative examples (captions, candidate commonsense phrases, and human-rewritten story-like descriptions): for the caption "A guy sings a song in a music video", the rewritten story reads "Because he wants to express himself, a guy sings a song in a music video, and he will upload it to YouTube soon. He is quite an enthusiastic guy."; for the caption "A group of males speaking to each other at a meeting", the candidate phrases include "to have a conversation", "gives a rebuttal", and "extroverted", and the rewritten story reads "In order to convey with each other the information, a groups of males speaking to each other at a meeting, they will get into a rebuttal soon. The people have the attribute to be extroverted."; for the caption "A man drives a vehicle through the countryside", the rewritten story reads "To get to his destination as soon as possible, a man drives a vehicle through the countryside, he may soon arrives at his destination. The man is a good driver."