A Dataset for Telling the Stories of Social Media Videos

Video content on social media platforms constitutes a major part of the communication between people, as it allows everyone to share their stories. However, if someone is unable to consume video, either due to a disability or limited network bandwidth, their participation and communication are severely restricted. Automatically telling these stories through multi-sentence descriptions of videos would help bridge this gap. To learn and evaluate such models, we introduce VideoStory, a new large-scale dataset and challenge for multi-sentence video description. Our VideoStory captions dataset is complementary to prior work and contains 20k videos posted publicly on a social media platform, amounting to 396 hours of video with 123k sentences temporally aligned to the video.


Introduction
Telling stories about what we experience is a central part of human communication (Mateas and Sengers, 2003). Increasingly, stories about our experiences are captured in the form of videos and then shared on social media platforms. One goal of automatically understanding and describing such videos with natural language is to generate multi-sentence descriptions which convey the story, making the videos accessible to people who are situationally (e.g., limited bandwidth) or physically (e.g., visually) impaired. However, it is still a challenge for vision and language models to automatically encode and describe the temporal content of videos with multi-sentence descriptions (Rohrbach et al., 2014; Zhou et al., 2018b). To better understand the stories shared on social media, we collect and annotate a novel dataset consisting of videos from a social media platform. Importantly, we collect descriptions containing multiple sentences, as single sentences would typically not be able to capture the narration and plot of a video.
We introduce a large-scale multi-sentence description dataset for videos. To build a dataset of high-quality, diverse, and narratively interesting videos, we choose videos that had high engagement on a social media platform. Existing video captioning datasets, such as ActivityNet Captions or cooking video datasets (Regneri et al., 2013; Zhou et al., 2018a), have focused on sets of pre-selected human activities, whereas social media videos cover a great diversity of topics. Videos with high engagement tend to be narratively interesting, because humans find very predictable videos less enjoyable; accurately captioning such videos therefore requires integrating information from the entire video to describe a sequence of events (see Figure 1). Together, this creates a diverse and challenging new benchmark for video and language understanding.
We present a thorough analysis of the new benchmark, demonstrating that linguistic and video context are crucial for accurate captioning and that the captions exhibit temporal consistency. We also report baseline results using state-of-the-art models.

Multi-Sentence VideoStory Dataset
In Table 1 we summarize existing video description datasets; most provide only single-sentence descriptions or are restricted to narrow domains. Other multi-sentence description datasets have been proposed for story narration over sets of images taken from a Flickr album (Huang et al., 2016; Krause et al., 2017). Further related work includes visual summarization of Flickr photo albums (Sigurdsson et al., 2016a) or videos (De Avila et al., 2011), where the idea is to pick the key images or frames that summarize the visual content.
Figure 1: Two example videos from our dataset with their temporally aligned multi-sentence annotations.
Video 1: Two little girls are riding on the horse backs. One of the horses starts to wallow in the puddle, throwing the girl into the muddy water. The little girl is getting a hold of herself, the girl on the other horse continues to laugh at the whole incident. A dog joins her in the puddle, while the horse stands up and shakes off the water on it. She then smiles and grabs the horse.
Video 2: A large group of people have gathered inside of a room for a wedding. A woman walks down the aisle with a man slowly as people watch. The two of them get to the end of the aisle where a groom stands waiting. The man shakes hands with the groom and gives the woman a kiss on her forehead. The man who walked her down the aisle steps away towards the side of the room as the couple take each others arms.

Table 1: Comparison of existing video description datasets (domain, number of videos and clips, average clip length, and number of sentences), covering, among others, the cooking dataset of Das et al. (2013), VTW (Zeng et al., 2016), TGIF (Li et al., 2016), MPII-MD, M-VAD (Torabi et al., 2015), LSMDC (Rohrbach et al., 2017), TACoS (Regneri et al., 2013), TACoS multi-level (Rohrbach et al., 2014), YouCook II (Zhou et al., 2018a), and our VideoStory dataset.

We select videos posted on a social media platform to create our dataset because of the variability in topics, length, viewpoints, and quality. Such videos also tend to represent a good distribution of the stories people communicate. We select videos that are public and popular, with a large number of comments and shares that triggered interactions between people. In total, our dataset consists of 20k videos with durations ranging from 20s to 180s, spanning the diverse topics observed on social media platforms. We follow prior work and use a two-step annotation procedure to create temporally annotated sentences: (i) describing the video in multiple sentences, covering objects, situations, and important details of the video; (ii) aligning each sentence in the paragraph with the corresponding timestamps in the video. We refer to these aligned spans as video segments. In Figure 1, we present two example annotated videos describing (i) a scene where two girls are playing with horses; (ii) a wedding with a bride walking down the aisle.
We summarize the statistics of our dataset in Table 2 and compare it to prior work in Table 1. Each of the 20k videos in our VideoStory dataset is annotated with a paragraph which has on average 4.67 temporally localized sentences. As we have three paragraphs per video for the validation and test sets, we have a total of 26,245 paragraphs with a total of 123k sentences. Each video in the training set has a single annotation, but videos in the validation, test, and blind test splits have three temporally localized paragraph annotations for evaluation. While the test set can be used to compare model variants in a paper, only the best model per paper should be evaluated on the blind test set annotations, which will only be possible through an evaluation server. Annotations for the blind test set will not be released.
To explore the differences in domains between our dataset and ActivityNet Captions, we use normalized pointwise mutual information (nPMI) to identify the words most closely associated with each dataset. The highest-ranked words for ActivityNet are almost exclusively sports related, whereas in our dataset they include animals, babies, and words related to social events such as weddings. The most dominant actions in ActivityNet are related to sports or household activities, whereas actions in our dataset are related to social activities such as laughing, waving, and cheering. Our analysis of the distribution of POS categories shows that nouns are the most dominant category in the VideoStory captions dataset with 24% of all tokens, followed by verbs (18.5%), determiners (15.9%), adjectives (4.36%), adverbs (5.16%), and prepositions (5.04%). We observe a similar distribution of POS categories in ActivityNet Captions. We also find that in ActivityNet 50% of the videos have at least one segment that covers more than half of the video duration, whereas in our dataset only 30% of the videos do. In Figure 2, we show the distribution of sentence/segment annotations over time. The average number of (temporally localized) sentences per video is 4.67, compared to 3.65 in ActivityNet, despite our videos being shorter, indicating the high information content of our videos.
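As an illustration of this analysis, the minimal sketch below computes a per-word normalized PMI between a dataset label and its caption tokens; the tokenization, minimum-count threshold, and variable names are assumptions for illustration, not the exact procedure used to obtain the numbers above.

```python
import math
from collections import Counter

def npmi_per_word(tokens_a, tokens_b, min_count=20):
    """Normalized PMI of each word with dataset label A, given tokens from A and B.

    tokens_a, tokens_b: lists of (lower-cased) caption tokens from the two datasets.
    Returns words sorted by how strongly they are associated with dataset A.
    """
    count_a = Counter(tokens_a)
    count_all = Counter(tokens_a) + Counter(tokens_b)
    n_a, n_all = sum(count_a.values()), sum(count_all.values())
    p_label = n_a / n_all                       # P(dataset = A)
    scores = {}
    for w, c_all in count_all.items():
        if c_all < min_count:
            continue                            # skip rare words
        p_w = c_all / n_all                     # P(word)
        p_joint = count_a.get(w, 0) / n_all     # P(word, dataset = A)
        if p_joint == 0:
            continue
        pmi = math.log(p_joint / (p_w * p_label))
        scores[w] = pmi / (-math.log(p_joint))  # normalize by -log P(word, A)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage: words most associated with the VideoStory captions.
# top_videostory = npmi_per_word(videostory_tokens, activitynet_tokens)[:50]
```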
In Table 3 we present all three paragraph annotations for a video showing a wedding ceremony. Out of 3 annotations, Annotation 2 is more descriptive compared to 1 and 3. However, it misses details about the presence of the photographer and taking the pictures.
Temporal Analysis. High quality video descriptions are more than bags of single-sentence captions; they should tell a coherent story. To assess the importance of sentence ordering, i.e. temporal coherence, in our video paragraphs, we train a neural language model (Merity et al., 2017) on the training paragraphs of the VideoStory dataset and report its perplexity on paragraphs with the correct sentence order vs. a randomly shuffled sentence order. Results in Table 2 show that shuffled sentences have higher perplexity, demonstrating that the order of sentences in a paragraph is important for the coherence of the story.
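A simple way to run this shuffling test is sketched below; it assumes a pre-trained language model exposed through a perplexity function (training the AWD-LSTM of Merity et al. (2017) itself is not shown), so the interface is an assumption for illustration.

```python
import random

def order_vs_shuffled_perplexity(paragraphs, lm_perplexity, seed=0):
    """Compare LM perplexity of paragraphs in the original sentence order
    against a randomly shuffled sentence order.

    paragraphs: list of paragraphs, each a list of sentences (strings).
    lm_perplexity: assumed callable mapping a text string to its perplexity
                   under a language model trained on the VideoStory training
                   paragraphs (e.g., an AWD-LSTM as in Merity et al., 2017).
    """
    rng = random.Random(seed)
    ordered, shuffled = [], []
    for sents in paragraphs:
        ordered.append(lm_perplexity(" ".join(sents)))
        permuted = sents[:]
        rng.shuffle(permuted)
        shuffled.append(lm_perplexity(" ".join(permuted)))
    n = len(paragraphs)
    return sum(ordered) / n, sum(shuffled) / n

# If sentence order matters, the shuffled average should be noticeably higher.
```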

Baseline Captioning Models
We explore learning to caption the videos using ground truth video segments.

Table 3: All three paragraph annotations for a video showing a wedding ceremony.
Annotation 1: A bride walks down the aisle to her waiting bridegroom. As the bride walks, a photographer captures photos. At the end of the aisle the man giving the bride away shakes hands and hugs the bridegroom. The bride and bridegroom then interlock arms and face forward together.
Annotation 2: A large group of people have gathered inside of a room for a wedding. A woman walks down the aisle with a man slowly as people watch. The two of them get to the end of the aisle where a groom stands waiting. The man shakes hands with the groom and gives the woman a kiss on her forehead. The man who walked her down the aisle steps away towards the side of the room as the couple take each others arms.
Annotation 3: A groom is standing at the end of an aisle as a photographer takes a photo. The bride and father then come into view and walk down the aisle to the waiting groom. They stop at the grooms spot and the bride's father then shakes the grooms hand and gives a hug and walks to his spot. The groom then holds arms with the bride to begin the wedding ceremony.

Image Captioning Models. To understand whether the temporal component of the video contributes to the description, we trained image captioning models on a frame sampled from the middle of each segment of a video. We use the Show and Tell (Vinyals et al., 2015) image captioning architecture to generate captions.

Video Captioning Models. We study various video captioning models. First, we use a sequence-to-sequence (seq-seq) recurrent neural network (RNN) model, which has a two-layer encoder RNN to encode video features and a decoder RNN to generate descriptions. In the seq-seq approach we treat each segment individually and use an RNN decoder to describe each segment of the video, similar to Venugopalan et al. (2015), but using Gated Recurrent Units (GRUs) (Cho et al., 2014) for both the encoder and decoder.
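As a rough illustration, a minimal PyTorch-style sketch of such a seq-seq baseline could look as follows; the feature dimensionality, vocabulary size, and initialization of the decoder state are assumptions for illustration, not the exact configuration used for the reported results.

```python
import torch
import torch.nn as nn

class SeqSeqCaptioner(nn.Module):
    """Two-layer GRU encoder over segment features, GRU decoder over words."""
    def __init__(self, feat_dim=2048, vocab_size=10000, hidden=512):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, captions):
        # feats: (batch, n_frames, feat_dim); captions: (batch, seq_len) word ids
        _, enc_state = self.encoder(feats)         # (2, batch, hidden)
        dec_state = enc_state[-1].unsqueeze(0)     # init decoder from top encoder layer
        dec_in = self.embed(captions)              # teacher forcing on ground truth words
        dec_out, _ = self.decoder(dec_in, dec_state)
        return self.out(dec_out)                   # (batch, seq_len, vocab) logits
```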
In most videos, events are correlated with previous and future events. For example, in the first video description shown in Figure 1, once the girl is thrown into the water, she gets a hold of herself, and the horse shakes off the water. To capture such contextual correlations, we incorporate context from the previous segment's description into the captioning module. We build a model (seq-seq + context) whose decoder takes as input, at every time step, the current segment's video features and the hidden representation of the previous segment's sentence-generation RNN. For a given video segment i, with encoded video representation h^v_i and hidden representation of the previous segment's decoder h^s_{i-1}, the concatenation (h^v_i, h^s_{i-1}) is fed as input to the decoder that describes the segment (shown in Figure 3). Prior work has shown that using previous video context improves generated captions. For the video captioning models we extract features from a 3D convolutional ResNext-101 architecture pre-trained on Kinetics (Kay et al., 2017), denoted as R3D, which achieved state-of-the-art results on various activity recognition tasks (Hara et al., 2018). Since a significant percentage of our videos contain objects other than humans (e.g., animals), we also experiment with image-video fusion features (denoted by RNEXT, R3D), i.e., the concatenation of ResNext-101 features from a model pre-trained on ImageNet with the R3D features described above. We extract the image features from the same frames that were used to extract the R3D features.
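The context variant can be sketched analogously: at every decoding time step, the previous segment's final decoder state is concatenated with the current segment's video encoding. The module below is a simplified illustration under the same assumptions as the sketch above, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ContextDecoder(nn.Module):
    """Decoder for segment i that sees the encoded video features h_v_i and the
    final decoder state of the previous segment's sentence, h_s_{i-1}."""
    def __init__(self, vocab_size=10000, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # input = word embedding + video encoding + previous-sentence state
        self.gru = nn.GRU(hidden * 3, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, captions, h_v, h_s_prev):
        # captions: (batch, seq_len); h_v, h_s_prev: (batch, hidden)
        seq_len = captions.size(1)
        context = torch.cat([h_v, h_s_prev], dim=-1)           # (batch, 2*hidden)
        context = context.unsqueeze(1).expand(-1, seq_len, -1) # repeat per time step
        dec_in = torch.cat([self.embed(captions), context], dim=-1)
        dec_out, h_s = self.gru(dec_in)
        # return word logits and the new sentence state, used for segment i+1
        return self.out(dec_out), h_s[-1]
```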

Experiments and Results
For every segment, we set the maximum length of the input feature sequence to 120 (i.e., 16×120 frames of the video) and the maximum sentence length to 30. We train using the Adam optimizer with a learning rate of 0.0001. We use GRUs as the recurrent architecture to encode frames and decode captions, with 512-dimensional hidden representations. We measure captioning performance with the most widely used metrics: BLEU (B), METEOR (M), ROUGE-L (R), and CIDEr (C).

Table 4: Qualitative example of a ground truth paragraph and generated descriptions.
GT (Ground Truth): A baby is playing outside with two dogs. The baby rolls the ball and the dog brings the ball back to the baby. The baby tosses the ball again and again for the dogs. One of the dogs walk away but the other stays and plays with the baby.
I (Image): The dog is standing on the bed. The dog is looking at the dog. The dog is walking on the ground. The dog is walking around the room.
seq-seq (RNEXT, R3D): A dog is walking in the water with a baby. The dog runs up and down the water. The dog runs up and down the slide. A baby is walking around the house with a baby.
seq-seq+context (RNEXT, R3D): A dog is standing in the middle of a house. The dog runs around the room and the dog jumps up and down. The dog is walking on the floor and the dog walks away. The girl runs around the house and the other dog runs away.

Table 5: Captioning performance of the baseline models on the VideoStory test set. We report BLEU (B), METEOR (M), ROUGE-L (R), and CIDEr (C). Best scores are in bold.
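Such caption metrics are commonly computed with the COCO caption evaluation toolkit; the sketch below assumes the pycocoevalcap package and already-tokenized text, and illustrates how scores over the test set could be aggregated. It is an assumption about tooling, not necessarily the evaluation code used in this paper.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def caption_scores(references, hypotheses):
    """references: {segment_id: [reference sentence, ...]} (e.g., three per
    segment on val/test); hypotheses: {segment_id: [generated sentence]}."""
    results = {}
    for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                         ("ROUGE-L", Rouge()), ("CIDEr", Cider())]:
        score, _ = scorer.compute_score(references, hypotheses)
        results[name] = score  # Bleu returns a list of BLEU-1..4 values
    return results
```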
In Table 5, we present the performance of our baseline models on the VideoStory test set. We observe that models that consider context (seq-seq+context) from the previously generated sentence perform better than the corresponding models without context (seq-seq), with both 3D convolution based features (R3D) and image-video fusion features (RNEXT, R3D). This indicates that our model benefits from contextual information, and that sentences in our stories are contextual rather than independent.
To validate the strength of our baseline model, we train our best performing model on ActivityNet Captions. It achieves 10.92 (METEOR) and 43.42 (CIDEr) on the val set, close to state-of-the-art results of 11.06 and 44.71 by Zhou et al. (2018b), indicating that it is a strong baseline. However, when evaluating our ActivityNet model on our VideoStory dataset (Table 5, last row), we see significantly lower performance compared to a model trained on our dataset, highlighting the complementary nature of our dataset.
Our image-only (single frame) model has the lowest scores across all metrics, suggesting that a single image is not enough to generate contextual descriptions. We observe that our fusion models consistently outperform models with video-only R3D features, indicating that image features from an ImageNet pre-trained model complement the activity-based R3D features. We show qualitative results from the variants of our models in Table 4. We observe that the single-frame model tends to repeat the same captions, and the seq-seq model without context repeats phrases within its descriptions.

Conclusions
This paper introduces a dataset which we sourced from videos on social media and annotated with multi-sentence descriptions. We benchmark strong baseline approaches on the dataset, and our evaluations show that our dataset is complementary to prior work due to its more diverse topics and its selection of engaging videos which tell a story. Our VideoStory dataset can serve as a good benchmark for building models for story understanding and multi-sentence video description.