Sort Story: Sorting Jumbled Images and Captions into Stories

Temporal common sense has applications in AI tasks such as QA, multi-document summarization, and human-AI communication. We propose the task of sequencing: given a jumbled set of aligned image-caption pairs that belong to a story, the task is to sort them such that the output sequence forms a coherent story. We present multiple approaches, via unary (position) and pairwise (order) predictions, and their ensemble-based combinations, achieving strong results on this task. As features, we use both text-based and image-based features, which provide complementary improvements. Using qualitative examples, we demonstrate that our models have learnt interesting aspects of temporal common sense.


Introduction
Sequencing is a task for children that is aimed at improving understanding of the temporal occurrence of a sequence of events. Given a jumbled set of images (and possibly captions) that belong to a single story, the task is to sort them into the correct order so that they form a coherent story. Our motivation in this work is to enable AI systems to better understand and predict the temporal nature of events in the world. To this end, we train machine learning models to perform the task of "sequencing".
Temporal reasoning has a number of applications such as multi-document summarization of multiple sources of, say, news information where the relative order of events can be useful to accurately merge information in a temporally consistent manner. In question answering tasks (Richardson et al., 2013; Fader et al., 2014; Weston et al., 2015; Ren et al., 2015), answering questions related to when an event occurs, or what events occurred prior to a particular event, requires temporal reasoning. A good temporal model of events in everyday life, i.e., a "temporal common sense", could also improve the quality of communication between AI systems and humans.
Figure 1: (a) The input is a jumbled set of aligned image-caption pairs. (b) Actual output of the system: an ordered sequence of image-caption pairs that form a coherent story.
Stories are a form of narrative sequences that have an inherent temporal common sense structure. We propose the use of visual stories depicting personal events to learn temporal common sense. We use stories from the Sequential Image Narrative Dataset (SIND) (Ting-Hao Huang, 2016) in which a set of 5 aligned image-caption pairs together form a coherent story. Given an input story that is jumbled (Fig. 1(a)), we train machine learning models to sort them into a coherent story (Fig. 1(b)).
Our contributions are as follows:
- We propose the task of visual story sequencing.
- We implement two approaches to solve the task: one based on individual story elements to predict position, and the other based on pairwise story elements to predict relative order of story elements. We also combine these approaches in a voting scheme that outperforms the individual methods.
- As features, we represent a story element as both text-based features from the caption and image-based features, and show that they provide complementary improvements. For text-based features, we use both sentence context and relative order based distributed representations.
- We show qualitative examples of our models learning temporal common sense.

Related Work
Temporal ordering has a rich history in NLP research. Scripts (Schank and Abelson, 2013), and more recently, narrative chains (Chambers and Jurafsky, 2008) contain information about the participants and causal relationships between events that enable the understanding of stories. A number of works (Mani and Schiffman, 2005; Mani et al., 2006; Boguraev and Ando, 2005) learn temporal relations and properties of news events from the dense, expert-annotated TimeBank corpus (Pustejovsky et al., 2003). In our work, however, we use multimodal story data that has no temporal annotations.
A number of works also reason about temporal ordering by using manually defined linguistic cues (Webber, 1988; Passonneau, 1988; Lapata and Lascarides, 2006; Hitzeman et al., 1995; Kehler, 2000). Our approach uses neural networks to avoid feature design for learning temporal ordering.
Recent works (Modi and Titov, 2014; Modi, 2016) learn distributed representations for predicates in a sentence for the tasks of event ordering and cloze evaluation. Unlike their work, our approach makes use of multi-modal data with free-form natural language text to learn event embeddings. Further, our models are trained end-to-end while their pipelined approach involves parsing and extracting verb frames from each sentence, where errors may propagate from one module to the next (as discussed in Section 4.3). Chen et al. (2009) use a generalized Mallows model for modeling sequences for coherence within single documents. Their approach may also be applicable to our task. Recently, Mostafazadeh et al. (2016) presented the "ROCStories" dataset of 5-sentence stories with stereotypical causal and temporal relations between events. In our work though, we make use of a multi-modal story dataset that contains both images and associated story-like captions.
Some works in vision (Pickup et al., 2014; Basha et al., 2012) also temporally order images, typically by finding correspondences between multiple images of the same scene using geometry-based approaches. Similarly, Choi et al. (2016) compose a story out of multiple short video clips. They define metrics based on scene dynamics and coherence, and use dense optical flow and patch-matching. In contrast, our work deals with stories containing potentially visually dissimilar but semantically coherent sets of images and captions.
A few other recent works (Kim et al., 2015; Kim et al., 2014; Kim and Xing, 2014; Sigurdsson et al., 2016; Bosselut et al., 2016; Wang et al., 2016) summarize hundreds of individual streams of information (images, text, videos) from the web that deal with a single concept or event, to learn a common theme or storyline or for timeline summarization. Our task, however, is to predict the correct sorting of a given story, which is different from summarization or retrieval. Ramanathan et al. (2015) attempt to learn temporal embeddings of video frames in complex events. While their motivation is similar to ours, they deal with sampled frames from a video while we attempt to learn temporal common sense from multi-modal stories consisting of a sequence of aligned image-caption pairs.

Approach
In this section, we first describe the two components in our approach: unary scores that do not use context, and pairwise scores that encode relative orderings of elements. Next, we describe how we combine these scores through a voting scheme.

Unary Models
Let $\sigma \in \Sigma_n$ denote a permutation of $n$ elements (image-caption pairs). We use $\sigma_i$ to denote the position of element $i$ in the permutation $\sigma$. A unary score $S_u(\sigma)$ captures the appropriateness of each story element $i$ in position $\sigma_i$:
$$S_u(\sigma) = \sum_{i=1}^{n} P(\sigma_i \mid i)$$
where $P(\sigma_i \mid i)$ denotes the probability of the element $i$ being present in position $\sigma_i$, which is the output from an $n$-way softmax layer in a deep neural network. We experiment with two networks:
(1) A language-alone unary model (Skip-Thought+MLP) that uses a Gated Recurrent Unit (GRU) proposed by Cho et al. (2014) to embed a caption into a vector space. We use the Skip-Thought (Kiros et al., 2015) GRU, which is trained on the BookCorpus (Zhu et al., 2015) to predict the context (preceding and following sentences) of a given sentence. These embeddings are fed as input into a Multi-Layer Perceptron (MLP).
(2) A language+vision unary model (Skip-Thought+CNN+MLP) that embeds the caption as above and embeds the image via a Convolutional Neural Network (CNN). We use the activations from the penultimate layer of the 19-layer VGGnet (Simonyan and Zisserman, 2014), which have been shown to generalize well. Both embeddings are concatenated and fed as input to an MLP.
In both cases, the best ordering of the story elements (optimal permutation) $\sigma^* = \arg\max_{\sigma \in \Sigma_n} S_u(\sigma)$ can be found efficiently in $O(n^3)$ time with the Hungarian algorithm (Munkres, 1957). Since these unary scores are not influenced by other elements in the story, they capture the semantics and linguistic structures associated with specific positions of stories, e.g., the beginning, the middle, and the end.
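The search for the best permutation under a unary score can be illustrated with a minimal Python sketch. It assumes that the unary score adds up the per-element position probabilities $P(\sigma_i \mid i)$, and it swaps the Hungarian algorithm for exhaustive enumeration, which is adequate only for small $n$. The function name `best_unary_ordering` and the toy probabilities are hypothetical and serve illustration only.

```python
from itertools import permutations


def best_unary_ordering(pos_probs):
    """Return the permutation sigma that maximizes the summed position probabilities.

    pos_probs[i][p] is the (assumed) softmax probability that element i
    belongs at position p; sigma[i] is the position assigned to element i.
    Exhaustive search stands in for the Hungarian algorithm here.
    """
    n = len(pos_probs)

    def unary_score(sigma):
        return sum(pos_probs[i][sigma[i]] for i in range(n))

    return max(permutations(range(n)), key=unary_score)


# Hypothetical softmax outputs for a 3-element story (each row sums to 1).
probs = [
    [0.1, 0.2, 0.7],  # element 0 most likely belongs last
    [0.6, 0.3, 0.1],  # element 1 most likely belongs first
    [0.3, 0.5, 0.2],  # element 2 most likely belongs in the middle
]
sigma = best_unary_ordering(probs)
print(sigma)
```

For longer stories, the same assignment problem can be handed to a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` with `maximize=True`.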

Pairwise Models
Similar to learning-to-rank approaches (Hang, 2011), we develop pairwise scoring models that, given a pair of elements $(i, j)$, learn to assign a score $S([\![\sigma_i < \sigma_j]\!] \mid i, j)$ indicating whether element $i$ should be placed before element $j$ in the permutation $\sigma$. Here, $[\![\cdot]\!]$ denotes the Iverson bracket (which is 1 if the input argument is true and 0 otherwise). We develop and experiment with the following three pairwise models:
(1) A language-alone pairwise model (Skip-Thought+MLP) that takes as input a pair of Skip-Thought embeddings and trains an MLP (with hinge-loss) that outputs $S([\![\sigma_i < \sigma_j]\!] \mid i, j)$, the score for placing $i$ before $j$.
(2) A language+vision pairwise model (Skip-Thought+CNN+MLP) that concatenates the Skip-Thought and CNN embeddings for i and j and trains a similar MLP as above.
(3) A language-alone neural position embedding (NPE) model. Instead of using frozen Skip-Thought embeddings, we learn a task-aware ordered distributed embedding for sentences. Specifically, each sentence in the story is embedded as $X = (x_1, \ldots, x_n)$, $x_i \in \mathbb{R}^d_+$, via an LSTM (Hochreiter and Schmidhuber, 1997) with ReLU non-linearities. Similar to the max-margin loss that is applied to negative examples by Vendrov et al. (2016), we use an asymmetric penalty that encourages sentences appearing early in the story to be placed closer to the origin than sentences appearing later in the story.
At train time, the parameters of the LSTM are learned end-to-end to minimize this asymmetric ordered loss (as measured over the gold-standard sequences). At test time, we use $S([\![\sigma_i < \sigma_j]\!] \mid i, j) = L_{ij}$. Thus, as we move away from the origin in the embedding space, we traverse through the sentences in a story. Each of these three pairwise approaches assigns a score $S(\sigma_i, \sigma_j \mid i, j)$ to an ordered pair of elements $(i, j)$, which is used to construct a pairwise scoring model $S_p(\sigma)$ by summing over the scores for all possible ordered pairs in the permutation. This pairwise score captures local contextual information in stories. Finding the best permutation $\sigma^* = \arg\max_{\sigma \in \Sigma_n} S_p(\sigma)$ under this pairwise model is NP-hard, so approximations are required in general. In our experiments, we study short sequences ($n = 5$), where the space of permutations is easily enumerable ($5! = 120$). For longer sequences, we can utilize integer programming methods or well-studied spectral relaxations for this problem.
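The exhaustive search over permutations for short stories can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the pairwise score of a permutation is taken to be the sum of the "i before j" scores over every ordered pair that the permutation actually places in that order, and the matrix `before` with its toy values is hypothetical rather than the output of any trained model.

```python
from itertools import permutations


def pairwise_score(sigma, before):
    """Sum before[i][j] over all ordered pairs (i, j) that sigma places with i ahead of j.

    sigma[i] is the position of element i; before[i][j] is the (assumed)
    model score for placing element i before element j.
    """
    n = len(sigma)
    return sum(
        before[i][j]
        for i in range(n)
        for j in range(n)
        if i != j and sigma[i] < sigma[j]
    )


def best_pairwise_ordering(before):
    """Enumerate all n! permutations and keep the highest-scoring one."""
    n = len(before)
    return max(permutations(range(n)), key=lambda s: pairwise_score(s, before))


# Hypothetical scores for a 3-element story whose true order is 2, 0, 1.
before = [
    [0.0, 0.7, 0.1],  # element 0: before 1 is likely, before 2 is unlikely
    [0.3, 0.0, 0.2],  # element 1: unlikely to precede anything
    [0.9, 0.8, 0.0],  # element 2: likely to precede both others
]
sigma = best_pairwise_ordering(before)
print(sigma)
```

With $n = 5$ this loop visits only 120 permutations, so enumeration is cheap; the cost grows factorially for longer sequences.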

Voting-based Ensemble
To combine the complementary information captured by the unary ($S_u$) and pairwise ($S_p$) models, we use a voting-based ensemble. For each method in the ensemble, we find the top three permutations. Each of these permutations ($\sigma^k$) then votes for a particular element to be placed at a particular position. Let $V$ be a vote matrix such that $V_{ij}$ stores the number of votes for the $i$-th element to occur at the $j$-th position, i.e.,
$$V_{ij} = \sum_{k} [\![\sigma^k_i = j]\!].$$
We use the Hungarian algorithm to find the optimal permutation that maximizes the votes assigned, i.e.,
$$\sigma^*_{\text{vote}} = \arg\max_{\sigma \in \Sigma_n} \sum_{i=1}^{n} V_{i\sigma_i}.$$
We experimented with a number of model voting combinations and found the combination of pairwise Skip-Thought+CNN+MLP and neural position embeddings to work best (based on a validation set).
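The voting scheme can be illustrated with a short Python sketch. It assumes that each top-ranked permutation contributes one vote per element for the position it assigns to that element, and it uses exhaustive enumeration as a stand-in for the Hungarian algorithm. The function names and the two lists of top-three permutations are hypothetical.

```python
from itertools import permutations


def vote_matrix(top_perms, n):
    """Build V where V[i][j] counts the permutations that place element i at position j."""
    votes = [[0] * n for _ in range(n)]
    for sigma in top_perms:
        for element, position in enumerate(sigma):
            votes[element][position] += 1
    return votes


def best_voted_ordering(votes):
    """Pick the permutation collecting the most votes (brute force, small n only)."""
    n = len(votes)
    return max(
        permutations(range(n)),
        key=lambda s: sum(votes[i][s[i]] for i in range(n)),
    )


# Hypothetical top-3 permutations from two methods on a 3-element story.
method_a = [(0, 1, 2), (0, 2, 1), (1, 0, 2)]
method_b = [(0, 1, 2), (1, 0, 2), (2, 1, 0)]
votes = vote_matrix(method_a + method_b, 3)
sigma_vote = best_voted_ordering(votes)
print(votes, sigma_vote)
```

Because the objective is a sum of one matrix entry per element, it is a linear assignment problem, which is why the Hungarian algorithm applies at larger $n$.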

Data
We train and evaluate our model on personal multimodal stories from the SIND (Sequential Image Narrative Dataset) (Ting-Hao Huang, 2016), where each story is a sequence of 5 images and corresponding story-like captions. The narrative captions in this dataset, e.g., "friends having a good time" (as opposed to "people sitting next to each other") capture a sequential, conversational language, which is characteristic of stories. We use 40,155 stories for training, 4,990 for validation, and 5,055 for testing.

Metrics
We evaluate the performance of our model at correctly ordering a jumbled set of story elements using the following three metrics: Spearman's rank correlation (Sp.) (Spearman, 1904) measures whether the rankings of story elements in the predicted and ground truth orders are monotonically related (higher is better). Pairwise accuracy (Pairw.) measures the fraction of pairs of elements whose predicted relative ordering is the same as the ground truth order (higher is better). Average Distance (Dist.) measures the average change in position of all elements in the predicted story from their respective positions in the ground truth story (lower is better).
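The three metrics can be computed for a single story with the minimal Python sketch below. It assumes that predictions and ground truth are given as tuples of 0-based positions without ties, so that Spearman's correlation reduces to the closed form $1 - 6\sum_i d_i^2 / (n(n^2 - 1))$. The function names and the example story are hypothetical.

```python
from itertools import combinations


def spearman(pred, gold):
    """Spearman's rank correlation for two tie-free position vectors."""
    n = len(pred)
    d2 = sum((p - g) ** 2 for p, g in zip(pred, gold))
    return 1 - 6 * d2 / (n * (n * n - 1))


def pairwise_accuracy(pred, gold):
    """Fraction of element pairs whose relative order matches the ground truth."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] < pred[j]) == (gold[i] < gold[j]) for i, j in pairs)
    return agree / len(pairs)


def average_distance(pred, gold):
    """Mean absolute displacement of elements from their true positions."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


# Hypothetical 5-element story in which the second and third elements are swapped.
gold = (0, 1, 2, 3, 4)
pred = (0, 2, 1, 3, 4)
print(spearman(pred, gold), pairwise_accuracy(pred, gold), average_distance(pred, gold))
```

In this example one of the ten pairs is misordered and two elements are each displaced by one position, which the assertions in any accompanying check can confirm numerically.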

Results
Pairwise Models vs. Unary Models: As shown in Table 1, the pairwise models based on Skip-Thought features outperform the unary models in our task. However, the Pairwise Order Model performs worse than the unary Skip-Thought model, suggesting that the Skip-Thought features, which encode context of a sentence, also provide a crucial signal for temporal ordering of story sentences.
We also attempted to duplicate the pipelined approach of Modi and Titov (2014). For this, we first parse our story sentences to extract SVO (subject, verb, object) tuples (using the Stanford Parser (Chen and Manning, 2014)). However, this step succeeds for only 60% of our test data. Even if we assume a perfect downstream algorithm that always makes the correct position prediction given SVO tuples, the overall performance is still a Spearman correlation of just 0.473, i.e., the upper bound performance of this pipelined approach is lower than the performance of our text-only end-to-end model (correlation of 0.546) in Table 1.

Qualitative Analysis
Visualizations of position predictions from our model demonstrate that it has learnt the three-act structure (Trottier, 1998) in stories: the setup, the middle, and the climax. We also present success and failure examples of our sorting model's predictions. See the supplementary material for more details and figures. We visualize our model's temporal common sense in Fig. 2. The word clouds show discriminative words, i.e., the words that the model believes are indicative of sentence positions in a story. The size of a word is proportional to the ratio of its frequency of occurring in that position to its frequency in other positions. Words like 'party' and 'wedding' are discriminative for the first sentence, probably because our model believes that the start of the story describes the setup: the occasion or event. People often tend to describe meeting friends or family members, which probably results in discriminative words such as 'people', 'friend', and 'everyone' in the second and third sentences. Moreover, the model believes that people tend to conclude stories using words like 'finally' and 'afterwards', and tend to talk about a 'great day', group 'pictures' with everyone, etc.

Conclusion
We propose the task of "sequencing" in a set of image-caption pairs, with the motivation of learning temporal common sense. We implement multiple neural network models based on individual and pairwise element-based predictions (and their ensemble), and utilize both image and text features, to achieve strong performance on the task. Our best system, on average, predicts the ordering of sentences to within a distance error of 0.8 (out of 5) positions. We also analyze our predictions and show qualitative examples that demonstrate temporal common sense.