Game-Based Video-Context Dialogue

Current dialogue systems focus more on textual and speech context knowledge and are usually based on two speakers. Some recent work has investigated static image-based dialogue. However, several real-world human interactions also involve dynamic visual context (similar to videos) as well as dialogue exchanges among multiple speakers. To move closer towards such multimodal conversational skills and visually-situated applications, we introduce a new video-context, many-speaker dialogue dataset based on live-broadcast soccer game videos and chats from Twitch.tv. This challenging testbed allows us to develop visually-grounded dialogue models that should generate relevant temporal and spatial event language from the live video, while also being relevant to the chat history. For strong baselines, we also present several discriminative and generative models, e.g., based on tridirectional attention flow (TriDAF). We evaluate these models via retrieval ranking-recall, automatic phrase-matching metrics, as well as human evaluation studies. We also present dataset analyses, model ablations, and visualizations to understand the contribution of different modalities and model components.


Introduction
Dialogue systems or conversational agents that can hold natural, relevant, and coherent interactions with humans have been a long-standing goal of artificial intelligence and machine learning. There has been a lot of important previous work in this field for decades (Weizenbaum, 1966; Isbell et al., 2000; Rambow et al., 2001; Rieser et al., 2005; Georgila et al., 2006; Rieser and Lemon, 2008; Ritter et al., 2011).
We release all data, code, and models at: https://github.com/ramakanth-pasunuru/video-dialogue
Figure 1: Sample example from our many-speaker, video-context dialogue dataset, based on live soccer game chat. The task is to predict the response (bottom-right) using the video context (left) and the chat context (top-right). Chat utterances shown in the figure: "S1: what an offside trap OMEGALUL", "S2: Lol that finish bro", "S3: suprised you didn't do the extra pass", "S4: @S10 a drunk bet?", "S5: @S11 thanks mate", "S6: could have passed one more", "S7: Pass that", "S1: record now!", "S8: !record", "S9: done a nother pass there".
Current dialogue tasks are usually focused on the textual or verbal context (conversation history). In terms of multimodal dialogue, speech-based spoken dialogue systems have been widely explored (Eckert et al., 1997; Young, 2000; Janin et al., 2003; Celikyilmaz et al., 2017; Wen et al., 2015; Su et al., 2016; Mrkšić et al., 2016), as well as work on gesture and haptics based dialogue (Johnston et al., 2002; Cassell, 1999; Foster et al., 2008). In order to address the additional advantage of using visually-grounded context knowledge in dialogue, recent work introduced the visual dialogue task (Das et al., 2017; de Vries et al., 2017; Mostafazadeh et al., 2017). However, the visual context in these tasks is limited to one static image. Moreover, the interactions are between two speakers with fixed roles (one asks questions and the other answers).
Several situations of real-world dialogue among humans involve more 'dynamic' visual context, i.e., video-style information of the world moving around us (both spatially and temporally). Further, several human conversations involve more than two speakers, with changing roles. In order to develop such dynamically-visual multimodal dialogue models, we introduce a new 'many-speaker, video-context chat' testbed, along with a new dataset and models for the same. Our dataset is based on live-broadcast soccer (FIFA-18) game videos from the 'Twitch.tv' live video streaming platform, along with the spontaneous, many-speaker live chats about the game. This challenging testbed allows us to develop dialogue models where the generated response is required to be relevant to the temporal and spatial events in the live video, as well as be relevant to the chat history (with potential impact towards video-grounded applications such as personal assistants, intelligent tutors, and human-robot collaboration).
We also present several strong discriminative and generative baselines that learn to retrieve and generate bimodal-relevant responses. We first present a triple-encoder discriminative model to encode the video, chat history, and response, and then classify the relevance label of the response. We then improve over this model via tridirectional attention flow (TriDAF). For the generative models, we model bidirectional attention flow between the video and textual chat context encoders, which then decodes the response. We evaluate these models via retrieval ranking-recall, phrase-matching metrics, as well as human evaluation studies. We also present dataset analysis as well as model ablations and attention visualizations to understand the contribution of the video vs. chat modalities and the model components.

Related Work
Early dialogue systems comprised a natural language (NL) understanding unit, a dialogue manager, and an NL generation unit (Bates, 1995). Statistical learning methods were used for automatic feature extraction (Dowding et al., 1993; Mikolov et al., 2013), dialogue managers incorporated reward-driven reinforcement learning (Young et al., 2013; Shah et al., 2016), and the generation units have been extended with seq2seq neural network models (Vinyals and Le, 2015; Serban et al., 2016; Luan et al., 2016).
In addition to the focus on textual dialogue context, using multimodal context brings more potential for having real-world grounded conversations. For example, spoken dialogue systems have been widely explored (Gurevych and Strube, 2004; Georgila et al., 2006; Eckert et al., 1997; Young, 2000; Janin et al., 2003; De Mori, 2007; Wen et al., 2015; Su et al., 2016; Mrkšić et al., 2016; Hori et al., 2016; Celikyilmaz et al., 2015, 2017), as well as gesture and haptics based dialogue (Johnston et al., 2002; Cassell, 1999; Foster et al., 2008). Additionally, dialogue systems for digital personal assistants are also well explored (Myers et al., 2007; Sarikaya et al., 2016; Damacharla et al., 2018). In the visual modality direction, some important recent attempts have been made to use static image based context in dialogue systems (Das et al., 2017; de Vries et al., 2017; Mostafazadeh et al., 2017), who proposed the 'visual dialog' task, where the human can ask questions on a static image, and an agent interacts by answering these questions based on the previous chat context and the image's visual features. Also, Celikyilmaz et al. (2014) used visual display information for on-screen item resolution in utterances for improving personal digital assistants.
In contrast, we propose to employ dynamic video-based information as visual context knowledge in dialogue models, so as to move towards video-grounded intelligent assistant applications. In the video+language direction, previous work has looked at video captioning (Venugopalan et al., 2015) as well as Q&A and fill-in-the-blank tasks on videos (Tapaswi et al., 2016; Jang et al., 2017; Maharaj et al., 2017) and interactive 3D environments (Das et al., 2018; Yan et al., 2018; Gordon et al., 2017; Anderson et al., 2017). There has also been early related work on generating sportscast commentaries from simulation (RoboCup) soccer videos represented as non-visual state information (Chen and Mooney, 2008). Also, Liu et al. (2016a) presented some initial ideas on robots learning grounded task representations by watching and interacting with humans performing the task (i.e., by converting human demonstration videos to Causal And-Or graphs). On the other hand, we propose a new video-chat dataset where the dialogue models need to generate the next response in the sequence of chats, conditioned both on the raw video features as well as the previous textual chat history. Moreover, our new dataset presents a many-speaker conversation setting, similar to previous work on meeting understanding and Computer Supported Cooperative Work (CSCW) (Janin et al., 2003; Waibel et al., 2001; Schmidt and Bannon, 1992). In the live video stream direction, Fu et al. (2017) and Ping and Chen (2017) used real-time comments to predict the frame highlights in a video, and Barbieri et al. (2017) presented emotes and troll prediction.

Dataset Collection and Processing
For our new video-context dialogue task, we used the publicly accessible Twitch.tv live broadcast platform, and collected videos of soccer (FIFA-18) games along with the users' live chat conversations about the game. This dataset has videos involving various realistic human actions and events in a complex sports environment and hence serves as a good testbed and first step towards multimodal video-based dialogue data. An example is shown in Fig. 1 (and an original screenshot example in Fig. 2), where the users perform a complex 'many-speaker', 'multimodal' dialogue. Overall, we collected 49 FIFA-18 game videos along with their users' chat, and divided them into 33 videos for training, 8 videos for validation, and 8 videos for testing. Each such video is several hours long, providing a good amount of data (Table 2).
To extract triples (instances) of video context, chat context, and response from this data, we divide these videos based on fixed time frames instead of a fixed number of utterances, in order to maintain conversation topic clusters (because of the sparse nature of the chat utterance count over time). First, we use 20-sec context windows to extract the video clips and users' utterances in this time frame, and use them as our video and chat contexts, respectively. Next, the chat utterances in the immediately-following 10-sec window (response window) that do not overlap with the next instance's context window are considered as potential responses. Hence, there are only two instances (triples) in a 60-sec long video, i.e., each instance spans a 20-sec video+chat context window and a 10-sec response window, and there is no overlap between the instances. Now, out of these potential responses, to only allow the response that has at least some good coherence and relevance with the chat context's topic, we choose the first (earliest) response that has high similarity with some other utterance in this response window (using a 0.5 BLEU-4 threshold, based on manual inspection).
Human Quality Evaluation of Data Filtering Process: To evaluate the quality of the responses that result from our filtering process described above, we performed an anonymous (randomly shuffled w/o identity) human comparison between the response selected by our filtering process vs. the first response from the response window without any filtering, based on relevance w.r.t. video and chat context. Table 1 presents the results on a sample of size 100, showing that humans in a blind test found 90% (34+56) of our filtered responses to be valid responses, verifying that our response selection procedure is reasonable. Furthermore, out of these 90% valid responses, we found that 55% are chat-only relevant, 11% are video-only relevant, and 24% are both video+chat relevant.
Table 1: Human comparison of the filtered response vs. the first response (relevance to video+chat).
  Filtered response wins: 34%
  1st response wins: 3%
  Non-distinguishable: 63% (56 both-good, 7 both-bad)
In order to make the above procedure safe and to make the dataset more challenging, we also discourage frequent responses (top-20 most-frequent generic utterances) unless no other response satisfies the similarity condition, hence suppressing the frequent responses. [3] If we cannot find any utterance based on the multi-response matching procedure described above, then we just consider the first utterance in the 10-second window as the response. [4] We also make sure that the chat context window has at least 4 utterances; otherwise we exclude that context window and also the corresponding response window from the dataset. After all this processing, our final resulting dataset contains 10,510 samples in training, 2,153 samples in validation, and 2,780 samples in test. [5]
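The windowing and response-selection procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released preprocessing code: the `Utterance` type, the function names, and the token-overlap `similarity` (a simple stand-in for the BLEU-4 score) are hypothetical, and the 30-sec stride is inferred from the statement that a 60-sec video yields two non-overlapping instances.

```python
from typing import Callable, List, NamedTuple, Optional, Set, Tuple


class Utterance(NamedTuple):
    time: float  # seconds from the start of the video
    text: str


def similarity(a: str, b: str) -> float:
    """Stand-in for BLEU-4 (assumption): Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def select_response(window: List[str], frequent: Set[str], threshold: float = 0.5,
                    sim: Callable[[str, str], float] = similarity) -> Optional[str]:
    """Pick the earliest utterance that is similar to another one in the window.

    Frequent (generic) utterances are used only if no other utterance passes
    the similarity test; if nothing passes, fall back to the first utterance.
    """
    if not window:
        return None
    matched = [u for i, u in enumerate(window)
               if any(sim(u, w) >= threshold
                      for j, w in enumerate(window) if j != i)]
    rare = [u for u in matched if u.lower() not in frequent]
    if rare:
        return rare[0]
    return matched[0] if matched else window[0]


def extract_instances(chat: List[Utterance], video_len: float, frequent: Set[str],
                      ctx: float = 20.0, resp: float = 10.0,
                      min_ctx: int = 4) -> List[Tuple[float, List[str], str]]:
    """Return (clip_start, chat_context, response) triples with no overlap."""
    instances = []
    start = 0.0
    while start + ctx + resp <= video_len:
        context = [u.text for u in chat if start <= u.time < start + ctx]
        window = [u.text for u in chat
                  if start + ctx <= u.time < start + ctx + resp]
        response = select_response(window, frequent)
        # Skip instances with too little chat context or no response at all.
        if len(context) >= min_ctx and response is not None:
            instances.append((start, context, response))
        start += ctx + resp
    return instances
```

The video clip is represented here only by its start time; a real pipeline would also slice the corresponding frames or features for the same 20-sec window.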

Dataset Analysis
Dataset Statistics: Table 2 presents the full statistics on train, validation, and test sets of our Twitch-FIFA dataset, after the filtering process described in Sec. 3.1. As shown, the average chat context length in the dataset is around 68 words, and the average response length is 6.3 words.
Chat Context Size: Fig. 3 presents the study of the number of utterances in the chat context vs. the number of such training samples. As we limit the minimum number of utterances to 4, chat contexts with fewer than 4 utterances are not present in the dataset. From Fig. 3, it is clear that as the number of utterances in the chat context increases, the number of such training samples decreases.
Frequent Words: Fig. 4 presents the top-20 frequent words (excluding stop words) and their corresponding frequency in our Twitch-FIFA dataset. Most of these frequent words are related to soccer vocabulary. Also, some of these frequent words are Twitch emotes (e.g., 'kappa', 'inceptionlove').
[3] Note that this filtering suppresses the performance of the simple frequent-response baseline described in Sec. 4.1.
[4] Other preprocessing steps include: omitting the utterances in the response window which refer to a speaker name outside the current chat context; removing non-representative utterances, e.g., those with hyperlinks; and replacing (anonymizing) all the user identities mentioned in the utterances with a common tag (i.e., anonymizing due to similar intuitions from the Q&A community (Hermann et al., 2015)).
[5] Note that this is substantially larger than or comparable to most current video captioning datasets. We plan to further extend our dataset based on diverse games and video types.

Baselines
Our simple non-trained baselines are Most-Frequent-Response (re-rank the candidate responses based on their frequency in the training set), Chat-Response-Cosine (re-rank the candidate responses based on their similarity score w.r.t. the chat context), and Nearest-Neighbor (find the K-best similar chat contexts in the training set, take their corresponding responses, and then re-rank the candidate responses based on mean similarity score w.r.t. this K-best response set). For trained baselines, we use logistic regression and Naive Bayes methods. We use the final state of a Twitch-trained RNN Language Model to represent the chat context and response. Please see supplementary for full details.
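As a rough illustration of the three non-trained re-ranking baselines, the sketch below uses bag-of-words count vectors as a stand-in for the RNN language model's final state; that substitution, along with all function names, is an assumption made only to keep the example self-contained.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bow(text: str) -> Dict[str, int]:
    """Bag-of-words vector; a stand-in (assumption) for the RNN-LM final state."""
    return Counter(text.lower().split())


def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_most_frequent(candidates: List[str], train_responses: List[str]) -> List[str]:
    """Most-Frequent-Response: order candidates by training-set frequency."""
    freq = Counter(r.lower() for r in train_responses)
    return sorted(candidates, key=lambda c: -freq[c.lower()])


def rank_chat_cosine(candidates: List[str], chat_context: str) -> List[str]:
    """Chat-Response-Cosine: order candidates by similarity to the chat context."""
    ctx = bow(chat_context)
    return sorted(candidates, key=lambda c: -cosine(bow(c), ctx))


def rank_nearest_neighbor(candidates: List[str], chat_context: str,
                          train_pairs: List[Tuple[str, str]], k: int = 2) -> List[str]:
    """Nearest-Neighbor: score candidates against the responses of the
    k training contexts most similar to the given chat context."""
    ctx = bow(chat_context)
    best = sorted(train_pairs, key=lambda p: -cosine(bow(p[0]), ctx))[:k]
    refs = [bow(r) for _, r in best]

    def score(c: str) -> float:
        return sum(cosine(bow(c), r) for r in refs) / len(refs) if refs else 0.0

    return sorted(candidates, key=lambda c: -score(c))
```

Each function returns the candidate list in ranked order, so the same recall@k evaluation can be applied to all three.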

Triple Encoder
For our simpler discriminative model, we use a 'triple encoder' to encode the video context, chat context, and response (see Fig. 5), as an extension of the dual encoder model in Lowe et al. (2015). The task here is to predict the given training triple $(v, u, r)$ as positive or negative. Let $h^v_f$, $h^u_f$, and $h^r_f$ be the final state information of the video, chat, and response LSTM-RNN (bidirectional) encoders respectively; then the probability of a positive training triple is defined as follows:
$$p(v, u, r; \theta) = \sigma\left([h^v_f; h^u_f]^\top W h^r_f + b\right) \quad (1)$$
where $\sigma$ is the sigmoid function, and $W$ and $b$ are trainable parameters. Here, $W$ can be viewed as a similarity matrix which will bring the context $[h^v_f; h^u_f]$ into the same space as the response $h^r_f$, and get a suitable similarity score. For optimizing our discriminative model, we use a max-margin loss function similar to Mao et al. (2016) and Yu et al. (2017). Given a positive training triple $(v, u, r)$, let the corresponding negative training triples be $(v', u, r)$, $(v, u', r)$, and $(v, u, r')$, i.e., one modality is wrong at a time in each of these three (see Sec. 5 for the negative example selection). The max-margin loss is:
$$L(\theta) = \sum \big[ \max(0, M + \log p(v', u, r) - \log p(v, u, r)) + \max(0, M + \log p(v, u', r) - \log p(v, u, r)) + \max(0, M + \log p(v, u, r') - \log p(v, u, r)) \big] \quad (2)$$
where the summation is over all the training triples in the dataset, and $M$ is a tunable margin hyperparameter between positive and negative training triples.
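To make the scoring and the loss concrete, here is a small sketch in plain Python. It assumes the probability is a sigmoid of the bilinear score between the concatenated context state and the response state, and that the hinge terms compare log-probabilities; both forms, and all names, are illustrative assumptions rather than the released implementation.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def triple_score(h_v: Vector, h_u: Vector, h_r: Vector, W: Matrix, b: float) -> float:
    """p(v, u, r) = sigmoid([h_v; h_u]^T W h_r + b).

    W has one row per context dimension (len(h_v) + len(h_u)) and one
    column per response dimension (len(h_r)).
    """
    context = h_v + h_u  # list concatenation gives [h_v; h_u]
    W_r = [sum(w * x for w, x in zip(row, h_r)) for row in W]  # W h_r
    return sigmoid(sum(c * y for c, y in zip(context, W_r)) + b)


def max_margin_loss(p_pos: float, p_negs: List[float], margin: float) -> float:
    """Sum of hinge terms max(0, M + log p_neg - log p_pos), one per negative."""
    return sum(max(0.0, margin + math.log(p_neg) - math.log(p_pos))
               for p_neg in p_negs)
```

With three negatives per positive triple (wrong video, wrong chat, wrong response), `p_negs` holds the three scores for those corrupted triples; the loss is zero once each negative is separated from the positive by at least the margin.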

Tridirectional Attention Flow (TriDAF)
Our tridirectional attention flow model learns stronger joint spaces between the three modalities in a mutual-information way. We use bidirectional attention flow mechanisms (Seo et al., 2017) between the video and chat contexts, between the video context and the response, as well as between the chat context and the response, hence enabling attention flow across all three modalities, as shown in Fig. 6. We name this model Tridirectional Attention Flow or TriDAF. We will next discuss the bidirectional attention flow mechanism between video and chat contexts, but the same formulation holds true for bidirectional attention between video context and response, and between chat context and response. Given the video context hidden state $h^v_i$ and chat context hidden state $h^u_j$ at time steps $i$ and $j$ respectively, the bidirectional attention mechanism is based on the similarity score:
$$S^{(v,u)}_{i,j} = w_{S^{(v,u)}}^\top [h^v_i; h^u_j; h^v_i \odot h^u_j]$$
where $S^{(v,u)}_{i,j}$ is a scalar, $w_{S^{(v,u)}}$ is a trainable parameter, and $\odot$ denotes element-wise multiplication. The attention distribution from chat context to video context is defined as $\alpha_{i:} = \mathrm{softmax}(S_{i:})$, hence the chat-to-video context vector is $c^{v \leftarrow u}_i = \sum_j \alpha_{i,j} h^u_j$. Similarly, the attention distribution from video context to chat context is defined as $\beta_{j:} = \mathrm{softmax}(S_{:j})$, hence the video-to-chat context vector is $c^{u \leftarrow v}_j = \sum_i \beta_{j,i} h^v_i$. We then compute similar bidirectional attention flow mechanisms between the video context and response, and between the chat context and response. Then, we concatenate each hidden state and its corresponding context vectors from the other two modalities, e.g., $\hat{h}^v_i = [h^v_i; c^{v \leftarrow u}_i; c^{v \leftarrow r}_i]$ for the $i$-th timestep of the video context. Finally, we add a self-attention mechanism (Lin et al., 2017) across the concatenated hidden states of each of the three modules. If $\hat{h}^v_i$ is the final concatenated vector of the video context at time step $i$, then the self-attention weights $\alpha^s$ for this video context are the softmax of $e^s$:
$$e^s_i = V^v_a \tanh(W^v_a \hat{h}^v_i + b^v_a)$$
where $V^v_a$, $W^v_a$, and $b^v_a$ are trainable self-attention parameters. The final representation vector of the full video context after self-attention is $\hat{c}^v = \sum_i \alpha^s_i \hat{h}^v_i$. Similarly, the final representation vectors of the chat context and the response are $\hat{c}^u$ and $\hat{c}^r$, respectively. Finally, the probability that the given training triple $(v, u, r)$ is positive is:
$$p(v, u, r; \theta) = \sigma\left([\hat{c}^v; \hat{c}^u]^\top W \hat{c}^r + b\right)$$
Again, here also we use the max-margin loss (Eqn. 2).
Figure 6: Overview of our tridirectional attention flow (TriDAF) model with all pairwise modality attention modules, as well as self-attention on video context, chat context, and response as inputs.
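The bidirectional attention between one pair of modalities (here video and chat) can be sketched as below. The trilinear similarity over the two states and their element-wise product follows the formulation of Seo et al. (2017) and is assumed here; the function names and the list-of-lists representation are hypothetical, chosen only to avoid external dependencies.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: Vector, states: List[Vector]) -> Vector:
    return [sum(w * s[d] for w, s in zip(weights, states))
            for d in range(len(states[0]))]


def bidaf(h_v: List[Vector], h_u: List[Vector],
          w_s: Vector) -> Tuple[List[Vector], List[Vector]]:
    """Return (chat-to-video, video-to-chat) context vectors.

    S[i][j] = w_s . [h_v[i]; h_u[j]; h_v[i] * h_u[j]], so w_s needs
    length 3 * d, where d is the hidden size shared by both encoders.
    """
    S = [[sum(w * x for w, x in
              zip(w_s, hv + hu + [a * b for a, b in zip(hv, hu)]))
          for hu in h_u] for hv in h_v]
    # Row-wise softmax: each video step attends over all chat steps.
    c_v = [weighted_sum(softmax(row), h_u) for row in S]
    # Column-wise softmax: each chat step attends over all video steps.
    c_u = [weighted_sum(softmax([S[i][j] for i in range(len(h_v))]), h_v)
           for j in range(len(h_u))]
    return c_v, c_u
```

Running the same function on the (video, response) and (chat, response) pairs, each with its own `w_s`, and concatenating each state with its two context vectors gives the inputs to the self-attention layer.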

Seq2seq with Attention
Our simpler generative model is a sequence-to-sequence model with a bilinear attention mechanism (similar to Luong et al. (2015)). We have two encoders, one for encoding the video context and another for encoding the chat context, as shown in Fig. 7. We combine the final state information from both encoders and give it as the initial state to the response generation decoder. The two encoders and the decoder are all two-layer LSTM-RNNs. Let $h^v_i$ and $h^u_j$ be the hidden states of the video and chat encoders at time steps $i$ and $j$ respectively. At each time step $t$ of the decoder with hidden state $h^r_t$, the decoder attends to parts of the video and chat encoders and uses the combined information to generate the next token. Let $\alpha_t$ and $\beta_t$ be the attention weight distributions for the video and chat encoders respectively, with video context vector $c^v_t = \sum_i \alpha_{t,i} h^v_i$ and chat context vector $c^u_t = \sum_j \beta_{t,j} h^u_j$. The attention distribution for the video encoder is defined as (and the same holds for the chat encoder):
$$\alpha_{t,i} = \mathrm{softmax}_i\left({h^r_t}^\top W^v_a h^v_i\right)$$
where $W^v_a$ is a trainable parameter. Next, we concatenate the attention-based context information ($c^v_t$ and $c^u_t$) and the decoder hidden state ($h^r_t$), and do a non-linear transformation to get the final hidden state $\hat{h}^r_t$ as follows:
$$\hat{h}^r_t = \tanh\left(W_c [c^v_t; c^u_t; h^r_t]\right)$$
where $W_c$ is again a trainable parameter. Finally, we project the final hidden state information to vocabulary size and give it as input to a softmax layer to get the vocabulary distribution $p(r_t | r_{1:t-1}, v, u; \theta)$. During training, we minimize the cross-entropy loss defined as follows:
$$L_{XE}(\theta) = -\sum \sum_t \log p(r_t | r_{1:t-1}, v, u; \theta)$$
where the final summation is over all the training triples in the dataset. To additionally teach the model to give higher generative decoder probability to the positive response as compared to all the negative ones, we use a max-margin loss (similar to Eqn. 2 in Sec. 4.2.1):
$$L_{MM}(\theta) = \sum \big[ \max(0, M + \log p(r | v', u) - \log p(r | v, u)) + \max(0, M + \log p(r | v, u') - \log p(r | v, u)) + \max(0, M + \log p(r' | v, u) - \log p(r | v, u)) \big]$$
where the summation is over all the training triples in the dataset.
Overall, the final joint loss function is a weighted combination of cross-entropy loss and max-margin loss: $L(\theta) = L_{XE}(\theta) + \lambda L_{MM}(\theta)$, where $\lambda$ is a tunable hyperparameter.
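A single decoder attention step and the joint objective can be illustrated with the following sketch. It assumes the bilinear ("general") attention score of Luong et al. (2015) and per-token probabilities that are already computed; the function names are hypothetical and the code is not taken from the released models.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_attention(h_r: Vector, enc: List[Vector], W: Matrix) -> Vector:
    """Context vector sum_i alpha_i * enc[i], with alpha = softmax_i(h_r^T W enc[i])."""
    scores = []
    for h in enc:
        W_h = [sum(w * x for w, x in zip(row, h)) for row in W]  # W enc[i]
        scores.append(sum(a * b for a, b in zip(h_r, W_h)))
    alpha = softmax(scores)
    return [sum(a * h[d] for a, h in zip(alpha, enc)) for d in range(len(enc[0]))]


def cross_entropy(token_probs: Vector) -> float:
    """Negative log-likelihood of one response from its per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)


def joint_loss(xe: float, mm: float, lam: float) -> float:
    """Weighted combination L = L_XE + lambda * L_MM."""
    return xe + lam * mm
```

The decoder would call `bilinear_attention` twice per step, once over the video encoder states and once over the chat encoder states, each with its own weight matrix, before the non-linear combination with the decoder state.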

Bidirectional Attention Flow (BiDAF)
The stronger version of our generative model extends the two-encoder-attention-decoder model above to add a bidirectional attention flow (BiDAF) mechanism (Seo et al., 2017) between the video and chat encoders, as shown in Fig. 7. Given the hidden states $h^v_i$ and $h^u_j$ of the video and chat encoders at time steps $i$ and $j$, the final hidden states after the BiDAF are $\hat{h}^v_i = [h^v_i; c^{v \leftarrow u}_i]$ and $\hat{h}^u_j = [h^u_j; c^{u \leftarrow v}_j]$ (similar to those described in Sec. 4.2.2), respectively. Now, the decoder attends over these final hidden states, and the rest of the decoder process is similar to Sec. 4.3.1 above, including the weighted joint cross-entropy and max-margin loss.

Experimental Setup
Evaluation: We first evaluate both our discriminative and generative models using retrieval-based recall@k scores, which are a concrete metric for such dialogue generation tasks (Lowe et al., 2015). For our discriminative models, we simply rerank the given responses (in a candidate list of size 10, based on 9 negative examples; more details below) in the order of the probability score each response gets from the model. If the positive response is within the top-k list, then the recall@k score is 1, otherwise 0, following previous Ubuntu-dialogue work (Lowe et al., 2015). For the generative models, we follow a similar approach, but the reranking score for a candidate response is based on the log probability score given by the generative models' decoder for that response, following the setup of previous visual-dialog work (Das et al., 2017). In our experiments, we use recall@1, recall@2, and recall@5 scores. For completeness, we also report the phrase-matching metric scores: METEOR (Denkowski and Lavie, 2014) and ROUGE (Lin, 2004) for our generative models. We also present human evaluation.
Table 3: Performance of our baselines, discriminative models, and generative models for recall@k metrics on our Twitch-FIFA test set. C and V represent chat and video context, respectively.
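The recall@k computation described above is simple enough to state in a few lines. The sketch below assumes each test instance provides one score per candidate (a probability for discriminative models, a decoder log-probability for generative ones) together with the index of the positive response; the function names are illustrative.

```python
from typing import List, Sequence


def recall_at_k(scores: Sequence[float], positive_index: int, k: int) -> int:
    """Return 1 if the positive candidate is among the k highest-scoring ones, else 0."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return int(positive_index in ranked[:k])


def mean_recall_at_k(all_scores: List[Sequence[float]],
                     positives: List[int], k: int) -> float:
    """Average recall@k over a set of candidate lists."""
    hits = [recall_at_k(s, p, k) for s, p in zip(all_scores, positives)]
    return sum(hits) / len(hits)
```

With a candidate list of size 10, a model that scores candidates at random would be expected to reach about 10% recall@1, which is the chance level referred to in the results.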
Training Details: For negative samples, during training, for every positive triple (video, chat, response) in the training set, we sample 3 random negative triples. For validation/test, we sample 9 random negative responses from elsewhere in the validation/test set. Also, the negative samples do not come from the video corresponding to the positive response. More details of negative samples and other training details (e.g., dimension/vocab sizes, visual feature details, validation-based hyperparameter tuning and model selection) are discussed in the supplementary.
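The negative-response sampling for validation/test can be sketched as follows, under the assumption that each instance is stored as a (video id, response) pair; that representation, the function name, and the fixed seed are hypothetical choices made for this illustration.

```python
import random
from typing import List, Tuple


def sample_negative_responses(instances: List[Tuple[str, str]], index: int,
                              n: int = 9, seed: int = 0) -> List[str]:
    """Sample n responses that come from videos other than that of instances[index].

    instances[i] is a (video_id, response) pair. random.sample raises
    ValueError if fewer than n responses from other videos are available.
    """
    video_id = instances[index][0]
    pool = [resp for vid, resp in instances if vid != video_id]
    return random.Random(seed).sample(pool, n)
```

Combining the positive response with the 9 sampled negatives gives the 10-sized candidate list used for recall@k.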

Results and Analysis

Human Evaluation of Dataset
First, the overall human quality evaluation of our dataset (shown in Table 1) demonstrates that it contains 90% responses relevant to video and/or chat context. Next, we also do a blind human study on the recall-based setup (on a set of 100 samples from the validation set), where we anonymize the positive response by randomly mixing it with 9 tricky negative responses in the retrieval list, and ask the user to select the most relevant response for the given video and/or chat context. We found that human performance on this task is around 55% recall@1, demonstrating that this 10-way-discriminative recall-based task setup is reasonably challenging for humans. It also shows that there is a lot of scope for future model improvements: the chance baseline is only 10% and the best-performing model so far (see Sec. 6.3) achieves only 22% recall@1 (on the dev set), which leaves a large 33% gap to human performance.

Baseline Results
Table 3 displays all our primary results. We first discuss results of our simple non-trained and trained baselines (see Sec. 4.1). The 'Most-Frequent-Response' baseline, which just ranks the 10-sized response retrieval list based on their frequency in the training data, gets only around 10% recall@1. Our other non-trained baselines, 'Chat-Response-Cosine' and 'Nearest-Neighbor', which rank the candidate responses based on (Twitch-trained RNN encoder's vector) cosine similarity with the chat context and with the K-best training contexts' response vectors, respectively, achieve slightly better scores. We also show that our simple trained baselines (logistic regression and Naive Bayes) achieve relatively low scores, indicating that a simple, shallow model will not work on this challenging dataset.

Discriminative Model Results
Next, we present the recall@k retrieval performance of our various discriminative models in Table 3: dual encoder (chat context only), dual encoder (video context only), triple encoder, and TriDAF model with self-attention. Our dual encoder models are significantly better than random choice and all our simple baselines above, and further show that they have complementary information because using both of them together (in 'Triple Encoder') improves the overall performance of the model. Finally, we show that our novel TriDAF model with self-attention performs significantly better than the triple encoder model. [9]

Generative Model Results
Next, we evaluate the performance of our generative models with both retrieval-based recall@k scores and phrase matching-based metrics as discussed in Sec. 5 (as well as human evaluation). We first discuss the retrieval-based recall@k results in Table 3. Starting with a simple sequence-to-sequence attention model with video only, chat only, and both video and chat encoders, the recall@k scores are better than all the simple baselines. Moreover, using both video+chat context is again better than using only one context modality. Finally, we show that the addition of the bidirectional attention flow mechanism improves the performance in all recall@k scores. Note that generative model scores are lower than the discriminative models on the retrieval recall@k metric, which is expected (see discussion in previous visual dialogue work (Das et al., 2017)), because discriminative models can tune to the biases in the response candidate options, but generative models are more useful for real-world tasks such as generation of novel responses word-by-word from scratch in Siri/Alexa/Cortana style applications (whereas discriminative models can only rank the pre-given list of responses). We also evaluate our generative models with phrase-level matching metrics: METEOR and ROUGE-L, as shown in Table 4. Again, our BiDAF model is statistically significantly better than the non-BiDAF model on both METEOR (p < 0.01) and ROUGE-L (p < 0.02) metrics. Since dialogue systems can have several diverse, non-overlapping valid responses, we consider a multi-reference setup where all the utterances in the 10-sec response window are treated as valid responses.
[9] Statistical significance of p < 0.01 for recall@1, based on the bootstrap test (Noreen, 1989; Efron and Tibshirani, 1994).

Human Evaluation of Models
Finally, we also perform human evaluation to compare our top two generative models, i.e., the video+chat seq2seq with attention and its extension with BiDAF (Sec. 4.3), based on a 100-sized sample. We take the generated responses from both these models, and randomly shuffle these pairs to anonymize model identity. We then ask two annotators (for 50 task instances each) to score the responses of these two models based on relevance. Note that the human evaluators were familiar with Twitch FIFA-18 video games and also with Twitch's unique set of chat mannerisms and emotes. As shown in Table 5, our BiDAF based generative model performs better than the non-BiDAF one, which is already quite a strong video+chat encoder model with attention.

Negative Training Pairs
We also compare the effect of different negative training triples that we discussed in Sec. 5. Table 6 compares using one negative training triple (with just a negative response) vs. three negative training triples (one with negative video context, one with negative chat context, and another with negative response), showing that using the 3-negative examples setup is substantially better.
Table 7 shows the performance comparison between the classification loss and max-margin loss on our TriDAF with self-attention discriminative model (Sec. 4.2.2). We observe that max-margin loss performs better than the classification loss, which is intuitive because max-margin loss tries to differentiate between positive and negative training example triples.
Figure 9: Attention visualization: generated word 'goal' in response is intuitively aligning to goal-related video frames (top-3-weight frames highlighted) and context words (top-10-weight words highlighted).

Generative Loss Functions
For our best generative model (BiDAF), Table 8 shows that using a joint loss of cross-entropy and max-margin is better than using only cross-entropy loss optimization (Sec. 4.3.1). Max-margin loss provides knowledge about the negative samples for the generative model, and hence improves the retrieval-based recall@k scores.

Attention Visualization and Examples
Finally, we show some interesting output examples from both our discriminative and generative models, as shown in Fig. 8. Fig. 9 visualizes that our models can learn some correct attention alignments from the generated output response word to the appropriate (goal-related) video frames as well as chat context words.

Conclusion
We presented a new game-chat based video-context, many-speaker dialogue task and dataset. We also presented several baselines and state-of-the-art discriminative and generative models on this task. We hope that this testbed will be a good starting point to encourage future work on the challenging video-context dialogue paradigm.
In future work, we plan to investigate the effects of multiple users, i.e., the multi-party aspect of this dataset. We also plan to explore advanced video features such as activity recognition, person identification, etc.