A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses

In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial to understanding their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents' verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and that producing non-verbal responses is as challenging a task as producing verbal responses. Our dataset is publicly available.1

Figure 1: Example of the proposed VFD dataset. "U", "V", and "N" denote a human utterance, the agent's verbal response, and the agent's non-verbal response (i.e., action), respectively. All utterances and responses are in Japanese; English translations are shown for easier understanding. The red line links the eyes to the gaze location.
U: これのＬはないのかしら (I wonder if there is an L for this.)
V: 同じ服がたくさんあるからどれかはLじゃないかな (We have a lot of the same clothes, so I'm guessing one of them is an L.)
N: 同じ服のサイズをチェックする (Check out the same clothing size.)

1 https://randd.yahoo.co.jp/en/softwaredata

Although these studies and resources have been shown to be useful, there are currently two limitations. First, in image-grounded dialogue tasks,
human speakers do not appear in the agents' vision because images are used as the topic of conversation, and the speakers are required to discuss the input image. However, in real-world dialogue scenarios, first-person visual information about where the human speaker is and what they are paying attention to is crucial for agents to understand human intentions. Consider the example in Figure 1: without the first-person image, it is difficult for the agent to recognize that the pronoun "this" in the human utterance (U) refers to the yellow article of clothing rather than any of the other products (e.g., the brown clothes).
Second, although previous studies considered non-verbal input information (e.g., human facial expressions), they did not consider the agents' non-verbal responses (i.e., actions). Non-verbal responses often play an important role in dialogue systems. For example, a museum tour-guide robot should use non-verbal gestures to better explain things to the audience. Even in ordinary conversation, non-verbal responses such as "making a smile" or "helping to lift luggage" are often crucial for social interactions in conjunction with verbal responses.

Thus, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. As shown in Figure 1, the VFD dataset comprises (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents' verbal and non-verbal responses to the utterances. Here, human utterances and agents' verbal and non-verbal responses were manually annotated for first-person images (with eye-gaze locations) in the GazeFollow dataset (Recasens et al., 2015) using crowdsourcing with carefully-designed settings, resulting in 308K verbal and 81K non-verbal dialogues. This paper also presents experimental results obtained using the VFD dataset and recent neural network models, e.g., BERT (Devlin et al., 2019) and ResNet (He et al., 2016).
Our primary contributions are summarized as follows. (1) We present a new multimodal dialogue dataset that contains visually-grounded first-person dialogues with human speakers' eye-gaze locations.
(2) We provide manually-annotated non-verbal responses of agents, which are often crucial for social communication in the real world. (3) Our experimental results demonstrate that first-person vision helps recent neural network models understand human intentions accurately and that producing non-verbal responses is as challenging as producing verbal responses. Table 1 summarizes the related visually-grounded dialogue datasets.

Related Work
Several multimodal dialogue datasets have investigated task-oriented situations. For example, the MMD dataset (Saha et al., 2018) contains dialogues between shoppers of fashion products and sales agents, and the TalkTheWalk dataset (de Vries et al., 2018) aims to guide tourists to their destinations. In the VisDial (Das et al., 2017) and AVSD (Alamri et al., 2019) datasets, an agent must answer questions about an input image (or video) given the dialogue history. Unlike these datasets, which only work in limited scenarios, we aim to cover both task-oriented and non-task-oriented dialogue systems.
As shown in Table 1, the IGC dataset (Mostafazadeh et al., 2016), like our VFD dataset, assumes both task-oriented and non-task-oriented situations. However, in IGC, images are used as a conversation topic, and the human speakers do not appear in the agents' vision. In contrast, the VFD dataset contains dialogues based on "first-person" images (and eye-gaze information), which are useful for figuring out where the human speaker is and what he or she is focusing on.
Like our VFD dataset, the SDG dataset (Hu et al., 2016) contains dialogues with non-verbal actions. However, SDG focuses on gestures (or body language), e.g., "making a cup shape with the right hand", which are categorized into 271 gesture classes. In contrast, the VFD dataset represents non-verbal responses as text (typically sentences) to cover a wider range of actions, e.g., "Check out the same clothing size" and "Buy one of the pumpkins a girl has".
In addition, our VFD dataset is large compared to the other datasets: it is twice the size of the MMD dataset and approximately 75 times the size of the IGC dataset. The IGC dataset is small because it provides only validation and test sets.

Task Definition
In this paper, we define the visually-grounded first-person dialogue task as producing an utterance or taking an action given a human utterance and the agent's first-person vision.
Formally, the input to the system can be represented as a tuple of a human utterance u and the agent's first-person vision v. The first-person vision v is assumed to be used to understand human intentions. Thus, v can be factorized into a first-person image i and more explicit visual hints for the human intentions g, i.e., v = (i, g). We use eye-gaze locations for the explicit hints g. For the input triplet (u, i, g), an agent is assumed to produce a verbal response r_v and a non-verbal response (i.e., action) r_n. Here, we use textual descriptions to represent non-verbal responses, as shown in Figure 1. The VFD dataset can thus be interpreted as a collection of quintuples {(u, i, g, r_v, r_n)}. We describe how we collected these five elements in the following.
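For concreteness, one quintuple can be sketched as a small data class; the field names and the sample values are illustrative (not from a released data loader):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class VFDInstance:
    """One quintuple (u, i, g, r_v, r_n) from the VFD dataset.

    Field names are our own; the released data may use different keys.
    """
    utterance: str              # u: human utterance (Japanese)
    image_path: str             # i: the agent's first-person image
    eye: Tuple[float, float]    # (x, y)_eye: speaker's eye location
    gaze: Tuple[float, float]   # (x, y)_gaze: speaker's gaze location
    verbal_response: str        # r_v: agent's verbal response
    nonverbal_response: str     # r_n: agent's action, as text

# The example from Figure 1 (coordinates are made up):
example = VFDInstance(
    utterance="I wonder if there is an L for this.",
    image_path="images/000001.jpg",
    eye=(0.42, 0.31),
    gaze=(0.58, 0.64),
    verbal_response="We have a lot of the same clothes, "
                    "so I'm guessing one of them is an L.",
    nonverbal_response="Check out the same clothing size.",
)
```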

Dataset Construction
First-person Images & Eye-gaze Locations. We used the 34,775 first-person images with eye-gaze annotations in the GazeFollow dataset (Recasens et al., 2015). Here, eye-gaze locations are represented as coordinate pairs (x, y)_eye and (x, y)_gaze. In Figure 1, the eye location and the gaze location are linked by a red line.
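The eye and gaze coordinates can be packed into a single feature vector, for instance as follows; normalizing by the image size is our assumption for illustration, not a detail specified by the dataset:

```python
def gaze_vector(eye, gaze, width, height):
    """Pack the eye and gaze coordinates into a 4-dimensional vector.

    Normalizing by the image size makes the feature independent of
    resolution; this layout (eye first, then gaze) is an assumption.
    """
    (xe, ye), (xg, yg) = eye, gaze
    return [xe / width, ye / height, xg / width, yg / height]

v_g = gaze_vector(eye=(320, 180), gaze=(480, 360), width=640, height=480)
# v_g == [0.5, 0.375, 0.75, 0.75]
```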
Human Utterances. Following Le Minh et al. (2018), who collected English utterances for first-person images using Amazon Mechanical Turk (AMT), we first translated their English instructions into Japanese. Then, we used Yahoo! Crowdsourcing, a crowdsourcing platform similar to AMT operated by Yahoo Japan Corporation. It can safely be assumed that Yahoo! Crowdsourcing participants are proficient in Japanese because such proficiency is required to sign up, navigate the user interface, and participate in the microtask market. In the annotation instructions, we showed an image with a single person marked with a red dot and asked the participants to imagine that this person is speaking. We then asked the participants to submit what they think the speaker is likely saying.
The following notes were included in the instructions to avoid unexpected or trivial annotations.

Note 1: "Never use the same lines again. Please write a different sentence every time."
Note 2: "Do not put a commentary from a third-party perspective."
Note 3: "Please do not write something people would not usually say in this situation. Please avoid lines that contain abuse and prejudice, words likely to cause a quarrel, and an over-familiar tone. Please do not assume that the talking person has an extreme personality. As this is not a comedy, you do not have to write a funny line."

Verbal and Non-verbal Responses. The participants were shown the images and the utterances collected in the previous step. Then, they were asked to enter what to say (i.e., a verbal response) and what to do (i.e., a corresponding non-verbal action). To focus on dialogues requiring visual grounding, we also gave the participants the following instruction: "Whenever possible, please try to use some additional information found in the image to frame your response, so that your response is not entirely predictable from the utterance." We also asked the participants to enter a special dummy response "x" if it was inappropriate to respond. For a single utterance, five participants were asked to enter a response and an action. After conducting this pilot task, we examined the results and selected promising participants (comprising a whitelist) for future task requests. Only participants on the whitelist could perform the next task. We also used the whitelist from our previous study for text entry tasks. We repeated this selection process until the final whitelist included approximately 1,600 participants. Approximately 200-250 of these participants regularly participated in the actual collection task. Note that we allocated tasks in small batches over the course of a few months to prevent participants from working long hours.
Despite the above measures, however, the resulting 327,884 data instances contained noisy or trivial responses. To eliminate such undesirable responses automatically, we manually created a list of erroneous patterns via visual inspection and removed the responses matching these patterns. The dummy responses "x" were also removed. Finally, the total numbers of verbal and non-verbal responses were 308,793 and 81,867, respectively. The gap between the numbers of verbal and non-verbal responses is due to the fact that non-verbal responses contained more dummy responses than verbal responses.

Quality Evaluation. To assess the quality of the resulting dataset, we qualitatively inspected 1,000 randomly-sampled data instances. Of those 1,000 samples, only one was clearly as bad as spam. In addition, slightly inappropriate samples accounted for only 2% of the total. Therefore, we consider the quality of the VFD dataset sufficient for our purposes.
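The pattern-based cleanup can be sketched as below. The actual pattern list was built manually and is not reproduced in the paper, so the patterns here are hypothetical stand-ins:

```python
import re

# Hypothetical erroneous patterns; the real list was compiled by hand
# via visual inspection of the collected responses.
ERROR_PATTERNS = [
    re.compile(r"^x$"),        # dummy "do not respond" marker
    re.compile(r"^\s*$"),      # empty responses
    re.compile(r"(.)\1{9,}"),  # long character repetitions (spam-like)
]

def clean_responses(responses):
    """Drop responses matching any known erroneous pattern."""
    return [r for r in responses
            if not any(p.search(r) for p in ERROR_PATTERNS)]

kept = clean_responses(["x", "Sure, I'll check the size.", "wwwwwwwwwwww"])
# kept == ["Sure, I'll check the size."]
```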
Among the 2% noisy samples, we found the following erroneous patterns. For the utterances, some were attributed to the person who took the photo rather than the person appearing in the photo, and one utterance was very comedic. For the images, there were two images without people (e.g., a mannequin or food), one image that did not show the speaker's face, and one image that showed many people; these errors mainly stem from the original GazeFollow dataset (Recasens et al., 2015). For the verbal and non-verbal responses, one response ignored the human utterance; some responses ignored the images or were not from the robot's perspective; and some responses were offensive to the speakers. Some non-verbal responses were not actionable, e.g., "Nice shot!" and "That's tough." We did not remove these noisy samples in the current version because it is difficult to remove them all automatically and they represent only 2% of the total.

Dataset Analysis
We perform a more detailed analysis of the VFD dataset.
We first explore the topical diversity of the dataset. Specifically, we use a Japanese BERT model pretrained on Japanese Wikipedia from Hugging Face's Transformers library (Wolf et al., 2019) and project each word in the dialogue text (i.e., utterance, verbal response, and non-verbal response) to a 768-dimensional vector. Then, we average the word embeddings to obtain a vector representation of the dialogue text (utterance + two responses). Finally, we use agglomerative clustering (Karypis et al., 2000) to obtain 70 clusters for the dialogues. We select 4 of the 70 clusters and visualize them by principal component analysis (PCA), as shown in Figure 2. These 4 clusters, i.e., food, photo, music, and sports, represent typical dialogue topics in the VFD dataset. Figure 2 shows that the dialogue topics are widely distributed in the VFD dataset.

We also calculate the linguistic statistics of the texts. Here, we use the MeCab morphological analyzer (Kudo et al., 2004) to tokenize the dialogue text. Table 2 summarizes the results. The average numbers of tokens (i.e., text lengths) in the utterances, verbal responses, and non-verbal responses are 7.6, 6.8, and 3.5, respectively. The numbers of unique words (i.e., vocabulary sizes) in the utterances, verbal responses, and non-verbal responses are 13,352, 23,880, and 7,711, respectively. These facts imply that verbal responses tend to be diverse, which is desirable for training well-generalized machine learning models. In contrast, the textual descriptions of the non-verbal responses are much simpler than the utterances and verbal responses, which is desirable when building a model that performs actual actions from the textual description of a given non-verbal response.
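The dialogue representation used for clustering, i.e., averaging per-word vectors, can be sketched as follows. The embedding lookup is stubbed with random vectors, and whitespace splitting stands in for MeCab tokenization; in the actual analysis the vectors come from a pretrained Japanese BERT:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # hidden size of the Japanese BERT model

# Stub embedding lookup; the analysis in the paper uses contextual
# vectors from a pretrained Japanese BERT instead of this table.
_vocab = {}
def word_vector(token):
    if token not in _vocab:
        _vocab[token] = rng.standard_normal(DIM)
    return _vocab[token]

def dialogue_vector(utterance, verbal, nonverbal):
    """Average all word vectors of the utterance and the two responses
    to obtain one vector per dialogue (utterance + two responses)."""
    # Whitespace split stands in for MeCab tokenization.
    tokens = f"{utterance} {verbal} {nonverbal}".split()
    return np.mean([word_vector(t) for t in tokens], axis=0)
```

The resulting 768-dimensional dialogue vectors are what agglomerative clustering and PCA operate on.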

Comparison
Here, we emphasize the characteristics of the VFD dataset by comparing it to the IGC dataset (Mostafazadeh et al., 2016), which is the most similar existing dataset. Figure 3 compares two examples each from the VFD dataset (left) and the IGC dataset (right). In the IGC dataset, an image is used as a topic of conversation, and the human speaker does not appear in the agent's vision. In contrast, our VFD dataset uses an image taken from the agent's first-person camera as a dynamic visual environment. In addition, the VFD dataset contains manually-annotated non-verbal responses and human eye-gaze locations.

Task Setting
In this section, we perform experiments on the task of selecting a verbal response and a non-verbal response from candidate response sets, given a human utterance, a first-person image, and eye-gaze locations. Although it is possible to train a response generator using the VFD dataset, we chose the selection task for ease of evaluation. It is worth noting that, in our experiments, the eye-gaze locations are given as oracle input during validation and testing. In the real world, this information can be obtained using automatic gaze-estimation techniques developed in computer vision (Chong et al., 2018; Wei et al., 2018).

Data
For the verbal response selection task, the VFD dataset is split into training, validation, and test sets containing 569K, 12K, and 12K dialogues, respectively. For the non-verbal response selection task, the training, validation, and test sets consist of 151K, 3K, and 3K dialogues. The images are completely separated across the training/validation/test sets. For the training data, we sample negative responses randomly from the training set and fix them throughout the epochs. For the validation and test data, we use the same negative samples across the models for a fair comparison. The data splits and the negative samples used for validation and testing will be provided along with the VFD dataset.
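A minimal sketch of fixed negative sampling, assuming negatives are drawn uniformly and frozen with a fixed seed (the exact procedure is our assumption for illustration):

```python
import random

def fixed_negatives(dataset, n_neg=9, seed=42):
    """Sample negative responses once with a fixed seed so that every
    model sees the same candidates, as done for the validation and
    test splits. n_neg=9 yields 10 candidates (1 true + 9 negatives).
    dataset is a list of (utterance, true_response) pairs."""
    rng = random.Random(seed)
    responses = [resp for _, resp in dataset]
    negatives = {}
    for idx, (_, true_resp) in enumerate(dataset):
        pool = [r for r in responses if r != true_resp]
        negatives[idx] = rng.sample(pool, min(n_neg, len(pool)))
    return negatives
```

Freezing the sampled negatives means every model is ranked against an identical candidate set, which keeps Recall@k numbers comparable across models.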

Metrics
Following Lowe et al. (2015), we use Recall@k (denoted R_n@k) for response-selection evaluation. Here, the model selects the k most likely responses from n available candidates. Note that only one response among the n candidates is true, and the others are sampled randomly from the same set. The prediction is correct if the true response is among the top-k list. We report R_10@1, R_10@2, R_10@5, and R_2@1.

Baseline Models

Figure 4 shows the architecture of the baseline models. We follow the ranking strategy of Lowe et al. (2015) to develop the baseline neural network models for our selection-based dialogue task. That is, the response-selection problem in our experiments is to find a verbal (or non-verbal) response with the highest score for an input triplet x = (u, i, g), i.e.,

    r^ = argmax_{r in C} Score(x, r),    (1)

where C denotes the set of candidate responses and Score(x, r) ∈ R denotes a real-valued score of the response r for the input utterance u, input image i, and input eye-gaze locations g.

Figure 4: Overview of the baseline architecture for response scoring. Given an input triplet of (utterance u, image i, eye-gaze locations g) and a candidate response r, the baseline model calculates the matching score between the input and the response.

We define the scoring function in Eq. (1) as follows:

    Score(x, r) = v_r^T (W v_x + b),    (2)

where v_x and v_r denote the feature vectors for x = (u, i, g) and r, and W and b are a weight matrix and a bias vector, respectively. We first apply two neural encoders, f_u and f_i, to extract feature vectors from the input utterance u and the input image i:

    v_u = f_u(u),    v_i = f_i(i).

We also represent the coordinates of the eye-gaze locations as a four-dimensional vector v_g ∈ R^4. We concatenate these feature vectors to obtain v_x:

    v_x = [v_u; v_i; v_g],

where [ · ; · ] denotes concatenation of vectors. The feature vector of a candidate response r is calculated using a separate text encoder f_r: v_r = f_r(r).
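A minimal numpy sketch of the scoring and selection, assuming the scorer takes the bilinear-with-bias form Score(x, r) = v_r^T (W v_x + b), which is our reading of the definitions above; all dimensions are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
D_X, D_R = 8, 6  # toy dimensionalities for v_x and v_r
W = 0.1 * rng.standard_normal((D_R, D_X))  # weight matrix
b = 0.1 * rng.standard_normal(D_R)         # bias vector

def score(v_x, v_r):
    """Score(x, r) = v_r^T (W v_x + b): matching score of response r."""
    return float(v_r @ (W @ v_x + b))

def select(v_x, candidate_vecs):
    """Return the index of the highest-scoring candidate response."""
    scores = [score(v_x, v_r) for v_r in candidate_vecs]
    return int(np.argmax(scores)), scores

# v_x = [v_u; v_i; v_g]: concatenated utterance, image, and gaze features.
v_u, v_i = rng.standard_normal(3), rng.standard_normal(3)
v_g = rng.standard_normal(2)
v_x = np.concatenate([v_u, v_i, v_g])  # 8-dimensional, matching D_X
candidates = [rng.standard_normal(D_R) for _ in range(10)]
best, scores = select(v_x, candidates)
```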
For training, we minimize the binary cross-entropy loss by applying a sigmoid function to the predicted scores.
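The training objective can be sketched for a single (score, label) pair as follows; in practice the loss is averaged over a mini-batch:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def bce_loss(score, label):
    """Binary cross-entropy on a sigmoid-ed matching score.
    label is 1 for the true response and 0 for a sampled negative."""
    p = sigmoid(score)
    eps = 1e-12  # guard against log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))
```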
In the following, we describe the text encoders (i.e., f_u and f_r) and the image encoder (i.e., f_i) used in our experiments.
Text Encoder: We employ two neural network variants for encoding utterances and responses: Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019). With the LSTM model, we use the last hidden state as the utterance or response features. With the BERT model, we insert a [CLS] token before and a [SEP] token after the utterance (or response) and use the hidden state of the [CLS] token in the last layer of BERT as the feature vector. It is worth noting that we develop two different text encoders for f_u and f_r, which are optimized during training.
Image Encoder: We employ two widely used neural network models for image encoding, both of which have proven effective for image classification: VGGNet (Simonyan and Zisserman, 2015) and ResNet (He et al., 2016). We use the 16-layer VGGNet and replace the linear layer named fc6 with a learnable linear layer whose output dimensionality is 4096; we use the resulting 4096-dimensional vector as the image features. We also use the 50-layer ResNet, where the output of the last fully connected layer serves as the image features.

Other Settings
We used the Adam optimizer (Kingma and Ba, 2015) for training. The learning rate was fixed at 0.0001, and the mini-batch size was fixed at 64. Training was terminated when the validation accuracy dropped more than 1.5 points below the highest validation accuracy observed so far. Training typically converged in approximately 3 days for the verbal response selection task and 1 day for the non-verbal response selection task on an Nvidia GeForce GTX 1080 GPU. For the LSTM-based text encoding, we used MeCab (Kudo et al., 2004) for tokenization and fastText (Bojanowski et al., 2017) word embeddings pretrained on Japanese Common Crawl and Wikipedia articles. The word-embedding and LSTM dimensions were set to 300 and 100, respectively. For the BERT-based encoding, we used the model named "bert-base-japanese-whole-word-masking" from Hugging Face's Transformers library (Wolf et al., 2019), which was pretrained on Japanese Wikipedia using whole-word masking. For data augmentation, we applied random cropping, random horizontal flipping, and normalization transformations to the original images during training. The baseline models were trained separately for the verbal and non-verbal response selection tasks.
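The early-stopping rule can be sketched as a simple check over the validation-accuracy history:

```python
def should_stop(val_acc_history, tolerance=1.5):
    """Terminate training when the latest validation accuracy has
    dropped more than `tolerance` points below the best so far."""
    if not val_acc_history:
        return False
    return max(val_acc_history) - val_acc_history[-1] > tolerance

stop = should_stop([50.0, 52.3, 51.2])  # 52.3 - 51.2 = 1.1 <= 1.5
```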

Table 3: Comparison of the baseline models in the verbal and non-verbal response selection tasks (R_10@1, R_10@2, R_10@5, and R_2@1 for each). U, I, and G denote that we use utterances, images, and eye-gaze locations as inputs, respectively. First-person images and eye-gaze locations improve the performance for almost all encoder combinations.

Quantitative Results
We report the evaluation scores of the baseline models in the verbal and non-verbal response selection tasks; Table 3 summarizes the results. U, I, and G denote that we use utterances, images, and eye-gaze locations as inputs, respectively. For almost all encoder combinations (e.g., BERT × VGGNet), first-person images improve the verbal and non-verbal response-selection performance by up to 5.6 points (see U vs. U+I). In addition, especially when using BERT, eye-gaze locations always further improve the performance by up to 1.4 points (see U+I vs. U+I+G). These results indicate that eye-gaze information from the agents' first-person perspective is effective for understanding human intentions.
Overall, the BERT scores are higher than the LSTM scores for all input variations (U, U+I, U+I+G), which is consistent with results in other NLP tasks. As for the image encoders, VGGNet achieves higher scores than ResNet, which is often observed in multimodal tasks (Wang et al., 2017; Ouyang et al., 2017; Yudistira and Kurita, 2017). BERT × VGGNet using all the input modalities achieves the highest R_10@1 score of 53.6%.
Interestingly, the best R_10@1 score for non-verbal response selection is approximately 7 points worse than that for verbal response selection. This indicates that producing non-verbal responses is more difficult than producing conventional verbal responses, and there is room for improvement.
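For reference, the R_n@k metric used above can be computed as a plain-Python sketch:

```python
def recall_at_k(scores, true_index, k):
    """R_n@k: 1 if the true response is among the k highest-scoring
    of the n candidates, and 0 otherwise. Averaging this indicator
    over the test set gives the reported percentage."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return int(true_index in ranked[:k])

scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.4, 0.6, 0.15, 0.7]
# The true response (index 3, score 0.8) is ranked 2nd of 10,
# so R_10@1 = 0 and R_10@2 = 1 for this instance.
```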

Qualitative Analysis
Here, we inspect the verbal and non-verbal responses selected by the baseline model, BERT × VGGNet. Figure 5 (a) shows the selected verbal responses. The selected non-verbal responses are shown in Figure 5 (b). The other examples can also be found in the supplemental material.
In the leftmost example of Figure 5 (a), the model cannot understand what the pronoun "this" in the human utterance refers to without the image. By using the image information (U+I), the model wrongly focuses on the human face in the image and responds, "You have a funny face." By using the eye-gaze locations (U+I+G), the model understands that the person is paying attention to the green apple and succeeds in finding the correct response.
In the second example from the left in Figure 5 (a), the human utterance, "What do you think?", is too ambiguous. With the image information (U+I), the model again wrongly focuses on the speaker, as in the previous example. The eye-gaze locations (U+I+G) allow the model to understand that the speaker is asking about the painting and to find the correct response.
The right two examples in Figure 5 (a) show failure cases. In the third example from the left, it is difficult for the model to select the correct response, "You can't do that with your bare hands", because it requires the world knowledge that fish are hard to catch without tools. In the rightmost example, the woman's gaze is on the computer, which wrongly leads the model to focus on the computer instead of the adjacent "hospital room".

Similar phenomena can be observed for non-verbal response selection. In the left two examples in Figure 5 (b), it is hard to identify the human intentions from the utterances alone. The images (U+I) provide important contextual information, but this is still not sufficient to properly understand the intentions of the utterances. The eye-gaze locations (U+I+G) enable the models to identify the human intentions and respond more accurately. For instance, in the second example from the left, it is hard to understand what the man is doing due to the mess in the room; however, looking at the tip of the man's gaze reveals that he is cutting vegetables with a kitchen knife. Eye-gaze information works particularly well in such cases where many objects are present.
We also show failure examples for non-verbal response selection. We consider the third example from the left difficult because the agent has to act thoughtfully, e.g., by preparing a jacket in advance. In the rightmost example, the man is asking someone to keep an eye on the dog; however, he is not looking at it, so his gaze appears to have a negative effect.
In summary, we found that first-person images and eye-gaze information are effective in the following cases: (1) when the utterance is ambiguous, e.g., when it contains demonstrative pronouns such as "this", and (2) when there are many objects in the image, making it difficult to identify what the speaker is talking about. Both cases are very common in everyday conversation. Thus, we consider that developing social robots that exploit first-person visual information, including gaze, would be effective and beneficial in real-world applications.

Conclusion
In this paper, we have presented the VFD dataset with verbal and non-verbal responses. We manually annotated 308K human utterances together with 308K verbal and 81K non-verbal agent responses, which are grounded in the agents' first-person images with human eye-gaze locations. Our experiments on the response selection tasks confirmed the value of the first-person view; however, the task (especially non-verbal response production) remains challenging, and improvements are required.

A Supplemental Material
Here, we show additional examples for verbal and non-verbal responses selected by the baseline model, BERT × VGGNet.
What these examples have in common is that the intentions of the utterances are ambiguous in isolation, which is common in everyday conversation. For instance, in the leftmost example of Figure 6 (a), it is hard for machines to identify what the pronoun "this" refers to.
We show four successful examples on the left side of Figure 6 (a) and (b). By using first-person visual information (U+I or U+I+G), the models can understand the intentions correctly. For instance, in the leftmost example of Figure 6 (a), the model correctly understands that the speaker is asking about his playing. In the second example from the left in Figure 6 (a), the visual information allows the model to understand that the speaker is talking about the golf game. Likewise, in the left two examples in Figure 6 (b), the models successfully utilize the visual information to understand the human intentions.
We also show four failure examples on the right side of Figure 6 (a) and (b). In the third example from the left in Figure 6 (a), it is difficult to choose the ground-truth response (V*) because the human speaker is watching TV and talking about it, whereas the ground-truth response is about the laptop on the desk. In the rightmost example in Figure 6 (a), the visual information is not useful because the utterance is not sufficiently related to the given image. The third example from the left in Figure 6 (b) is also difficult because the agent needs the common knowledge that used trays must be put away before leaving a cafe. In the rightmost example in Figure 6 (b), we consider that the visual information wrongly leads the models to take actions related to specific information about the players or shoes, rather than the more general action of suggesting to buy the shoes.