MultiDM-GCN: Aspect-Guided Response Generation in Multi-Domain Multi-Modal Dialogue System using Graph Convolution Network

In the recent past, dialogue systems have gained immense popularity and have become ubiquitous. During conversations, humans not only rely on language but also seek contextual information through visual content. In every task-oriented dialogue system, the user is guided by the different aspects of a product or service, which steer the conversation towards selecting that product or service. In this work, we present a multi-modal conversational framework for a task-oriented dialogue setup that generates responses following the different aspects of a product or service to cater to the user's needs. We show that responses guided by the aspect information are more interactive and informative, enabling better communication between the agent and the user. We first create a Multi-domain Multi-modal Dialogue (MDMMD) dataset having conversations involving both text and images and belonging to three different domains, namely restaurants, electronics, and furniture. We implement a Graph Convolutional Network (GCN) based framework that generates appropriate textual responses from the multi-modal inputs. The multi-modal information, having both textual and image representations, is fed to the decoder along with the aspect information for generating aspect-guided responses. Quantitative and qualitative analyses show that the proposed methodology outperforms several baselines for the proposed task of aspect-guided response generation.


Introduction
Conversational systems have become ubiquitous in our everyday lives. Previous research suggests that conversational agents need to be more interactive and informative for building engaging systems (Takayama and Arase, 2019; Shukla et al., 2019). This line of research indicates that engaging conversations include visual cues (e.g., a video or images) or audio cues (e.g., the tone or pitch of the speaker). The information contained in these cues is often integral to the conversation. In Figure 1, we show an example of a conversation in which the visual cues, in the form of images, are crucial for better understanding and an interactive dialogue between the agent and the user. The appropriate responses to the user queries are highly dependent on the visual information pertaining to the different aspects of the various images in the conversation. Thus, it is natural to conclude that a conversational agent would be more effective if the visual information were part of its underlying conversational model. Multi-modality in goal-oriented dialogue systems (Saha et al., 2018) for the fashion domain has established the significance of visual information for effective communication between the user and the agent. Inspired by this work, we take a step forward by creating a multi-modal aspect-guided response generation framework for a multi-domain goal-oriented dialogue system. From Figure 1, it can be observed that the visual information about the aspects encourages improved communication and informative response generation by the agent with regard to the user queries.
In this paper, we propose the task of generating informative and interactive responses guided by the aspect information in a multi-modal dialogue system. Firstly, we create a high-quality multi-modal conversational dataset. Thereafter, we present a multi-modal graph convolutional network (GCN) that incorporates information from both textual and visual modalities to generate aspect-guided responses. We aim to create a generalized response generation framework for a multi-domain multi-modal dialogue system that is informative, interesting, aspect-guided, and logical. Hence, the main contributions of this work are: (i) We propose the task of aspect-guided response generation for interactive and informative responses in a multi-modal dialogue system. To the best of our knowledge, this is the first attempt to incorporate aspect information in multi-modal dialogue systems. (ii) We create a Multi-domain Multi-modal Dialogue (MDMMD) dataset comprising conversations with both text and images belonging to three different domains, namely restaurant, electronics, and furniture. (iii) We propose a multi-modal graph convolutional framework for response generation, explicitly providing the aspect information to the decoder to generate aspect-guided responses. (iv) The proposed model shows its effectiveness over several baselines in both automatic and human evaluation.

Related Work
Uni-modal Dialogue Systems
The effectiveness of deep learning has led to significant progress in dialogue generation. Deep neural frameworks, as shown in (Vinyals and Le, 2015; Shang et al., 2015), are very effective in modeling conversations. Hierarchical encoder-decoder systems were studied in (Sordoni et al., 2015; Serban et al., 2016, 2017) to preserve the dependencies among the utterances of a dialogue. Recently, memory networks (Madotto et al., 2018; Raghu et al., 2018; Reddy et al., 2019; Tian et al., 2019; Wu, 2019; Lin et al., 2019b) have been investigated to capture the contextual information in dialogues for generating responses. In task-oriented dialogues, hierarchical pointer networks (Raghu and Gupta) have been used to generate the responses. With the release of task-oriented dialogue datasets such as MultiWOZ (Budzianowski et al., 2018), a few works (Budzianowski and Vulić, 2019) have emerged that operate in a multi-domain dialogue setting. Meta-learning approaches (Mi et al., 2019; Qian and Yu, 2019) have been implemented on various datasets to improve domain adaptability for generating responses.

Multi-modal Dialogue Systems
Recently, research on dialogue systems has shifted towards integrating various modalities, such as images, audio, and video, along with text, to obtain the information needed to build robust frameworks. The research reported in (Das et al., 2017; Mostafazadeh et al., 2017; De Vries et al., 2017; Gan et al., 2019) has been effective in narrowing the gap between vision and language. Similarly, in (Le et al., 2019; Alamri et al., 2018; Lin et al., 2019a), the DSTC7 dataset has been used for response generation by incorporating audio and visual features. The release of the Multi-modal Dialog (MMD) dataset (Saha et al., 2018), containing conversations in the fashion domain with information from both text and images, has facilitated research on response generation (Agarwal et al., 2018a,b; Liao et al., 2018; Chauhan et al., 2019; Cui et al., 2019) in a multi-modal setup. Our newly designed framework differs from these existing ones, as our focus here is on creating an aspect-guided multi-modal dialogue dataset that covers three different domains. Our present work is distinguished from prior work on multi-modal dialogue systems in that we aim to generate responses conditioned on a particular aspect of the product or service in accordance with the conversational history.
Our research is novel with respect to the following two aspects: (i) it focuses on the task of aspect-controlled dialogue generation in a multi-modal setup; and (ii) we create a high-quality dataset that includes conversations belonging to multiple domains with both textual and image information.

Dataset
In this section, we describe the procedure of creating multi-modal dialogue data.

Data Creation Process
After close review and extensive discussions, we arrive at the following two top-level principles for domain selection: (i) the domain encompasses a broad group of task-oriented frameworks used by industries/service providers and is likely to require user-facing interfaces; (ii) the domain requires visual details for deeper comprehension and clarification of its services. Therefore, we choose to curate conversations belonging to three distinct domains in our newly established large-scale MDMMD dataset, namely restaurants, electronics, and furniture. The multi-modal dyadic dialogues were collected with the cooperation of a dedicated team of 15 domain experts for each domain. Given the various aspects of a product or service, the professionals from each domain demonstrated several dialogue flows during the selection and procurement of a specific product. The importance of the various aspects in the sale of a product was established, after which these domain details were integrated into different chat sessions to make the conversations seamless and free-flowing. The creation of the data involves the following key steps: (1) data gathering; (2) building large-scale multi-modal conversations that include both text and images, thus integrating the domain information into the interaction; and (3) aspect annotation.
1. Data Gathering Method: Through the experts' interactions, we identify the nuances of the different styles in a natural conversation for every domain; guided by the background knowledge, both the domain experts and the customers use this style information in their conversations. The necessary steps followed in this process are the following: (i) We crawl approximately 1 million products belonging to the different domains, such as food items, restaurants, electronics, and furniture, from different websites, together with the product images and semi-/un-structured information; (ii) the domain experts manually inspected the unstructured data according to the domain information and parsed the free text into a structured format; (iii) each selected domain was first closely observed, and then the aspect categories were listed to mark the aspect information. The different aspect categories, along with the associated aspect terms belonging to the different domains, are listed in Table 1.
2. Creating User-Agent Dialogues: The domain experts, who had detailed knowledge of their respective domains, along with crowd-sourced workers, were employed to build goal-oriented multi-modal conversations using a Wizard-of-Oz (WOZ) approach. For every conversation belonging to a particular domain, the domain experts assume the role of the system agent while the workers act as the customers. Different criteria for creating the conversations, such as the minimum length of the conversation, the number of aspect categories, the number of images in a response, the number of goals, the number of complex requests, etc., were specified to increase conversation diversity. At the implementation level, we establish a web interface for the experts and the workers that displays the instructions and the different aspect categories, along with the aspect terms belonging to a particular domain, next to the ongoing dialogue creation. This assists the participants in creating good conversations while referring to the guidelines and the different aspect information pertaining to a domain without interrupting the conversation. Though we follow a known approach (Wizard-of-Oz) for data creation, as done in existing works (Budzianowski et al., 2018; Peskov et al., 2019; Saha et al., 2018), our MDMMD dataset consists of more varied responses belonging to multiple domains and having both textual and visual modalities.
To the best of our knowledge, this dataset is novel in the sense that it is created under the full supervision of the experts, and we explicitly monitor and guide the workers participating in the process to create engaging, informative, and diverse conversations while focusing on the different aspects of a particular product/service. For example, in the restaurant domain, participants were advised to pretend that they were either interested in ordering food or looking for a fine place to dine. The different aspects associated with this domain, such as the type of cuisine (Chinese, Italian, Indian, etc.), type of restaurant (cafes, lounges, etc.), ambience, meal type (dinner, breakfast, etc.), and type of food (desserts, snacks, appetizers, etc.), are provided for creating diverse conversations. The participants were asked to change their preferences during the conversations (e.g., from Chinese they could shift to Italian food) to make them more challenging, realistic, and complex. Similarly, in the case of the other domains, participants were instructed to follow the guidelines and make use of the different aspect categories for creating diverse, interesting, and engaging conversations. 3. Aspect Annotation: Every utterance is labeled with its aspect category [AC] and aspect terms [AT]. Intuitively, the aspect category for a particular domain can be constant for a group of utterances, but the aspect terms in every utterance may or may not be consistent. For example, cuisine is the aspect category, but Chinese is the aspect term, which according to the user could change into Mexican or Japanese in the remaining utterances of a particular dialogue. Therefore, labeling both the aspect category and the aspect terms is essential for generating aspect-guided responses that learn the subtle differences between the different aspect terms within the same category. By exploring the numerous internet sources used for data crawling, we compile a predefined list of aspect categories. The aspect terms for each category are also listed for every domain. Crowd members and experts were instructed to label, from the predefined list shown in the dialogue-creation interface, the aspect categories along with the aspect terms contained in each utterance. The utterances with no aspect information, e.g., the opening and closing utterances of the dialogues, were marked with the None label to signify the absence. A group of 6 annotators was selected to verify the annotations done by the experts and the crowd workers on a set of 1500 dialogues. We observe a multi-rater Kappa agreement of approximately 75%, which may be considered a reliable estimate. Hence, it can be concluded that the annotation done by the experts and crowd workers for both the aspect categories and the aspect terms was correct.

Dataset Statistics
The statistics of the complete dataset covering all three domains are provided in Table 2. The dataset is divided into training, test, and validation sets containing 75%, 15%, and 10% of the conversations, respectively.

Methodology
For the proposed task, we assume that the aspect term information is provided for the response to be generated. As the different aspects are highly subjective in a goal-oriented system, the responses largely depend upon the respondent. Therefore, several responses are possible for a given input. Because of this subjectivity in goal-oriented systems, we focus on the task of generating responses with the desired aspect information.
Problem Definition: Our current work addresses the task of aspect-guided response generation in a multi-modal goal-oriented dialogue system, conditioned on a conversational history having both textual and visual information. To be more specific, given an utterance $U_k = (w_{k,1}, w_{k,2}, \ldots, w_{k,n})$, a set of images $I_k = (i_{k,1}, i_{k,2}, \ldots, i_{k,j})$, a conversational history $H_k = ((U_1, I_1), (U_2, I_2), \ldots, (U_{k-1}, I_{k-1}))$, and the aspect term $V_a$, the task is to generate the next textual response $Y = (y_1, y_2, \ldots, y_{n'})$, where $n$ and $n'$ are the lengths of the given input utterance and the response, respectively.

Background
Graph convolutional networks (GCNs) operate on a graph structure and compute representations for the graph nodes by looking at each node's neighbourhood. Precisely, let $G = (V, E)$ denote a directed graph, where $V$ is the set of nodes (let $|V| = i$) and $E$ is the set of edges. The input feature matrix over the $i$ nodes is represented by $X \in \mathbb{R}^{i \times j}$, where each node $n_k$ ($k \in V$) is denoted by a $j$-dimensional feature vector. By stacking $m$ layers of GCNs, we can account for the neighbours that are $m$ hops away from the current node. The hidden representation of a 1-layer GCN is a matrix $H \in \mathbb{R}^{i \times p}$, where each $p$-dimensional node representation captures the interaction with its 1-hop neighbours. Multiple GCN layers can be stacked together to capture interactions with nodes that are several hops away. In particular, the representation of node $k$ after the $m^{th}$ GCN layer can be formulated as:

$$h^{m+1}_{k} = \mathrm{ReLU}\Big(\sum_{l \in \mathcal{N}(k)} \big(W^{m}_{dir(k,l)}\, h^{m}_{l} + b^{m}_{dir(k,l)}\big)\Big), \quad \forall k \in V \qquad (1)$$

Here, $h^{m}_{l}$ is the representation of the $l^{th}$ node after the $(m-1)^{th}$ GCN layer, with $h^{1}_{k} = n_k$; $\mathcal{N}(k)$ is the neighbourhood of node $k$ (including $k$ itself); and $dir(k, l)$ indicates whether the information flows from $k$ to $l$, from $l$ to $k$, or $k = l$.
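To make Equation (1) concrete, the following is a minimal PyTorch sketch of one edge-direction-aware GCN layer; it is illustrative only (the class name, the three-way direction split, and the explicit edge loop are our assumptions), not the authors' released code.

```python
import torch
import torch.nn as nn

class DirectedGCNLayer(nn.Module):
    """One GCN hop with separate parameters per edge direction, as in Eq. (1)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # One linear map (W, b) per direction: along the arc, against it, and the self-loop.
        self.W = nn.ModuleDict({d: nn.Linear(in_dim, out_dim) for d in ("fwd", "rev", "self")})

    def forward(self, h, edges):
        # h: (num_nodes, in_dim) node features; edges: iterable of directed (k, l) pairs.
        msgs = [self.W["self"](h[v]) for v in range(h.size(0))]  # k == l term
        for k, l in edges:
            msgs[l] = msgs[l] + self.W["fwd"](h[k])              # information flowing k -> l
            msgs[k] = msgs[k] + self.W["rev"](h[l])              # information flowing l -> k
        return torch.relu(torch.stack(msgs))                     # (num_nodes, out_dim)
```

Stacking $m$ such layers, each taking the previous layer's output as input, gives every node access to its $m$-hop neighbourhood, as described above.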

Model Description
1. Utterance Encoder: For a given utterance $U_k$, we employ a bidirectional Gated Recurrent Unit (Bi-GRU) (Cho et al., 2014) to encode each word $w_{k,i}$, $i \in \{1, 2, \ldots, n\}$, having a $d$-dimensional embedding vector, into the hidden representation $h_{U_k,i}$. We concatenate the last hidden representations from the two unidirectional GRUs to form the final hidden representation of the given utterance as follows:

$$h^{txt}_{U_k} = [\,\overrightarrow{h}_{U_k,n} \, ; \, \overleftarrow{h}_{U_k,1}\,] \qquad (2)$$

Now, consider the dependency parse tree of the current utterance, denoted by $T_G = (V_G, E_G)$. We use an utterance-specific GCN that operates on $T_G$ and takes the word-level representations $\{h_{U_k,i}\}_{i=1}^{|V_G|}$ as the input to the first GCN layer. The node representation in the $m^{th}$ hop of the utterance-specific GCN is computed as:

$$g^{m+1}_{k} = \mathrm{ReLU}\Big(\sum_{l \in \mathcal{N}(k)} \big(W^{m}_{dir(k,l)}\, g^{m}_{l} + b^{m}_{dir(k,l)}\big)\Big), \quad \forall k \in V_G \qquad (3)$$

Here, $W^{m}_{dir(k,l)}$ and $b^{m}_{dir(k,l)}$ are the edge-direction-specific utterance-GCN weights and biases for the $m^{th}$ hop, and the first-hop input is $g^{1}_{i} = h_{U_k,i}$.
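Under the same caveat, a rough sketch of how the utterance encoder could combine the Bi-GRU states of Equation (2) with the dependency-tree GCN hops of Equation (3), reusing the DirectedGCNLayer class sketched above; the dimensions follow the experimental section, everything else is assumed.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    def __init__(self, emb_dim=300, hid_dim=600, gcn_dim=512, hops=2):
        super().__init__()
        self.bigru = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        dims = [2 * hid_dim] + [gcn_dim] * hops
        self.gcn = nn.ModuleList(DirectedGCNLayer(dims[m], dims[m + 1]) for m in range(hops))

    def forward(self, word_emb, dep_edges):
        # word_emb: (1, n, emb_dim) embeddings of one utterance; dep_edges: its dependency arcs.
        states, last = self.bigru(word_emb)            # states: (1, n, 2 * hid_dim)
        utt_repr = torch.cat([last[0], last[1]], -1)   # Eq. (2): concat final fwd/bwd states
        g = states.squeeze(0)                          # word-level states feed the first GCN hop
        for layer in self.gcn:                         # Eq. (3): m syntactic GCN hops
            g = layer(g, dep_edges)
        return utt_repr, g                             # utterance vector, GCN-refined word states
```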
2. Image Encoder: A pre-trained VGG-16 (Simonyan and Zisserman, 2015), a 16-layer deep convolutional neural network (CNN) trained on millions of images from the ImageNet dataset, is used for encoding the images. As a result, the network has learned rich features from a wide range of images. Here, it is used to extract the local image representation of every image in a dialogue turn, and these representations are concatenated together. The concatenated image vector is passed through a linear layer to form the global image context representation as given below:

$$h^{img}_{k} = W_I\,\big[\,\mathrm{VGG}(i_{k,1}) \, ; \, \mathrm{VGG}(i_{k,2}) \, ; \, \ldots \, ; \, \mathrm{VGG}(i_{k,j})\,\big] + b_I \qquad (4)$$

where $W_I$ and $b_I$ are the trainable weight matrix and bias, respectively. In every turn, the maximum number of images is $i \leq 6$; for text-only turns, zero vectors are used in place of the image representation.

Figure 2: Architectural diagram of the proposed framework for aspect-guided response generation

3. Context Encoder: As shown in Figure 2, the final hidden representations from the image and text encoders are concatenated for each turn and given as input to the context-level GRU. A hierarchical encoder, placed over the text and image encoders, is thus built to model the conversational history. The decoder GRU is initialized with the final hidden state of the context encoder:

$$h^{ctx}_{c,k} = \mathrm{GRU}\big(h^{ctx}_{c,k-1}, \, [\,h^{txt}_{U_k} \, ; \, h^{img}_{k}\,]\big) \qquad (5)$$

where $h^{ctx}_{c,k}$ is the final hidden representation of the context for a given turn.
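A similarly hedged sketch of the turn-level fusion in Equations (4)-(5): per-image VGG features are concatenated (with zero padding for text-only turns), projected to a global image context, joined with the utterance vector, and fed to the context-level GRU. The layer names, the fixed number of image slots, and the dimensions are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

MAX_IMAGES, VGG_FEAT = 6, 4096  # up to 6 images per turn, FC6 features from the VGG encoder

class ContextEncoder(nn.Module):
    def __init__(self, txt_dim=1200, img_dim=512, ctx_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(MAX_IMAGES * VGG_FEAT, img_dim)  # Eq. (4): W_I [ ... ] + b_I
        self.ctx_gru = nn.GRU(txt_dim + img_dim, ctx_dim, batch_first=True)

    def forward(self, utt_reprs, img_feats):
        # utt_reprs: (1, turns, txt_dim) utterance vectors from the text encoder.
        # img_feats: (1, turns, MAX_IMAGES * VGG_FEAT); zero vectors for text-only turns.
        img_ctx = self.img_proj(img_feats)                  # global image context per turn
        turn_in = torch.cat([utt_reprs, img_ctx], dim=-1)   # per-turn multi-modal fusion
        ctx_states, ctx_last = self.ctx_gru(turn_in)        # Eq. (5): hierarchical context GRU
        return ctx_states, ctx_last                         # ctx_last initialises the decoder GRU
```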

4. Decoder
In the decoding stage, we build another GRU that generates the response sequentially, based on the context hidden representation from the hierarchical encoder (context GRU) and the previously decoded words. We use input-feeding decoding along with the attention mechanism (Luong et al., 2015) to enhance the performance of the model. Using the decoder state $h^{dec}_{d,t}$ as the query vector, we apply attention over the hidden representations $h^{ctx}_{c,k}$ of the context-level encoder, scoring them through $W_h$ to obtain the context vector $c_t$. The decoder state and the context vector are concatenated and used to compute the final probability distribution over the output tokens:

$$P(y_t \mid y_{<t}, H_k) = \mathrm{softmax}\big(W_V \tanh\big(W_f\,[\,h^{dec}_{d,t} \, ; \, c_t\,]\big)\big) \qquad (6)$$

where $W_f$, $W_V$, and $W_h$ are the trainable weight matrices.
For generating responses with the specified aspects, as shown in Figure 2, we provide the aspect term embedding $V_a$ as input at every decoder time-step. In order to include the aspect vector in the decoder, we modify Equation (6) to incorporate the aspect information for response generation as follows:

$$P(y_t \mid y_{<t}, H_k, V_a) = \mathrm{softmax}\big(W_V \tanh\big(W_f\,[\,h^{dec}_{d,t} \, ; \, c_t \, ; \, V_a\,]\big)\big) \qquad (7)$$

5. Training and Inference: We employ the commonly used teacher forcing (Williams and Zipser, 1989) algorithm at every decoding step to minimize the negative log-likelihood under the model distribution. With $y^* = \{y^*_1, y^*_2, \ldots, y^*_m\}$ denoting the ground-truth output sequence for a given input, the training loss is:

$$\mathcal{L} = -\sum_{t=1}^{m} \log P\big(y^*_t \mid y^*_{<t}, H_k, V_a\big)$$

We apply uniform label smoothing (Szegedy et al., 2016) to alleviate the common issue of low diversity in dialogue systems, as suggested in (Jiang and de Rijke, 2018).
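The following single-step sketch illustrates the aspect-guided decoding of Equation (7): the aspect-term embedding $V_a$ is concatenated to the decoder input at every step, and attention over the context states produces the vector feeding the output projection. Input feeding and beam search are omitted for brevity, and the class name and dimensions are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectDecoderStep(nn.Module):
    def __init__(self, emb_dim=300, aspect_dim=300, ctx_dim=512, hid_dim=512, vocab=20000):
        super().__init__()
        self.cell = nn.GRUCell(emb_dim + aspect_dim, hid_dim)  # aspect embedding joins every step
        self.attn = nn.Linear(hid_dim, ctx_dim, bias=False)    # bilinear attention score (W_h)
        self.out = nn.Linear(hid_dim + ctx_dim, vocab)         # projection over the vocabulary

    def forward(self, prev_word_emb, aspect_emb, h_dec, ctx_states):
        # prev_word_emb: (1, emb_dim); aspect_emb (V_a): (1, aspect_dim)
        # h_dec: (1, hid_dim); ctx_states: (1, turns, ctx_dim) from the context encoder.
        h_dec = self.cell(torch.cat([prev_word_emb, aspect_emb], -1), h_dec)
        scores = torch.bmm(self.attn(h_dec).unsqueeze(1), ctx_states.transpose(1, 2))
        alpha = F.softmax(scores, dim=-1)                      # attention over dialogue turns
        c_t = torch.bmm(alpha, ctx_states).squeeze(1)          # context vector
        logits = self.out(torch.cat([h_dec, c_t], -1))         # distribution over output tokens
        return logits, h_dec
```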

Baseline Models
Model 1 (HRED): The first baseline is a simple hierarchical encoder-decoder framework that makes use of only textual information for generating the responses.
Model 2 (MHRED): The second baseline model is an extension of the HRED framework, where we incorporate the multi-modal information, i.e., the images, for the generation of coherent responses.
Model 3 (HRED + Aspect): In this model, in addition to the textual conversational information, we provide the desired aspect to the decoder for generating aspect-controlled responses.
Model 4 (MHRED + Aspect): To learn the aspect information at the decoder, we provide the aspect information to the decoder along with the textual and visual representations.

Experimental Details
In this section we present the details of the experimental setup and evaluation metrics.
Implementation details All the implementations were done using the PyTorch framework. For all the models, including the baselines, the batch size is set to 32. The utterance encoder is a bidirectional GRU with 600 hidden units in each direction. We use dropout (Srivastava et al., 2014) with probability 0.45. During decoding, we use beam search with beam size 10. The model parameters are initialized randomly from a Gaussian distribution using the Xavier scheme (Glorot and Bengio, 2010). The hidden size for all the layers is 512. AMSGrad (Reddi et al., 2019) is used as the optimizer for model training to mitigate slow convergence issues. We use uniform label smoothing with $\epsilon$ = 0.1 and perform gradient clipping when the gradient norm exceeds 5. We use 300-dimensional word embeddings initialized with GloVe (Pennington et al., 2014) embeddings pre-trained on Twitter. We consider the previous 2 turns as the dialogue history, and the maximum utterance length is set to 50. For the image representation, the FC6 (4096-dimensional) layer representation of VGG-19 (Simonyan and Zisserman, 2015), pre-trained on ImageNet, is used.
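For reference, a small configuration sketch mapping the stated training choices (AMSGrad, gradient clipping at norm 5, uniform label smoothing with $\epsilon$ = 0.1) to standard PyTorch calls; the function names and the model interface are ours, not the released code.

```python
import torch
import torch.nn as nn

def build_training(model, pad_idx=0):
    # Uniform label smoothing with eps = 0.1 (the label_smoothing argument needs torch >= 1.10).
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=pad_idx)
    optimizer = torch.optim.Adam(model.parameters(), amsgrad=True)  # AMSGrad optimizer
    return criterion, optimizer

def train_step(model, batch, criterion, optimizer):
    optimizer.zero_grad()
    logits, targets = model(batch)  # assumed interface: teacher-forced logits and gold token ids
    loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)      # clip when grad norm exceeds 5
    optimizer.step()
    return loss.item()
```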
Automatic evaluation metrics To evaluate our proposed framework at the content level, we report Perplexity (Chen et al., 1998). Lower perplexity scores signify that the generated responses are more grammatically correct and fluent. We also report results using the standard metrics BLEU-4 (Papineni et al., 2002) and Rouge-L (Lin, 2004) to measure the quality of the generated responses in capturing the correct information.
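As a reminder, the reported perplexity relates to the token-level negative log-likelihood as PPL = exp(NLL / #tokens); a tiny illustrative helper (not tied to any specific toolkit):

```python
import math

def corpus_perplexity(total_nll, total_tokens):
    """Perplexity = exp(average per-token negative log-likelihood)."""
    return math.exp(total_nll / total_tokens)
```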
Human evaluation metrics We randomly sample 700 generated responses from the test dataset for qualitative evaluation. Given an input along with the aspect information, three annotators with post-graduate exposure were assigned to evaluate the correctness, relevance, and domain and aspect consistency of the responses generated by the different approaches, using the following four metrics: (i) Fluency (F): This metric measures the grammatical correctness of the generated response, checking that the response is fluent and does not contain any errors; (ii) Relevance (R): It judges whether the generated response is relevant to the conversational history; (iii) Aspect Appropriateness (AP): For this metric, we check that the generated response is in consonance with the specified aspect (e.g., cuisine, color, type, etc.) and is also coherent with the conversational history; (iv) Domain Consistency (DC): This metric measures the consistency of the generated response with the domain being discussed. For the human evaluation metrics, we calculate Fleiss' kappa (Fleiss, 1971) to determine the inter-rater consistency. The kappa score is 0.75 for fluency and relevance, and 0.77 for aspect appropriateness and domain consistency, indicating substantial agreement.

Results and Discussion
In this section, we report the evaluation results along with the necessary analysis and discussion.
Automatic evaluation results: Evaluation results using the automatic evaluation metrics are provided in Table 3. From the table, it is clear that the proposed approach outperforms all the baseline models, and these improvements are statistically significant.

[Table 3: Model Description, Perplexity, BLEU-4, Rouge-L]

Lower perplexity indicates better generated responses; hence, it is visible that the perplexity score of the proposed M-GCN + Aspect model is the lowest among all the models. Compared to the text-based models, the multi-modal frameworks, such as MHRED and M-GCN, have lower perplexity scores, exhibiting improved performance. The reduction in perplexity scores for the aspect-guided models, both text-based and multi-modal, further ensures the robustness of these models for generating better responses. In the case of the BLEU-4 metric, we see that the proposed M-GCN + Aspect model, having the ability to generate responses according to the specified aspect information, achieves higher scores, with an improvement of 6.2% over the MHRED + Aspect baseline model. This superior performance establishes that the proposed model generates correct responses while preserving the information present in the ground-truth response, as BLEU-4 compares the generated response to the ground truth. Similarly, in the case of Rouge-L, there is an increase of 6.12% in comparison to the multi-modal HRED framework. The significant jump in performance indicates that images play a crucial role in generating contextually correct responses. As our research focus is on aspect-guided response generation in multi-modal dialogue systems, we see that the frameworks having aspect information outperform the other baseline models.
Human evaluation results: Along with the automatic evaluation, human evaluation is also essential for assessing the quality of the responses. Hence, for our specified task of generating responses in a multi-modal setup, we evaluate the baselines and our proposed model with the human evaluation metrics mentioned above. In Table 4, we present the human evaluation results for all the baselines and the proposed model. The fluency scores of the baseline HRED model are the lowest, mainly due to repetition and incomplete responses. As the current work revolves around aspects, the generated responses are assessed according to the specified aspects. It is evident from the results that the proposed framework generates responses that are appropriate to the specified aspects, with an improvement of 8.46% over the MHRED + Aspect baseline.
The improvement of the proposed model, with the aspect information provided additionally, is significantly higher compared to the other methods. This is majorly due to the very precise and fine-grained information available in the form of aspects of the products and/or services, and better memory re- [...]. From the human evaluation, it can be concluded that the generated responses are not only fluent and relevant but also consistent with the domain and the specified aspect information.
Error analysis: To gain better insights, we closely analyze the outputs generated by our proposed system and observe the following error scenarios: (i) Loss of information: The uni-modal baselines such as HRED generate responses that lack complete information. Gold: Here are the chairs in yellow color as in the 3rd image but not in round shapes as in the 5th image.; Predicted: The chairs are here but <unk> not in the shape. This indicates that the unavailability of multi-modal information (in this case, images) leads to the loss of information in the generated response. (ii) Contextually wrong domain: In some cases, our proposed framework generates responses that are contextually incorrect with respect to the domain. For example, with the aspect color, the generated response belongs to the electronics domain, whereas the actual domain under discussion is restaurant. This type of error occurs due to the higher number of utterances with the color aspect belonging to the electronics domain in contrast to the restaurant domain. (iii) Mistakes in image identification: The baseline and proposed frameworks in some cases confuse the images being discussed, leading to incorrect responses. As an example, Gold: I have beverages to go with the 2nd image but it is similar to the 4th one.; Predicted: I have got you beverages to go with the 4th image but nothing like the 3rd one. This indicates the model's inability to capture the correct positional information of the images. Also, the mention of different images in the contextual information confuses the model in selecting the correct images.

Conclusion and Future Work
Our current work emphasizes the task of generating aspect-guided responses in a multi-modal dialogue system. We create a large-scale task-oriented MDMMD dataset comprising dyadic dialogues. The dataset covers three different domains, namely restaurant, electronics, and furniture. We develop a GCN-based method to capture the textual representation, while we use VGG-19 for the image representation. The context encoder captures the multi-modal information from the utterances. The representation from the context encoder, along with the aspect vector, is fed to the decoder for generating the aspect-guided responses. Experimental results show that our proposed methodology outperforms the baseline models on both automatic and human evaluation metrics.
In the future, along with enhancing the architectural design of our proposed methodology, we would like to investigate methods for image retrieval towards complete multi-modal response generation. Furthermore, we would extend our method to deal with multiple aspects present in an utterance and generate the responses accordingly.
Acknowledgements
[...] Chatbot, sponsored by SERB, Govt. of India (IMP/2018/002072). Asif Ekbal acknowledges the Young Faculty Research Fellowship (YFRF), supported by the Visvesvaraya Ph.D. scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia).