Ordinal and Attribute Aware Response Generation in a Multimodal Dialogue System

Multimodal dialogue systems have opened new frontiers beyond traditional goal-oriented dialogue systems. State-of-the-art dialogue systems are primarily based on unimodal sources, predominantly text, and hence cannot capture the information present in other sources such as videos, audio, and images. With the availability of the large-scale Multimodal Dialogue dataset (MMD) (Saha et al., 2018) in the fashion domain, the visual appearance of the products becomes essential for understanding the intention of the user. Without capturing information from both text and images, a system is incapable of generating correct and desirable responses. In this paper, we propose a novel position- and attribute-aware attention mechanism to learn an enhanced image representation conditioned on the user utterance. Our evaluation shows that the proposed model can generate appropriate responses while preserving position and attribute information. Experimental results also show that our proposed approach attains superior performance compared to the baseline models, and outperforms the state-of-the-art approaches on text-similarity based evaluation metrics.


Introduction
With the advancement of Artificial Intelligence (AI), dialogue systems have become a prominent part of today's virtual assistants, helping users converse naturally with a system for effective task completion. Dialogue systems fall into two broad categories: open-domain systems for casual chit-chat, and goal-oriented systems designed to solve a particular task for the user in a specific domain. Response generation is a crucial component of every conversational agent: the task of "how to say" the information to the user is the primary objective of every response generation module. One of the running goals of AI is to bring language and vision together in building robust dialogue systems. Advances in visual question answering (VQA) (Kim et al., 2016; Xiong et al., 2016; Ben-Younes et al., 2017) and image captioning (Anderson et al., 2018; Chen et al., 2018) have fostered interdisciplinary research in natural language processing (NLP) and computer vision. Recently, several works on dialogue systems incorporating both vision and language (Das et al., 2017a; Mostafazadeh et al., 2017) have shown promising research directions.
* First two authors are jointly the first authors.
Goal-oriented dialogue systems are largely based on textual data (a unimodal source). With increasing demands in domains such as retail, travel and entertainment, conversational agents that can combine different modalities are an essential requirement for building robust systems. Knowledge from different modalities carries complementary information about the various aspects of a product, event or activity of interest, and combining information from different modalities to learn better representations is crucial for creating robust dialogue systems. In a multimodal setup, the provision of different modalities assists both the user and the agent in achieving the desired goal. Our work is built upon the recently proposed Multimodal Dialogue (MMD) dataset (Saha et al., 2018), consisting of e-commerce (fashion domain) related conversations. The work focuses on generating textual responses conditioned on a conversational history consisting of both text and images.
In the existing task-oriented dialogue systems, the inclusion of visually grounded dialogues, as in the case of the MMD dataset, has provided exciting new challenges in the field of interactive dialogue systems. In contrast to VQA, multimodal dialogues have conversations with longer contextual dependency and a clear end-goal. As opposed to the static image in VQA, MMD deals with dynamic images, making the task even more challenging. In comparison to the previous slot-filling dialogue systems on textual data (Young et al., 2013; Rieser and Lemon, 2011), MMD provides an additional visual modality to drive the conversation forward.
In this work, we propose an entirely data-driven response generation model in a multimodal setup, combining the text and image modalities. In Figure 1, we present an example from the MMD dataset: a conversation between a user and the system in a multimodal setting in the fashion domain. From the example, it is evident that the position of the images is essential for the system to fulfill the demands of the user. For instance, utterance U3, "Can you tell me the type of colour in the 1st image", requires position information of the particular image within the given set of images. To handle such situations, we incorporate position embeddings to capture ordered visual information; the underlying motivation is to identify the correct image from the text, hence we use a position-aware attention mechanism. In utterance U5 of Figure 1, the user is also keen on different aspects of the image, in this case the "print as in the 2nd image". To focus on and capture the different image attributes being discussed in the text, we apply attribute-aware attention on the image representation. Hence, to handle such situations present in the dataset, we apply both position- and attribute-aware attention mechanisms to capture intricate details from the image and textual features. Since multimodal feature distributions vary dramatically, the integrated image-text representations obtained by simple linear models may not be sufficient to capture the complex interactions between the visual and textual modalities; for effective interaction among the modalities, we therefore use the Multimodal Factorized Bilinear (MFB) (Yu et al., 2017) pooling mechanism. The information from the current utterance, the images and the contextual history are all essential for better response generation (Serban et al., 2015).
The key contributions/highlights of our current work are as follows: • We employ a position-aware attention mechanism to incorporate the ordered visual information, and an attribute-aware attention mechanism to focus on the image conditioned on the attributes discussed in the text.
• We utilize Multi-modal Factorized Bilinear (MFB) model to fuse the contextual information along with image and utterance representation.
• We achieve state-of-the-art performance for the textual response generation task on the MMD dataset.
The rest of the paper is structured as follows: In Section 2, we discuss the related work. In Section 3, we explain the proposed methodology, followed by the dataset description in Section 4. Experimental details and evaluation metrics are reported in Section 5. Results along with the necessary analysis are presented in Section 6. In Section 7, we conclude the paper along with future research directions.

Related Work
Research on dialogue systems has been a major attraction for a long time. In this section we briefly discuss some of the prominent research carried out on unimodal and multimodal dialogue systems.

Unimodal Dialogue Systems
Dialogue systems have mostly focused on a single modality, such as text; hence, there have been several works on data-driven textual response generation. Response generation provides the medium through which a conversational agent communicates with its user to help them achieve their desired goals. In (Ritter et al., 2011), the authors used social media data for response generation following a machine translation approach. The effectiveness of deep learning has brought remarkable improvements to dialogue generation, and deep neural models have proved quite beneficial for modelling conversations (Vinyals and Le, 2015; Li et al., 2016a,b; Shang et al., 2015). A context-sensitive neural language model was proposed where the model chooses the most probable response given the textual conversational history. In (Serban et al., 2015, 2017), the authors proposed a hierarchical encoder-decoder model for capturing the dependencies among the utterances of a dialogue. Conditional auto-encoders have been employed in (Zhao et al.; Shen et al., 2018) to generate diverse replies by capturing discourse-level information in the encoder. Our current work differs from these existing works on dialogue systems in that we generate the appropriate responses by capturing information from both text and images, conditioned on the conversational history.

Multimodal Dialogue Systems
With the recent shift towards interdisciplinary research, dialogue systems combining different modalities (text, images, video) have been investigated for creating robust conversational agents. Dialogue generation combining information from text and images (Das et al., 2017a,b; Mostafazadeh et al., 2017; Gan et al., 2019; De Vries et al., 2017) has been successful in bridging the gap between vision and language. Our work differs from these, as a conversation in the Multimodal Dialogue (MMD) dataset (Saha et al., 2018) deals with multiple images, and the progress of the conversation depends on both image and text, as opposed to a conversation grounded in a single image. Lately, with the release of the DSTC7 dataset, video and textual modalities have been explored in (Lin et al., 2019; Le et al., 2019). Prior works on the MMD dataset (Agarwal et al., 2018a,b; Liao et al., 2018) have captured the information in the form of knowledge bases using a hierarchical encoder-decoder model.
Our work differs from these existing works on the MMD dataset in that we incorporate position- and attribute-aware attention mechanisms for capturing ordered information and minute details, such as colour and style, from the image representations for more accurate response generation. Our method, unlike the previous works, makes use of the MFB technique for better information fusion across the different modalities. The approach that we propose to capture and integrate information from image and text is novel, and we demonstrate the effectiveness of our proposed model in generating responses through extensive empirical analysis.

Methodology
In this section we first define the problem and then present the details of the proposed method.

Problem Definition
In this paper, we address the task of textual response generation conditioned on the conversational history, as proposed in (Saha et al., 2018). A dialogue consists of text utterances along with multiple images, and given a context of k turns the task is to generate the next textual response. More precisely, given a user utterance U_k = (w_{k,1}, w_{k,2}, ..., w_{k,n}), a set of images I_k = (img_{k,1}, img_{k,2}, ..., img_{k,n}) and a conversational history H_k = ((U_1, I_1), (U_2, I_2), ..., (U_{k-1}, I_{k-1})), the task is to generate the next textual response Y_k = (y_{k,1}, y_{k,2}, ..., y_{k,n}).

Hierarchical Encoder Decoder
We construct a response generation model, as shown in Figure 2(a), which is an extension of the recently introduced Hierarchical Recurrent Encoder-Decoder (HRED) architecture (Serban et al., 2016, 2017). As opposed to standard sequence-to-sequence models, the dialogue context is modelled by a separate context Recurrent Neural Network (RNN) over the encoder RNN, thus forming a hierarchical encoder. The multimodal HRED (MHRED) builds upon HRED to include both the text and image modalities.
The key components of MHRED are the utterance encoder, image encoder, context encoder and decoder.
Utterance Encoder: Given an utterance U_m, a bidirectional Gated Recurrent Unit (BiGRU) is employed to encode each word w_{m,i}, i ∈ (1, ..., n), represented by a d-dimensional embedding, into the hidden vectors h_{m,U,i}.
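To make the encoder concrete, here is a minimal pure-Python sketch of a GRU cell and a bidirectional pass. It is illustrative only: biases are omitted, the parameter names (`Wz`, `Uz`, etc.) are ours, and a real implementation would use a deep-learning framework such as PyTorch.

```python
import math

def matvec(W, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def vadd(*vs):
    """Element-wise sum of vectors."""
    return [sum(t) for t in zip(*vs)]

def gru_cell(x, h, P):
    """One GRU step. P holds the weight matrices Wz, Uz, Wr, Ur, Wh, Uh
    (biases omitted for brevity)."""
    sig = lambda v: [1.0 / (1.0 + math.exp(-a)) for a in v]
    z = sig(vadd(matvec(P['Wz'], x), matvec(P['Uz'], h)))        # update gate
    r = sig(vadd(matvec(P['Wr'], x), matvec(P['Ur'], h)))        # reset gate
    h_tilde = [math.tanh(a) for a in vadd(
        matvec(P['Wh'], x),
        matvec(P['Uh'], [ri * hi for ri, hi in zip(r, h)]))]     # candidate state
    return [(1 - zi) * hi + zi * hti
            for zi, hi, hti in zip(z, h, h_tilde)]

def bigru_encode(seq, P):
    """Bidirectional encoding: run the same cell forward and backward
    over the word embeddings and concatenate the per-step states."""
    d = len(P['Uz'])
    fwd, h = [], [0.0] * d
    for x in seq:
        h = gru_cell(x, h, P)
        fwd.append(h)
    bwd, h = [], [0.0] * d
    for x in reversed(seq):
        h = gru_cell(x, h, P)
        bwd.append(h)
    bwd.reverse()
    return [f + b for f, b in zip(fwd, bwd)]
```

In practice both directions have separate parameters; the sketch shares them only to keep the example short.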
Image Encoder: A pre-trained VGG-19 model (Simonyan and Zisserman, 2014) is used to extract image features for all the images in a given dialogue turn. The concatenation of the individual image features is passed through a single linear layer to obtain a global image context representation:

h_{m,I} = W_I [f_{m,1}; f_{m,2}; ...; f_{m,5}] + b_I

where f_{m,j} is the VGG-19 feature of the j-th image, and W_I and b_I are the trainable weight matrix and bias, respectively. The number of images in a single turn is at most 5; zero vectors are used in the absence of images.
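A minimal sketch of this step, assuming at most five images per turn and a single linear projection (the function and variable names are illustrative, not the authors' code):

```python
def encode_image_turn(feature_vecs, W, b, max_images=5):
    """Zero-pad the per-image feature vectors of one turn to max_images
    slots, concatenate them, and apply one linear layer W x + b."""
    d = len(W[0]) // max_images   # per-image feature size implied by W
    padded = list(feature_vecs) + [[0.0] * d] * (max_images - len(feature_vecs))
    concat = [x for vec in padded for x in vec]
    return [sum(wij * xj for wij, xj in zip(row, concat)) + bi
            for row, bi in zip(W, b)]
```

With real VGG-19 features each `vec` would be 4096-dimensional; here any length works as long as it matches `W`.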
Context-level Encoder: The final hidden representations from both the image and text encoders are concatenated for every turn and fed as input to the context GRU, as shown in Figure 2(b). A hierarchical encoder is thus built on top of the image and text encoders to model the dialogue history. The final hidden state of the context GRU serves as the initial state of the decoder GRU.
Decoder: In the decoding stage, another GRU generates words sequentially, conditioned on the final hidden state of the context GRU and the previously decoded words. An attention mechanism similar to (Luong et al., 2015) is incorporated to enhance the performance of the decoder GRU. The attention layer is applied to the hidden states h_{c,i} of the context encoder using the decoder state d_t as the query vector, and the concatenation of the context vector and the decoder state is used to compute the final probability distribution over the output tokens:

α_t(i) = softmax(d_t^T W_h h_{c,i})
c_t = Σ_i α_t(i) h_{c,i}
p(y_t | y_{<t}) = softmax(W_V tanh(W_h~ [c_t; d_t]))

where W_h, W_V and W_h~ are trainable weight matrices.
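The attention step can be sketched in a few lines of pure Python. This is the plain dot-product variant for illustration; the general (Luong et al., 2015) form also includes a learned matrix in the score.

```python
import math

def luong_attention(d_t, H):
    """Score the decoder state d_t against each context-encoder state
    in H, softmax the scores, and return the attention weights and
    the resulting context vector."""
    scores = [sum(di * hi for di, hi in zip(d_t, h)) for h in H]
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    alpha = [e / Z for e in exps]
    context = [sum(a * h[j] for a, h in zip(alpha, H))
               for j in range(len(H[0]))]
    return alpha, context
```

The context vector is then concatenated with `d_t` before the output projection, as in the equations above.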

Proposed Model
To improve the performance of the MHRED model, rather than simply concatenating the representations from the text and image encoders, we apply an attention layer to mask out irrelevant information. In our case, we apply attention to learn where to focus and what to focus upon, as described in the user utterance. To decouple these two tasks, we augment the encoder with position- and attribute-aware attention mechanisms.

Position-aware Attention:
In the baseline MHRED model, we incorporate position information of the images to improve the performance of the system. For an utterance such as "List more in colour as the 4th image and style as in the 1st image", the ordered information of the images is essential for generating the correct textual response. Hence, knowledge of every image with respect to its position is necessary so that the agent can capture the correct information and fulfill the objective of the customer. The lack of position information in the baseline MHRED model causes quite a few errors in focusing on the right image. To alleviate this issue, we fuse a position embedding with the features of every image: the position of every image is represented by a position embedding PE_i, where PE = [PE_1, ..., PE_n], and this embedding is concatenated to the corresponding image features. Following self-attention (Wang et al., 2017), we compute a self-attended text embedding and use it as a query vector U_p to calculate the attention distribution over the position embeddings PE. Finally, in our proposed model, as shown in Figure 3, we incorporate position-aware and attribute-aware attention mechanisms to provide focused information conditioned on the text utterance. We concatenate the U_a and U_p vectors to form the final utterance representation U_f, and the I_a and I_p vectors to form the final image representation I_f. The output of the context encoder h_c, along with I_f and U_f, serves as input to the MFB module. Here, we compute the MFB between I_f and U_f:
MFB(I_f, U_f) = SumPooling(W_m I_f ∘ W'_m U_f, k)

where W_m and W'_m are trainable parameters, ∘ denotes the element-wise product, and the SumPooling function is the same as described in (Gan et al., 2019). Similarly, we take the pairwise MFB combinations of I_f, U_f and h_c, and concatenate them as the final output of our multimodal fusion module:

h_d = [MFB(I_f, U_f); MFB(I_f, h_c); MFB(U_f, h_c)]

where h_d is used to initialize the decoder.
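A toy sketch of the factorized bilinear pooling step follows. The matrix sizes and names are illustrative, and the power- and l2-normalization steps of the full MFB formulation are omitted for brevity.

```python
def matvec(W, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def sum_pooling(v, k):
    """Non-overlapping sum pooling with window size k."""
    return [sum(v[i:i + k]) for i in range(0, len(v), k)]

def mfb(x, y, Wx, Wy, k):
    """Multimodal Factorized Bilinear pooling: project both inputs to
    k*o dimensions, take the element-wise product, then sum-pool every
    k elements down to an o-dimensional fused vector."""
    px = matvec(Wx, x)
    py = matvec(Wy, y)
    return sum_pooling([a * b for a, b in zip(px, py)], k)
```

The factor k trades off expressiveness against parameter count; the fused vectors for each modality pair are then concatenated as in the equation above.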

Training and Inference
We employ the commonly used teacher forcing (Williams and Zipser, 1989) algorithm at every decoding step to minimize the negative log-likelihood under the model distribution. We define y* = (y*_1, y*_2, ..., y*_m) as the ground-truth output sequence for a given input. We apply uniform label smoothing (Szegedy et al., 2016) to alleviate the common issue of low diversity in dialogue systems, as suggested in (Jiang and de Rijke, 2018).

Baseline Models
For our experiments, we develop the following models: Model 1 (MHRED): The first model is the baseline MHRED model described in Section 3.2.
Model 2 (MHRED + A): In this model, we apply attention (A) on the text and image features rather than merely concatenating the features.
Model 3 (MHRED + A + PE): In this model, position embeddings (PE) of every image are concatenated with the respective image features to provide ordered visual information of the images.
Model 4 (MHRED + PA): Self-attention on the text representations with respect to position information is computed to generate a query vector. This query vector is used to learn the attention distribution on the position embeddings to focus on the discussed image in user utterance.
Model 5 (MHRED + AA): To learn the different attributes discussed in the text, we apply self-attention on the text representation and compute a query vector that attends over the image representation in accordance with the attributes in the text.
Model 6 (MHRED + PA + AA): In this model, the final text and image representations, denoted as U f and I f , respectively, and obtained after applying the position and attribute aware attention, are concatenated and fed as input to the context encoder.
Model 7 (MHRED + MFB(I, T)): An MFB module is employed to learn the complex associations between the textual and visual features. The final text representation (T) U_f and the final image representation (I) I_f are fed as input to the MFB module.
Model 8 (MHRED + MFB(I,T,C)): In this model, we concatenate the pairwise outputs of the MFB module applied to the contextual information (C), i.e., the output of the context encoder h_c, along with the text and image representations.
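The position-aware attention used in Models 4 and 6 can be sketched as follows. This is a simplified pure-Python sketch: for brevity it assumes the position embeddings live in the same space as the text embeddings, whereas the model learns separate projections, and all names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attended_query(token_vecs, w):
    """Collapse token embeddings into one query vector using a learned
    scoring vector w (additive self-attention, simplified)."""
    alpha = softmax([dot(w, t) for t in token_vecs])
    return [sum(a * t[j] for a, t in zip(alpha, token_vecs))
            for j in range(len(token_vecs[0]))]

def position_aware_attention(token_vecs, w, PE):
    """Attend over the per-image position embeddings PE using the
    self-attended text query, yielding a distribution over image slots."""
    U_p = self_attended_query(token_vecs, w)
    return softmax([dot(U_p, pe) for pe in PE])
```

If the self-attention puts its mass on a positional word such as "3rd", the resulting distribution over `PE` concentrates on the corresponding image slot; the attribute-aware branch works analogously, attending over image features instead of position embeddings.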

Datasets
Our work is built upon the Multimodal Dialogue (MMD) dataset (Saha et al., 2018), which comprises 150k chat sessions between customers and sales agents. Table 1 lists detailed statistics of the MMD dataset. Domain-specific knowledge in the fashion domain was captured during the series of customer-agent interactions. The dialogues incorporate text and image information seamlessly in a conversation, bringing together multiple modalities for creating advanced dialogue systems. The dataset poses new challenges for multimodal, goal-oriented dialogue, containing complex user utterances. For example, "Can you show me the 5th image in different orientations within my budget?" requires quantitative inference such as filtering, counting and sorting. Bringing the textual and image modalities together, multimodal inference makes the generation task even more challenging, for example, "See the second stilettos, I want to see more like it but in a different colour". In our work, we use the version of the dataset described in (Agarwal et al., 2018a,b), which captures the multiple images of each turn as one concatenated context vector for every turn in a given dialogue.

Experiments
In this section we present the implementation details and the evaluation metrics (automatic and human) that we use for measuring the model performance.

Implementation Details
All the implementations are done using the PyTorch framework. We use 512-dimensional word embeddings and 10-dimensional position embeddings as described in (Vaswani et al., 2017). We use dropout (Srivastava et al., 2014) with probability 0.45. During decoding, we use beam search with beam size 10. We initialize the model parameters randomly from a Gaussian distribution with the Xavier scheme (Glorot and Bengio, 2010). The hidden size for all the layers is 512. We employ AMSGrad (Reddi et al., 2019) as the optimizer for model training to mitigate slow convergence issues. We use uniform label smoothing with ε = 0.1 and perform gradient clipping when the gradient norm exceeds 5. For the image representation, the FC6-layer representation (4096-dimensional) of VGG-19 (Simonyan and Zisserman, 2014), pre-trained on ImageNet, is used.
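The gradient clipping at norm 5 mentioned above can be sketched as follows (equivalent in spirit to PyTorch's `torch.nn.utils.clip_grad_norm_`, written out in pure Python for clarity):

```python
import math

def clip_grad_norm(grads, max_norm=5.0):
    """Rescale a list of gradient vectors so their global l2 norm
    does not exceed max_norm; gradients below the threshold are
    returned unchanged."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads]
```

Clipping the global norm, rather than each parameter independently, preserves the direction of the update while bounding its magnitude.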

Automatic Evaluation
For evaluating the models we report the standard metrics BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004) and METEOR (Lavie and Agarwal, 2007), employing the evaluation scripts made available by (Sharma et al., 2017).

Human Evaluation
To assess the quality of the responses, we adopt human evaluation to compare the performance of the different models. We randomly sample 700 responses from the test set for human evaluation. Given an utterance, the images along with the conversation history were presented to three human annotators with post-graduate-level exposure.
They were asked to measure the correctness and relevance of the responses generated by the different models with respect to the following two metrics:
1. Fluency (F): The generated response is grammatically correct and free of any errors.
2. Relevance (R): The generated response is in accordance with the aspect being discussed (style, colour, material, etc.) and contains the information with respect to the conversational history, with no loss of attributes/information in the generated response.
We follow this scoring scheme for fluency and relevance: 0: incorrect or incomplete; 1: moderately correct; 2: correct. We compute Fleiss' kappa (Fleiss, 1971) for the above metrics to measure inter-rater consistency. The kappa score is 0.75 for fluency and 0.77 for relevance, indicating "substantial agreement".
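For reference, Fleiss' kappa over N items rated by n annotators into k categories can be computed as follows (a standard textbook sketch, not the authors' code):

```python
def fleiss_kappa(ratings):
    """ratings: one row per item, one column per category; each cell is
    the number of annotators who chose that category for that item."""
    N = len(ratings)
    n = sum(ratings[0])            # annotators per item (constant)
    k = len(ratings[0])
    # per-item observed agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    # chance agreement from the marginal category proportions
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Values above roughly 0.6 are conventionally read as "substantial agreement", which matches the scores reported above.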

Results and Analysis
In this section we present the detailed experimental results using both automatic and human evaluation metrics, and also analyse the errors that our current model makes.

Automatic Evaluation Results
Results of the different models are presented in Table 2. The proposed model performs better than the other baselines on all the evaluation metrics, and we find this improvement to be statistically significant.

[Table 2: BLEU-4, METEOR and ROUGE-L scores. Recovered state-of-the-art rows: MHRED-attn (Agarwal et al., 2018a): 0.4451 / 0.3371 / 0.6799; MHRED-attn-kb (Agarwal et al., 2018b): scores not recovered.]

The results are reported for context size 5 due to its superior performance in comparison to context size 2, as shown in (Agarwal et al., 2018a,b). The MHRED model is a decent baseline with good scores (0.6725 ROUGE-L, 0.4454 BLEU). The application of attention over the text and image representations, as opposed to concatenation, provides an absolute improvement of +0.85% in METEOR, along with gains in the other metrics. To provide ordered visual information in Model 3, we incorporate positional embeddings for the images, which boost the performance of text generation by +0.94% in BLEU and +0.58% in ROUGE-L. The improved performance shows the effectiveness of position embeddings for the images in a multimodal dialogue setting. The efficiency of the position-aware and attribute-aware attention mechanisms (Model 6) can be seen in the increased performance with respect to Model 4 and Model 5, with improvements of 0.68% and 0.6% in ROUGE-L, respectively. The MFB based fusion technique helps to improve the performance of the generation model (Model 8), with an improvement of 3.82% in BLEU with respect to the baseline model and of 0.26% in comparison to Model 6. The final proposed model (MHRED + PA + AA + MFB(I,T,C)), incorporating the position- and attribute-aware attention mechanisms along with MFB fusion, attains state-of-the-art performance with improvements of 3.23% in BLEU, 3.31% in ROUGE-L and 2.34% in METEOR in comparison to the existing approaches (Agarwal et al., 2018b). Example 1 in the figure shows that the model can focus on the correct image (in this case, the 3rd image) with the help of the position-aware attention mechanism, as the focus is given to the word 3rd in the utterance.
Example 2 shows the effect of both the position- and attribute-aware attention mechanisms, which help in more accurate response generation: the positional word 2nd, along with the attribute rubber, obtains the maximum focus in the given example. In Example 3, we can see the effect of the attribute-aware attention mechanism, with the maximum attention given to the keywords dark, red and frame in the utterance.

Human Evaluation Results
In Table 3, we present the human evaluation results. In the case of fluency, the baseline MHRED model and the proposed model show quite similar performance, while for the relevance metric our proposed model shows better performance, with an improvement of 7.47% in generating correct responses. This may be because our proposed model focuses on the relevant information in both the text and the image, and generates more accurate and informative responses. All the results are statistically significant according to Welch's t-test (Welch, 1947), conducted at the 5% (0.05) significance level.

Error Analysis
We analyse the outputs generated by our proposed model to perform a detailed qualitative analysis of the responses. In Figure 5, we present a few examples of the responses generated by the different models given the image and utterance as input. Some commonly occurring errors include:
1. Unknown tokens: As the baseline MHRED model uses the basic sequence-to-sequence framework, it predicts the largest number of unknown tokens. The model also often predicts the 'end of sequence' token just after an 'out of vocabulary' token, thus leaving sequences incomplete. Gold: ..the type of the chinos is cargo in the 1st and 2nd image; Predicted: .. the type
2. Extra information: The proposed model sometimes generates sentences with more information than the ground-truth response, due to multiple occurrences of these attributes together in the data. Gold: the jackets in the 1st, 2nd and 5th images will suit well for dry clean; Predicted: the jackets in the 1st, 2nd and 5th images will suit well for dry clean, regular, cold, hand clean.
3. Repetition: The baseline, as well as the proposed model in a few cases, keeps repeating the information present in a given utterance. Gold: it can go well with cropped type navy sweater; Predicted: it can go well with navy style, navy neck, navy style, navy neck sweater and with.
4. Incorrect products: The model generates incorrect products in the predicted utterance compared to those present in the original utterance, as different products have similar attributes. Gold: it can go well with unique branded, black colouring, chic type hand bag; Predicted: it can go well with black frame colour sunglasses.
5. Wrong choice of images: The model focuses on incorrect images with respect to the conversational history, due to the discussion of multiple images in the history. Gold: the upper material in the 2nd image is rubber lace; Predicted: the upper material in the 4th image is leather.

Conclusion
In this paper, we have proposed an ordinal- and attribute-aware attention mechanism for natural language generation exploiting images and text. In a multimodal setting, the information sharing between the modalities is significant for proper response generation, thereby leading to customer satisfaction. We incorporate the MFB fusion technique along with position- and attribute-aware attention mechanisms for effective knowledge integration from the textual and visual modalities. On the recently released MMD dataset, the incorporation of our proposed techniques has improved performance on the task of textual response generation. In qualitative and quantitative analyses of the generated responses, we have observed contextually correct and informative responses, along with the minor inaccuracies discussed in the error analysis section. Overall, our model generates more accurate responses than the other models while keeping the attribute and position information of the generated responses intact.
In the future, along with extending the architectural design and training methodologies to enhance the performance of our systems, we look forward to designing a specific component to strengthen the natural language generation module of an end-to-end chatbot, and to including image generation and retrieval systems for the completion of a multimodal dialogue system.