A Knowledge-Grounded Multimodal Search-Based Conversational Agent

Multimodal search-based dialogue is a challenging new task: It extends visually grounded question answering systems into multi-turn conversations with access to an external database. We address this new challenge by learning a neural response generation system from the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017). We introduce a knowledge-grounded multimodal conversational model where an encoded knowledge base (KB) representation is appended to the decoder input. Our model substantially outperforms strong baselines in terms of text-based similarity measures (over 9 BLEU points, 3 of which are solely due to the use of additional information from the KB).

Our work builds upon the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017), which contains dialogue sessions in the e-commerce (fashion) domain. Figure 1 illustrates an example chat session with multimodal interaction between the user and the system. We focus on the task of generating textual responses conditioned on the previous conversational history. Traditional goal-oriented dialogue systems relied on a slot-filling approach to this task, i.e. explicit modelling of all attributes in the domain (Lemon et al., 2006; Wang and Lemon, 2013; Young et al., 2013). On the other hand, previous work on MMD data used direct learning from raw texts with implicit semantic representation only. This paper attempts to combine both approaches by learning to generate replies from raw user input while also incorporating Knowledge Base (KB) inputs (i.e. explicit semantics) into the generation process. We discuss how our model handles various user intents (request types) and the impact of incorporating the additional explicit semantic information from the KB on particular targeted intents. We use the user intent annotation and KB queries provided with the dataset for the purpose of this work.
Our main contribution is the resulting fully data-driven model for the task of conversational multimodal dialogue generation, grounded in conversational text history, vision and KB inputs. We also illustrate a method to improve context modelling over multiple images and show substantial improvements over the baseline. Finally, we present a detailed analysis of the outputs generated by our system corresponding to different user intents.

Related Work
With recent progress in deep learning, there is continued interest in tasks involving both vision and language, such as image captioning (Xu et al., 2015; Vinyals et al., 2015; Karpathy and Fei-Fei, 2015), visual storytelling (Huang et al., 2016), video description (Venugopalan et al., 2015b,a) or dialogue grounded in visual context (Antol et al., 2015; Das et al., 2017; Tapaswi et al., 2016). Bordes et al. (2016) and Ghazvininejad et al. (2017) presented knowledge-grounded neural models; however, these are uni-modal in nature, involve only textual interaction and do not take into account the conversational history in a dialogue. In contrast, our system grounds on a KB while also conditioning on previous dialogue context which is multimodal in nature, consisting of both textual and visual communication between the user and the system. We formulate our KB input from a database query (triggered by the system) similar to Sha et al. (2018), as described in Section 3.2.
Figure 1: Example chatlog depicting multimodal user-agent interaction in a dialogue session from the MMD dataset. The system needs to ground knowledge to generate responses related to product-specific attributes. We focus on textual response generation given a fixed-size conversational history.
Our model belongs to the encoder-decoder paradigm where sequence-to-sequence models (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) have become the de-facto standard for natural language generation. However, they tend to ignore the conversational history in a dialogue. The Hierarchical Recurrent Encoder Decoder (HRED) architecture (Serban et al., 2016, 2017; Lu et al., 2016) addresses this limitation by using a context recurrent neural network (RNN), forming a hierarchical encoder. We build upon these HRED models and refer to them as Text-only HREDs (T-HRED) in the following. Our model is most similar to the Multimodal HRED (M-HRED) of Saha et al. (2017), with context and KB extensions.
Figure 2: Schematic diagram of the hierarchical encoder described in Section 3.1. Figure 3 depicts the full pipeline of the model using knowledge base input. In contrast to Saha et al. (2017), we model over multiple images in a contextual dialogue turn by combining all 'local' representations of multiple images into a 'global' image representation per turn. We show a context of 2 turns for simplicity.

Knowledge-grounded Multimodal Conversational Model
While Saha et al. (2017) propose Multimodal HRED (M-HRED) by extending T-HRED to include visual context over images, they do not ground their dialogue context over an external database. Also, they limit the visual information by 'unrolling' multiple images to just use the last image of a single turn. For example, in Figure 1, Saha et al. (2017) consider only the last image of trousers as visual context in the Agent's response A4.
In contrast, we include all the images in a single turn using a linear layer (see Agarwal et al. (2018) for a detailed analysis). In addition, we devise a mechanism to ground our textual responses on a KB; Figure 3 depicts the full pipeline of our model. We combine textual and visual representations at the encoder level and pass it through the HRED's context encoder (cf. Figure 2), which learns the backbone of the conversation (see Section 3.1). Subsequently, we inject knowledge from the KB at the decoder level in each timestep (see Sections 3.2 and 3.3).
Formally, we model a dialogue as a sequence of utterances (turns), each of which is a sequence of words and images, where $t_n$ denotes the $n$-th utterance in a dialogue. The whole model is trained using cross-entropy on next-word prediction (cf. Eq. (2)). In the following, we explain all the different components of our model. We use the following notation: the recurrent functions $f_\theta$ (e.g. $f^{text}_\theta$ and $f^{dec}_\theta$) are all GRU cells (Cho et al., 2014) and $g^{enc}_\theta$ is a Convolutional Neural Network (CNN) image encoder. $\theta$ represents our model weights. $w_{n,m}$ is the $m$-th word in the $n$-th textual utterance. Similarly, $q_{m,n}$ and $c_{m,n}$ represent the input at each timestep in the query and entity encoder (see Section 3.2).
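One standard way to write such a model, given here as an assumption rather than as the exact formulation, is an autoregressive factorisation over turns combined with a per-word cross-entropy loss. The symbol $\mathcal{L}$, the conditioning sets and the use of $M_n$ for the number of words in turn $n$ are illustrative choices:

```latex
% Assumed autoregressive factorisation over the N turns of a dialogue
P_\theta(t_1, \ldots, t_N) = \prod_{n=1}^{N} P_\theta(t_n \mid t_{<n})

% Assumed next-word cross-entropy over the M_n words of each turn n
\mathcal{L}(\theta) = -\sum_{n=1}^{N} \sum_{m=1}^{M_n}
    \log P_\theta(w_{n,m} \mid w_{n,<m},\, t_{<n})
```

Under this reading, the images and KB inputs of earlier turns enter only through the conditioning on $t_{<n}$.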

Hierarchical Encoder
The encoder is formed of the following modules:
Utterance (Text) encoder: We pass each utterance (previous system responses as well as the current user query) in a given context through a text encoder. We use a bidirectional GRU ($f^{text}_\theta$) to generate the textual representation $h^{text}_{n,M_n}$ (cf. Eq. (3)). These textual representations are combined with image representations in each turn, forming the input for the context encoder.
Image encoder: We first extract the 'local' image representations for all images in a dialogue turn (denoted by $g^{enc}_\theta(img_k)$ in Eq. (4)) and concatenate them together. This concatenated vector is passed through a linear layer to form the 'global' image context for a single turn, denoted by $h^{img}_n$:
$$h^{img}_n = l^{img}\big([g^{enc}_\theta(img_1), \ldots, g^{enc}_\theta(img_k)]\big) \quad (4)$$
We used the VGGnet (Simonyan and Zisserman, 2015) CNN to obtain the local image representations. Since the number of images in a turn is ≤ 5, we consider zero vectors in the absence of images.
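The padding and concatenation step can be illustrated with a small dependency-free sketch. It assumes that missing image slots are filled with zero vectors at the end of the concatenation (the text does not say where they go); the names `concat_local_features` and `linear` are hypothetical helpers, and the weights are toy values rather than learned parameters.

```python
from typing import List, Sequence

MAX_IMAGES = 5  # the text states that a turn contains at most 5 images


def concat_local_features(local_feats: Sequence[Sequence[float]], dim: int,
                          max_images: int = MAX_IMAGES) -> List[float]:
    """Concatenate per-image vectors, zero-padding the unused image slots."""
    if len(local_feats) > max_images:
        raise ValueError("more images than available slots")
    flat: List[float] = []
    for vec in local_feats:
        if len(vec) != dim:
            raise ValueError("unexpected feature size")
        flat.extend(vec)
    # Assumption: absent images become zero vectors appended at the end.
    flat.extend([0.0] * (dim * (max_images - len(local_feats))))
    return flat


def linear(x: Sequence[float], weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Plain linear layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


# Toy usage: two images with 2-dimensional features instead of 4096.
turn = [[1.0, 2.0], [3.0, 4.0]]
flat = concat_local_features(turn, dim=2)
global_img = linear(flat, weight=[[1.0] * len(flat)], bias=[0.5])
```

Because the concatenated vector always has `max_images * dim` entries, the linear layer can have a fixed input size regardless of how many images a turn contains.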
Context encoder: The final hidden representations from both the text encoder, $h^{text}_{n,M_n}$, and the image encoder, $h^{img}_n$, are concatenated for each turn and serve as input to the context RNN (cf. Eq. (5)). On top of the text and image encoders, this builds a hierarchical encoder modelling the dialogue history. The final hidden state of the context RNN, $h^{cxt}_N$, acts as the initial state of the decoder RNN defined in Section 3.3.
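The three modules can be put together in a minimal PyTorch sketch. This illustrates the described data flow rather than the released implementation: the class name, the small layer sizes and the absence of padding masks are all assumptions made to keep the example short.

```python
import torch
import torch.nn as nn


class HierarchicalEncoder(nn.Module):
    """Text GRU + image linear layer per turn, context GRU over turns."""

    def __init__(self, vocab_size, emb_dim=64, hid=32, img_dim=16, max_images=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional utterance encoder: final state has size 2 * hid.
        self.text_rnn = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        # Linear layer mapping concatenated 'local' features to a 'global' one.
        self.img_linear = nn.Linear(img_dim * max_images, hid)
        # Context RNN reads one [text; image] vector per turn.
        self.context_rnn = nn.GRU(2 * hid + hid, hid, batch_first=True)

    def forward(self, words, images):
        # words:  (batch, turns, words_per_turn) token ids
        # images: (batch, turns, max_images, img_dim), zeros for missing images
        b, n, m = words.shape
        emb = self.embed(words.reshape(b * n, m))            # (b*n, m, emb_dim)
        _, h = self.text_rnn(emb)                            # (2, b*n, hid)
        h_text = torch.cat([h[0], h[1]], dim=-1)             # (b*n, 2*hid)
        h_img = self.img_linear(images.reshape(b * n, -1))   # (b*n, hid)
        turns = torch.cat([h_text, h_img], dim=-1).reshape(b, n, -1)
        _, h_cxt = self.context_rnn(turns)                   # (1, b, hid)
        return h_cxt.squeeze(0)                              # (b, hid)


encoder = HierarchicalEncoder(vocab_size=100)
words = torch.randint(0, 100, (2, 3, 7))   # 2 dialogues, 3 turns, 7 words each
images = torch.zeros(2, 3, 5, 16)          # no images present: all-zero slots
h_cxt = encoder(words, images)
```

The returned tensor plays the role of the final context state that initialises the decoder.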

Knowledge base (KB) input
The KB vector $h^{kb}_n$ in Eq. (8) is formed by concatenating the $h^{query}_n$ and $h^{ent}_n$ representations. While our approach is modelled around the MMD dataset, which provides contextual KB queries and profiles of celebrities endorsing specific products, it can be applied to other KBs with encoded queries and (optionally) properties of relevant entities.
Query encoder: Each chat session contains multiple queries to the database which retrieve the relevant product suited to user requirements at a specific turn. We replicate this query for subsequent dialogue turns until a new query is triggered by the system. This query acts as the knowledge base for the model at each turn. We show a sample input to the model in Figure 4. We used a unidirectional GRU cell to encode the query input as $h^{query}_n$.
Query: "search_criteria": { "name": {"driving shoes": 1.0}, "fit": {"tight": 1.0}, "brand": {"cirohuner": 1.0}, "image_type": {"front": 1.0}, "gender": {"men": 1.0}, "print": {"chain": 1.0} }
Knowledge base input: name driving shoes fit tight brand cirohuner image_type front gender men print chain
Figure 4: Sample query to the database and corresponding knowledge base input vector.
1. User: what kind of trousers are endorsed by celebrity cel_237?
   Intent: celebrity
   Subintent: does_celebrity_endorse_n
   Celebrity: cel_237
   Celebrity input: boxer briefs
2. User: which of the celebrities usually wear similar looking canvas shoes as in the 2nd image
   Intent: celebrity
   Subintent: which_celebrity_endorses_n
   Synset: canvas shoes
   Celebrity input: cel_987 cel_2 cel_316 cel_101
Figure 5: Two input scenarios for the entity encoder depending on the fine-grained user intent. If there is no 'celebrity' intent, we have an empty string as input to the entity encoder.
Entity encoder: The input to the entity encoder is a list of entities relevant to the query at hand (see Figure 5). GRU cells are used to produce the resulting $h^{ent}_n$. Specifically, the MMD dataset categorises products into synonym sets (synsets) and provides a list of celebrities endorsing each synset (see Section 4.1 for details).
This input is used specifically for the 'celebrity' intent in our model, where the user asks about celebrities endorsing a product. For each target prediction with celebrity intent, we first extract the relevant celebrity profiles using basic pattern matching over the user utterance. For each of the celebrities in the user query, we order the corresponding synsets by their probability of endorsement. If no celebrity is found, we use synset information from the query to extract celebrities which endorse the corresponding synset.
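The selection logic for the entity-encoder input can be sketched as follows. The regular expression, the two lookup tables and their endorsement probabilities are hypothetical stand-ins for the dataset metadata, and ordering the fallback celebrity list by probability is an assumption.

```python
import re
from typing import Dict

# Assumed identifier format, matching the examples in Figure 5 (e.g. cel_237).
CELEB_PATTERN = re.compile(r"cel_\d+")


def entity_input(utterance: str, intent: str, query_synset: str,
                 celeb_to_synsets: Dict[str, Dict[str, float]],
                 synset_to_celebs: Dict[str, Dict[str, float]]) -> str:
    """Build the token string fed to the entity encoder for one turn."""
    if intent != "celebrity":
        return ""  # no celebrity intent: empty input
    celebs = CELEB_PATTERN.findall(utterance)
    if celebs:
        # Celebrity named by the user: list synsets by endorsement probability.
        tokens = []
        for celeb in celebs:
            scores = celeb_to_synsets.get(celeb, {})
            tokens.extend(sorted(scores, key=scores.get, reverse=True))
        return " ".join(tokens)
    # No celebrity found: fall back to celebrities endorsing the query synset.
    scores = synset_to_celebs.get(query_synset, {})
    return " ".join(sorted(scores, key=scores.get, reverse=True))


# Hypothetical metadata, for illustration only.
celeb_to_synsets = {"cel_237": {"trousers": 0.4, "boxer briefs": 0.9}}
synset_to_celebs = {"canvas shoes": {"cel_2": 0.6, "cel_987": 0.8}}

by_celebrity = entity_input(
    "what kind of trousers are endorsed by celebrity cel_237?",
    "celebrity", "trousers", celeb_to_synsets, synset_to_celebs)
by_synset = entity_input(
    "which of the celebrities usually wear similar looking canvas shoes",
    "celebrity", "canvas shoes", celeb_to_synsets, synset_to_celebs)
```

The two calls correspond to the two scenarios of Figure 5, although the actual system may truncate or tokenise these lists differently.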

Input feeding decoder
We use an input feeding decoder with the attention mechanism of Luong et al. (2015). We concatenate the KB input $h^{kb}_n$ with the decoder input (cf. Eq. (10), where $h^{dec}_{n,0} = h^{cxt}_N$). The rationale behind this late fusion of the KB representation is that the KB input remains the same for a given context and does not change on each turn. On the other hand, images and textual responses together form a context in a dialogue turn, and thus we fuse them early at the encoder level. The decoder is trained using the cross-entropy loss defined in Eq. (2).
$$h^{dec}_{n,m} = f^{dec}_\theta\big(h^{dec}_{n,m-1},\, w_{n,m},\, h^{cxt}_{n-1},\, h^{kb}_{n-1}\big) \quad (10)$$
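A single decoder step in the spirit of Eq. (10) might look like the PyTorch sketch below. It is a simplification under stated assumptions: the Luong attention and input feeding are omitted, the class name and layer sizes are illustrative, and greedy decoding is used only to drive the loop.

```python
import torch
import torch.nn as nn


class KBDecoderStep(nn.Module):
    """One GRU step whose input is [word embedding; context; KB vector]."""

    def __init__(self, vocab_size, emb_dim=64, hid=32, kb_dim=48):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.GRUCell(emb_dim + hid + kb_dim, hid)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, prev_word, h_prev, h_cxt, h_kb):
        # Late fusion: the same KB vector is appended at every timestep.
        x = torch.cat([self.embed(prev_word), h_cxt, h_kb], dim=-1)
        h = self.cell(x, h_prev)
        return self.out(h), h


step = KBDecoderStep(vocab_size=100)
h_cxt = torch.zeros(2, 32)                  # final context state per dialogue
h_kb = torch.zeros(2, 48)                   # concatenated query/entity encoding
h = h_cxt                                   # decoder starts from the context state
word = torch.zeros(2, dtype=torch.long)     # assumed start-of-sequence id 0
for _ in range(3):
    logits, h = step(word, h, h_cxt, h_kb)
    word = logits.argmax(dim=-1)            # greedy choice, for illustration
```

Feeding `h_kb` at every step, instead of only into the initial state, keeps the KB information available throughout generation.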

Dataset
Our work is based on the Multimodal Dialogue (MMD) dataset (Saha et al., 2017), which consists of 150k chat sessions. User queries can be complex from the perspective of multimodal task-specific dialogue, such as "Show me more images of the 3rd product in some different directions". However, the task also relies heavily on the external KB to answer user queries about product attributes, such as "What is the brand/material of the suit in 3rd image?" or "Show something similar to 1st result but in a different material". The dataset contains raw chat logs as well as metadata information of the corresponding products. Around 400 anonymised celebrity profiles have been introduced in the system to emulate endorsement in recommendation, such as "What kind of slippers are endorsed by cel_145?". For each dialogue turn, manual annotations of the user intent are available. We use the intents to construct celebrity encodings. On average, each session contains 40 dialogue turns. The system response depends on the intent state of the user query and on average contains 8 words and 4 images per utterance. We created our own version of the dataset from the raw chat logs of the dialogue sessions and the metadata information. As discussed in Section 3.1, this was necessary to model the visual context over multiple images. We created the KB input to our model as described in Section 3.2 from the raw chat logs and the metadata information.
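The KB-input construction of Section 3.2 can be sketched in a few lines. The flattening below follows the sample in Figure 4 (attribute name followed by its values, with the weights dropped); the function names and the convention of using `None` for turns without a new query are assumptions.

```python
from typing import Dict, List, Optional

Criteria = Dict[str, Dict[str, float]]


def flatten_query(search_criteria: Criteria) -> str:
    """Turn a search_criteria mapping into a flat token string."""
    tokens: List[str] = []
    for attribute, values in search_criteria.items():
        tokens.append(attribute)
        tokens.extend(values.keys())   # the weights (e.g. 1.0) are dropped
    return " ".join(tokens)


def carry_forward(queries_per_turn: List[Optional[Criteria]]) -> List[str]:
    """Replicate the latest query for every turn until a new one appears."""
    kb_inputs: List[str] = []
    current = ""                       # before the first query: empty input
    for criteria in queries_per_turn:
        if criteria is not None:
            current = flatten_query(criteria)
        kb_inputs.append(current)
    return kb_inputs


sample = {
    "name": {"driving shoes": 1.0},
    "fit": {"tight": 1.0},
    "brand": {"cirohuner": 1.0},
    "image_type": {"front": 1.0},
    "gender": {"men": 1.0},
    "print": {"chain": 1.0},
}
kb_text = flatten_query(sample)
per_turn = carry_forward([None, sample, None, None])
```

Here `kb_text` reproduces the "Knowledge base input" string of Figure 4, and `per_turn` repeats it for the turns that follow the query.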

Implementation
We used PyTorch (Paszke et al., 2017) for our experiments. We did not use any kind of delexicalisation and rely on our model to directly learn from the conversational history and KB. All encoders and decoders are based on 1-layer GRU cells (Cho et al., 2014) with 512 as the hidden state size. We used the 4096-dimensional FC6 layer image representations from VGG-19 (Simonyan and Zisserman, 2015) provided by Saha et al. (2017). Adam (Kingma and Ba, 2015) was chosen as the optimizer, and we clipped gradients greater than 5. We experimented with different learning rates and settled on the value of 0.0004. Dropout of 0.3 is applied to all the RNN cells to avoid overfitting, and we perform early stopping by tracking the validation loss (with a single trial for each experiment).
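The stated hyperparameters translate into a training step along the following lines. This is a sketch on a toy model: the input size and vocabulary size are placeholders, dropout is applied to the GRU outputs as one possible placement, and clipping by gradient norm is an assumption, since the text does not say whether norms or individual values were clipped.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 20  # placeholder vocabulary size for the toy example

model = nn.ModuleDict({
    "rnn": nn.GRU(input_size=8, hidden_size=512, num_layers=1, batch_first=True),
    "drop": nn.Dropout(p=0.3),
    "out": nn.Linear(512, VOCAB),
})
optimizer = torch.optim.Adam(model.parameters(), lr=0.0004)


def train_step(x: torch.Tensor, targets: torch.Tensor) -> float:
    """One optimisation step with next-word cross-entropy."""
    optimizer.zero_grad()
    hidden, _ = model["rnn"](x)                       # (batch, time, 512)
    logits = model["out"](model["drop"](hidden))      # (batch, time, VOCAB)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
    loss.backward()
    # Assumption: clip the global gradient norm at 5.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()


loss_value = train_step(torch.randn(4, 6, 8), torch.randint(0, VOCAB, (4, 6)))
```

Early stopping would wrap this step in a loop that monitors the validation loss and keeps the best checkpoint.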

Analysis and Results
We evaluate our response generation using the BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007) and ROUGE-L (Lin and Och, 2004) automatic metrics, computed with the evaluation scripts provided by Sharma et al. (2017). We reproduce the baseline results from Saha et al. (2017). Table 1 summarises the results for our M-HRED model without incorporating KB information. Attention-based models consistently outperform their counterparts. Adding the visual inputs does not lead to major improvements (M-HRED vs. T-HRED for a given context). However, grounding in the KB gave a stark uplift (M-HRED-attn-kb vs. M-HRED-attn) for a given context size. Adding KB input boosts performance more for a shorter context than for a longer one. We conjecture that the longer context already contains some of the information in the KB queries, so the KB input has less impact when the longer context is included. Compare the difference between M-HRED-attn-kb and M-HRED-attn for a context of 2 (3 BLEU points) vs. 5 (2 BLEU points) in Table 1. Conversely, a longer context yields larger gains for the models without KB queries.
Table 2: BLEU scores for the entire corpus predictions for specific intents with a context of 5.
In summary, our best performing model (M-HRED-attn-kb) outperforms the model of Saha et al. (2017) by 9 BLEU points. We also analysed our generated outputs for different user intents, as shown in Table 2. As expected, intents such as 'show-similar-to' and 'sort-results' are relatively easy from the perspective of NLG, requiring no information about the product description; our model matches the reference almost perfectly.
We found great improvements for the 'ask-attribute' intent, where the KB-grounded model could correctly answer questions related to brand, colour and other attributes of the product, which resulted in an increase of 10 BLEU points on test instances with this user intent (M-HRED-attn-kb compared to M-HRED-attn). Similarly, in the example related to the 'buy' intent in Table 3, our model is able to learn that the product bought by the user is 'kurta', which probably cannot be captured by the visual features. Hence, M-HRED-attn produces 'jeans' in the output, whereas M-HRED-attn-kb learns this information from the KB. We also found that our BLEU score for the 'show-orientation' intent decreased w.r.t. the non-KB-grounded model. A detailed probe found that the orientations for retrieved images may not directly follow the description in the query (KB). There are other intents for which even the KB does not help, such as those requiring user modelling.
Table 3: Examples of predictions corresponding to different user intents, showcasing the effect of grounding in KB. We show textual context as well as relevant knowledge base input (and omit image context) for brevity's sake. While our model uses a context of 5, for simplicity, we show only 2 previous turns.

Conclusion and Future Work
This work focuses on the task of textual response generation in multimodal task-oriented dialogue systems. We used the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) for experiments and introduced a novel conversational model grounded in language, vision and a Knowledge Base (KB). Our best performing model outperforms the baseline model (Saha et al., 2017) by 9 BLEU points, improving context modelling in multimodal dialogue generation. Even though our model outputs showed a substantial improvement (over 3 BLEU points) from incorporating KB information, integrating visual context still remains a bottleneck, as also observed by Agrawal et al. (2016) and Qian et al. (2018). This suggests the need for a better mechanism to encode visual context.
Since our KB-grounded model assumes user intent annotation and KB queries as additional inputs, we plan to build a model to provide them automatically.