Natural Answer Generation with Heterogeneous Memory

Memory augmented encoder-decoder framework has achieved promising progress for natural language generation tasks. Such frameworks enable a decoder to retrieve from a memory during generation. However, less research has been done to take care of the memory contents from different sources, which are often of heterogeneous formats. In this work, we propose a novel attention mechanism to encourage the decoder to actively interact with the memory by taking its heterogeneity into account. Our solution attends across the generated history and memory to explicitly avoid repetition, and introduce related knowledge to enrich our generated sentences. Experiments on the answer sentence generation task show that our method can effectively explore heterogeneous memory to produce readable and meaningful answer sentences while maintaining high coverage for given answer information.


Introduction
Most previous question answering systems focus on finding candidate words, phrases or sentence snippets from many resources, and ranking them for their users (Chu-Carroll et al., 2004;Xu et al., 2016). Typically, candidate answers are collected from different resources, such as knowledge base (KB) or textual documents, which are often with heterogeneous formats, e.g., KB triples or semi-structured results from Information Extraction (IE). For factoid questions, a single answer word or phrase is chosen as the response for users, as shown in Table 1 (A1). However, in many real-world scenarios, users may prefer more natural responses rather than a single word. For example, as A2 in Table 1, James Cameron directed the Titanic. is more favorable than the single name James Cameron. A straightforward solution to compose an answer sentence is to build a template based model, where the answer Q Who is the director of the Titanic? A1 James Cameron A2 James Cameron directed the Titanic.

A3
James Cameron directed it.

A4
James Cameron directed it in 1999. word James Cameron and topic word in the question the Titanic are filled into a pre-defined template (Chu-Carroll et al., 2004). But such systems intrinsically lack variety, hence hard to generalize to new domains.
To produce more natural answer sentences, Yin et al. (2015) proposed GenQA, an encoderdecoder based model to select candidate answers from a KB styled memory during decoding to generate an answer sentence. CoreQA (He et al., 2017b) further extended GenQA with a copy mechanism to learn to copy words from the question. The application of attention mechanism enables those attempts to successfully learn sentence varieties from the memory and training data, such as usage of pronouns (A3 in Table 1). However, since they are within the encoder-decoder framework, they also encounter the well noticed repetition issue: due to loss of temporary decoder state, an RNN based decoder may repeat what has already been said during generation (Tu et al., 2016a,b).
Both GenQA and CoreQA are designed to work with a structured KB as the memory, while in most real-world scenarios, we require knowledge from different resources, hence of different formats. This knowledge may come from structured KBs, documents, or even tables. It is admittedly challenging to leverage a heterogeneous memory in a neural generation framework, and it is not well studied in previous works (Miller et al., 2016). Here in our case, the memory should contain two main formats: KB triples and semi-structured en-tities from IE, forming a heterogeneous memory (HM). The former is usually organized in is a subject-predicate-object form, while, the latter is usually extracted from textual documents, in the form of keywords, sometimes associated with certain categories or tags oriented to specific tasks (Bordes and Weston, 2016). Miller et al. (2016) discuss different knowledge representations for a simple factoid QA task and show that classic structured KBs organized in a Key-Value Memory style work the best. However, dealing with heterogeneous memory is not trivial. Figure 1 shows an example of generating answer sentences from HM in a Key-Value style, which is indeed more challenging than only using a classic KB memory. Keys and values play different roles during decoding. A director key indicates this slot contains the answer. Same James Cameron values with different keys indicate duplication. The decoder needs this information to proactively perform memory addressing. Because keys from documents are not canonicalized, e.g., doc directed and doc director, they may lead to redundancy with the structured KB, e.g., kb directed_by and doc director. A decoder could repetitively output a director twice simply because there are two different memory slots hit by the query, both indicating the same director. This will make the the repetition issue even worse.
Although many neural generation systems can produce coherent answer sentences, they often focus on how to guarantee the chosen answer words to appear in the output, while ignoring many related or meaningful background information in the memory that can further improve user experiences. In real-world applications like chatbots or personal assistants, users may want to know not only the exact answer word, but also information related to the answers or the questions. This information is potentially helpful to attract users' attention, and make the output sentences more natural. For example in Table 1 (A4), the extra 1999 not only enriches the answer with the movie's release year, but also can act as a clue to help distinguish ambiguous candidate answers, e.g., Titanic (1999) and Titanic (HD, 2016).
In this paper, we propose a sequence to sequence model tailing for heterogeneous memory. In order to bridge the gap between decoder states and memory heterogeneity, we split decoder states into separate vectors, which can be used to address Figure 1: An example qa-pair with heterogeneous memory different memory components explicitly. To avoid redundancy, we propose the Cumulative Attention mechanism, which uses the context of the decoder history to address the memory, thus reduces repetition at memory addressing time. We conduct experiments on two WikiMovies datasets, and experimental results show that our model is able to generate natural answer sentences composed of extra related facts about the question.

Related Work
Natural Answer Generation with Sequence to Sequence Learning: Sequence to sequence models (with attention) have achieved successful results in many NLP tasks Bahdanau et al., 2014;Vinyals et al., 2015;See et al., 2017). Memory is an effective way to equip seq2seq systems with external information (Weston et al., 2014;Sukhbaatar et al., 2015;Miller et al., 2016;Kumar et al., 2015). GenQA (Yin et al., 2015) applies a seq2seq model to generate natural answer sentences from a knowledge base, and CoreQA (He et al., 2017b) extends it with copying mechanism (Gu et al., 2016). But they do not consider the heterogeneity of the memory, only tackle questions with one single answer word, and do not study information enrichment. Memory and Attention: There are also increasing works focusing on different memory representations and the interaction between the decoder and memory, i.e., attention. Miller et al. (2016) propose the Key-Value style memory to explore textual knowledge (both structured and unstructured) from different sources, but they still utilize them separately, without a uniform addressing and attention mechanism. Daniluk et al. (2017) split the decoder states into key and value representation, and increase language modeling performance. Multiple variants of attention mechanism have also been studied. Sukhbaatar et al. (2015) introduce multi-hop attention, and extend it to convolutional sequence to sequence learning (Gehring et al., 2017). Kumar et al. (2015) further extend it by using a Gated Recurrent Unit (Chung et al., 2014) between hops. These models show that multiple hops may increase the model's ability to reason. These multi-hop attention is performed within a single homogeneous memory. Our Cumulative Attention is inspired by them, but we utilize it cross different memory, hence can explicitly reason over different memory components.
Conditional Sentence Generation: Controllable sentence generation with external information is wildly studied from different views. From the task perspective, Fan et al. (2017) utilize label information for generation, and tackle information coverage in a summarization task. He et al. (2017a) use recursive Network to represent knowledge base, and Bordes and Weston (2016) track generation states and provide information enrichment, both are in a dialog setting. In terms of network architecture, Wen et al. (2015) equip LSTM with a semantic control cell to improve informativeness of generated sentence. Kiddon et al. (2016) propose the neural checklist model to explicitly track what has been mentioned and what left to say by splitting these two into different lists. Our model is related to these models with respect to information representation and challenges from coverage and redundancy. The most closely related one is the checklist model. But it does not explicitly study information redundancy. Also, the information we track is heterogeneous, and we track it in a different way, i.e. using Cumulative attention.
Due to loss of states across time steps, the decoder may generate duplicate outputs. Attempts have been made to address this problem. Some architectures try to utilize History attention records. See et al. (2017) introduce a coverage mechanism, and Paulus et al. (2017) use history attention weights to normalize new attention. Others are featured in network modules. Suzuki and Nagata (2017) estimate the frequency of target words and record the occurrence. Our model shows that simply attending to history decoder states can reduce redundancy. Then we use the context vector of attention to history decoder states to perform attention to the memory. Doing this enables the decoder to correctly decide what to say at mem-ory addressing time, rather than decoding time, thus increasing answer coverage and information enrichment.

Task Definition
Given a question q and a memory M storing related information, our task is to retrieve all the answer words from the memory, generate an answer sentence x, and use the rest information as enrichment.
Answer Coverage is the primary objective of our task. Since many answers contain multiple words, the system needs to cover all the target words.
Information Redundancy is one challenge for this task. It is well noticed that the decoder language model may lose track of its state, thus repeating itself. Also, the decoder needs to reason over the semantic gap between heterogeneous memory slots, figuring out different keys may refer to the same value. These two kinds of redundancy should both be addressed.
Information Enrichment is another challenge. It requires the decoder to interact with the memory effectively and use the right word to enrich the answer.
The tradeoff between redundancy and coverage/enrichment is one of our main considerations. This is because when the decoder generates a word, it either generates a new word or a mentioned word. The more answer words and information enrichment are considered, the more likely the model repeats what it has already generated. We use a key-value memory M to represent the information heterogeneity. In our experiments, we study information from KB, topic words, and words extracted from documents. The memory is formatted as ( are respectively the key embedding and word embedding for the i-th memory slot. The vocabulary for keys V key consists of all predicates in the KB, and all tags we use to classify the value words (e.g: director, actor, or release_year). The vocabulary for values V val consists all related words from web documents, subjects and objects from the KB. This memory is later used in two ways: 1. the decoder uses its previous hidden state to perform attention and generate context vectors. 2. the decoder uses the updated hidden states as pointers (Vinyals et al., 2015) to retrieve the memory and copy the memory contents into the decoder's output.

Decoder with Heterogeneous States
As in the standard encoder-decoder architecture with attention, the word embedding of the decoder's previous time step x t and context vector c t is fed as the input of the next time step, and the hidden state h t is updated then. The initial hidden state is the question embedding concatenated with average memory key and value: where [·, ·] denotes concatenation.
As shown in figure 2, to match the key-value memory representation, we use three linear transformations to convert the decoder's current where the W s are initialized as identity matrix and h ⟨V ⟩ t will be used as pointers to perform attention to memory keys m ⟨K⟩ and values m ⟨V ⟩ , respectively, and forms two distributions: p We use the average of the two as distribution over the memory: p By doing this, we bridge the decoder's semantic space with the memory's semantic space, and explicitly maintains heterogeneity.
The decoder then uses a gating mechanism g = sigmoid(W g h t + b g ) to decide whether the output x t comes from the normal vocabulary or the memory. By mixing p ⟨N ⟩ t and p ⟨M ⟩ t with g, we get the distribution for the next decoder output: The three hs are then recorded as history states for later decoding time steps to perform the selfattention. We will explain this in the next section.

Cumulative Attention
As shown in Figure 3, our Cumulative Attention mechanism is exploited similarly to a multi-hop attention (Sukhbaatar et al., 2015). The difference is that the multi-hop attention uses context vector over one single memory at different hops, while our Cumulative Attention utilizes the context vector to query different memories. As shown in the left part of Figure 3  where c = attn(query, memory) denotes the attention function (Bahdanau et al., 2014), and the decoder's history states are defined as: which is then used together with h ⟨K⟩ and h ⟨V ⟩ to perform attention to m ⟨K⟩ and m ⟨V ⟩ , respectively: ] to form the current input to the decoder. The decoder takes the context vector, the previous output, and the previous state to update its state, then generates a distribution for the next token, as shown in Section 4.1. We use the greedy decoding approach and choose the word with the highest probability as the current output.
For optimization, we jointly optimize the negative log-probability of the output sentence and the cross entropy H for gate g. Since g is the probability about whether the current output comes from the memory or the vocabulary, we can extract the label for g by matching sentence words with the memory. The overall loss function L can be written as: We optimize L with gradient descent based optimizers.

Experiments
Our experiments are designed to answer the following questions: (1) whether our model can properly utilize heterogeneous memories to generate readable answer sentences, (2) whether our model can cover all target answers during generation, (3) whether our model can introduce related knowledge in the output while avoiding repetition.

Datasets
Our task requires a question, and a memory storing all the answer words and related knowledge as input, and produces a natural, readable sentence as the output. Unfortunately, there is no existing dataset that naturally fits to our task. We thus tailor the WikiMovies 1 dataset according to our requirements. This WikiMovies dataset was originally constructed for answering simple factoid questions, using memory networks with different knowledge representations, i.e., structured KB (KB entries in Table 2), raw textual documents (Doc), or processed documents obtained through information extraction (IE), respectively. The first is in the classic subject-predicate-object format. The second contains sentences from Wikipedia and also sentences automatically generated from predefined templates. The third is in the subjectverb-object format, collected by applying off-theshell information extractor to all sentences.  As shown in Table 2, we treat each question in WikiMovies with its original answer (usually one or more words) as a QA pair, and one of the question's supportive sentences (either from Wikipedia or templates) as its goldstandard answer sentence. For each question, the memory will contain all knowledge triples about the question's topic movie from the KB entries, and also include entities and keywords extracted from its IE portion. For each entry in KB entries, we use the predicate as the key and the object as value to construct a new entry in our memory. For those from IE, we keep the extracted tags as the key and entities or other expressions as the value. Given a question, if an entity/expression in the memory is not the answer, it will be treated as information enrichment. According to whether the supportive sentences are generated by predefined templates or not, we split the dataset into WikiMovies-Synthetic and WikiMovies-Wikipedia.
The resulting WikiMovies-Synthetic includes 115 question patterns and 194 answer patterns, covering 10 topics, e.g., director, genre, actor, release year, etc. We follow its original data split, i.e., 47,226 QA-pairs for training, 8,895 for validation and 8,910 for testing.
In WikiMovies-Wikipedia, answer sentences are extracted from Wikipedia, admittedly noisy in nature. Note that there are more than 10K Wikipedia sentences that cannot be paired with any questions. We thus left their questions as blank and treat it as a pure generation task from a given memory, which can be viewed as a form of data augmentation to improve sentence variety. We split WikiMovies-Wikipedia the dataset randomly into 47,309 cases for training, 4,093 for testing and 3,954 for validation. We treat normal words occurring less than 10 times as UNK, and, eventually, have 24,850 normal words and 37,898 entity words. We cut the maximum length of answer sentences to 20, and the maximum memory size to 10, which covers most cases in both synthetic and Wikipedia datasets.

Metrics
We evaluate our answer sentences in terms of answer coverage, information enrichment, and redundancy. For cases with only one answer word, we design C single to indicate the percentage of cases being correctly answered. Cases with more than one answer word are evaluated by C part , i.e., the percentage of answer words covered correctly, and C perf ect is the percentage of cases whose answers are perfectly covered. Here, the definition of coverage is similar in spirit with the conventional recall as both measure how many gold words are included in the output. Specifically, C part is essentially the same as recall with respect to its own cases. Note that perfect coverage is the most difficult, while single coverage is the easiest one. For Enrich, we measure the number of none-answer memory items included in the output. Regarding Redundancy, we calculate the times of repetition for memory values in the answer sentence. We also compute BLEU scores (Papineni et al., 2002) on the WikiMovies-Wikipedia, as an indicator of naturalness, to some extent.

Comparison Models
We compare our full model (HS-CumuAttn) with state-of-the-art answer generation models and constrained sentence generation models. Our first baseline is GenQA (Yin et al., 2015), a standard encoder-decoder model with attention mechanism. We equip it with our Key-Value style heterogeneous memory. We also compare with its two variants. HS-GenQA: we split its decoder state into heterogeneous representations. The other one,   GenQA-AttnHist, is enhanced with a history attention during decoding. CheckList (Kiddon et al., 2016) is the state-ofthe-art model for generating long sentences with large agenda to mention. It keeps words that have been mentioned and words to mention using two separate records, and updates the records dynamically during decoding. To adapt to our task, we modify CheckList with a question encoder and a KV memory.
We also compare with one variant of our own model, HS-AttnHist, which does not benefit from the Cumulative Attention.

Implementation
Our model is implemented with the Tensorflow framework 2 , version 1.2. We use the Adam optimizer (Kingma and Ba, 2014) with its default setting. The embedding dimension is set to be 256, as is the LSTM state size. We set the batch size to 128 and train the model up to 80 epochs.
As mentioned, there is a tradeoff between Coverage/Enrichment and Redundancy. To set up a more fair comparison for different models, we ask the control group to reach a comparable level of Redundancy, i.e., approximately 0.11-0.12 on WikiMovies-Synthetic and 0.26-0.27 on WikiMovies-Wikipedia. Keeping the Redundancy in around the same bucket, we compare their Coverage and Enrichment.

Results and Discussion
Let us first look at the performance on the Synthetic set in Table 3. GenQA is originally proposed to read only one single fact during decoding, so it is not surprising that it has the lowest answer coverage (38.92% C perf ect ) 2 www.tensorflow.org

HS-CumuAttn
Torn Curtain3 is a 19664 American political thriller film directed by Alfred Hitchcock5, starring Paul Newman2 and Julie Andrews0. Table 5: Example sentences generated by different models, where an underlined bold phrase is the value of a memory slot selected from the memory by its corresponding generation model, and its subscript number is the index of this slot in the memory. and information enrichment (0.1535). After splitting the decoder state, HS-GenQA obtains significant improvement in both coverage (50.10% C perf ect ) and enrichment (0.1952). When considering history for attention, GenQA-AttnHist achieves even better coverage ( +3.% in C part and +5% in C perf ect ). By combining these two mechanisms, HS-AttnHist achieves the best perfect coverage, 51.55%. Although CheckList is not originally designed for our task, it still gives a strong performance (50.04% C perf ect and 0.1963 enrichment), at a slightly lower redundancy (0.1176). Finally, our full model, HS-CumuAttn, achieves the best single coverage 98.15%, and comparable partial/perfect coverage, with the lowest redundancy (0.0983). Due to the lower level of redundancy, HS-CumuAttn does not include as much enrichment as other strong models, but still outperforms GenQA.  Table 6: Example answers generated by our model. In an answer sentence, an underlined phrase is the value of a memory slot selected from the memory by our model, and the subscript number is the index of this slot in the memory. Compared to vanilla GenQA, HS-GenQA splits the decoder states, thus improves the decoder's memory addressing process by performing attention separately, leading to improvements in both coverage and enrichment. Improvements of GenQA-AttnHist are of a different rationale. Looking at the history enables the decoder to avoid what are already said. Compared with HS-GenQA, GenQA-AttnHist improves Enrichment by avoiding repetition when introducing related information, while, HS-GenQA improves Enrichment by better memory addressing to select proper slots. Combining the two mechanisms to-gether gives HS-AttnHist the best performance in Enrichment. However, HS-AttnHist still suffers from the repetition issue, to certain extent. Because when choosing memory content, there is no explicit mechanism to help the decoder to avoid repetitions according to the history (left of Figure  4). Therefore, a generated word may still be chosen again at the memory addressing step, leaving all the burden of avoiding repetition to the generation step. Our Cumulative Attention mechanism is designed to utilize the context vector of the history to address the memory, thus helps avoid choosing those already mentioned slots at memory addressing time (right of Figure 4), leading to almost the best coverage with the lowest redundancy. Now we compare the three main models, GenQA, CheckList and our HS-CumuAttn on WikiMovies-Wikipedia (Table 4), which is admittedly more challenging than WikiMovies-Synthetic.
We skip the C single metrics here since most questions in WikiMovies-Wikipedia contain more than one answer word. It is not surprising that CheckList, with a lower redundancy, still outperforms GenQA in almost all metrics, except C perf ect , since CheckList is originally designed to perform well with larger agenda/memory and longer sentences. On the other hand, our model, HS-CumuAttn, achieves the best performance in all metrics. Although the BLEU score is not designed to fully reflect the naturalness, it still indicates that our model can output sentences that share more n-gram snippets with reference sentences and are more similar to those composed by humans. Table 5 provides the system outputs from different models for an example question. We can see that GenQA may lose track of the decoder history, and repeat itself (and and), because there is no explicit mechanism to help avoid repetition. Also, it lacks informativeness and may not utilize other information stored in the memory. CheckList keeps records of what have been said and what are left to mention, thus reaches a good answer coverage. But its decoder is unable to explicitly address separate components within one memory slot, so it may not realize that the two Julie Andrewss are essentially the same person. HS-CumuAttn is able to find all the answer words correctly and also include the director into the sentence. After generating Paul Newman, the Cumulative Attention mechanism enables the model to realize that Paul Newman in slot 2 has been said, and Paul Newman in slot 6 is the same as slot 2, so it should not choose the 6th slot again. Rather it should move to Julie Andrews. Although the decoder may figure out the two Paul Newman are the same during decoding, the Cumulative Attention can explicitly help make the clarification during memory addressing. Intuitively, the attention across memory and history induces a stronger signal for the decoder to gather the right information. Table 6 lists more typical imperfect output from our model. In question 1, there is considerable redundancy in the memory, but our decoder is still able to avoid repeatedly choosing the same entities from difference sources, though it produces a "_UNK" showing a slight incoherence. We think it comes from the gate g as it fails to decide that the current word should come from the memory. In question 2, the model correctly chooses the memory slot, but outputs the word "directed" while the correct word should be "written". This also shows an word choice inconsistency between the language model and the memory retrieval. Question 3 makes the same mistake, where it indeed chooses the right answer, but adds an incorrect word "written". We also observe a pair of additional parentheses, which are often used to accomodate movie tags, but we do not see any tags in this memory, so it has to be left blank. Question 4 shows an incorrect memory retrieval, where the decoder should have chosen slot 5 as the movie name. Question 5 is generally good enough, except the same parenthesis error as in question 4.

Case Study and Error Analysis
It is also interesting to see additional descriptions like "Australian", "supernatural" and "drama" in question 2, 3, and 5, introduced by the language model, rather than the memory. Although our model prevents repetition and obtains general naturalness, it cannot guarantee that the decoder can precisely use the right language to describe the memory information. We see the general readability of these sentences, yet they are still not as good as human composed ones. It is fairly subtle for the decoder to collaborate with the memory in different levels of semantics. The semantic coherency and word choice consistency is still a challenge in natural language generation.

Conclusion and Future Work
In this paper, we propose a novel mechanism within an encoder-decoder framework to enable the decoder to actively interact with a memory by taking its heterogeneity into account. Our solution can read multiple memory slots from different sources, attend across the generated history and the memory to explicitly avoid repetition, and enrich the answer sentences with related information from the memory. In the future, we plan to extend our work through 1) investigating more sophisticated structures in the memory such as knowledge graph, 2) solving more complex questions, such as those involving deep reasoning over multiple facts.