Answering Naturally: Factoid to Full-Length Answer Generation

In recent years, the task of Question Answering over passages, also framed as reading comprehension, has evolved into a very active research area. A reading comprehension system extracts a span of text, comprising named entities, dates, short phrases, etc., which serves as the answer to a given question. However, such spans of text result in an unnatural reading experience in a conversational system. Dialogue systems usually solve this issue by using template-based language generation. These systems, though adequate for domain-specific tasks, are too restrictive and predefined for a domain-independent system. In order to present the user with a more conversational experience, we propose a pointer-generator based full-length answer generator which can be used with most QA systems. Our system generates a full-length answer given a question and the extracted factoid/span answer, without relying on the passage from which the answer was extracted. We also present a dataset of 315,000 question, factoid answer and full-length answer triples. We have evaluated our system using ROUGE-1, ROUGE-2, ROUGE-L and BLEU, achieving a BLEU score of 74.05 and a ROUGE-L score of 86.25.


Introduction
Factoid question answering (QA) is the task of extracting answers to a question from a given passage. These answers are usually short spans of text, such as named entities, dates, etc. Modern factoid QA systems, which use machine-comprehension datasets, predict the answer span from relevant documents using encoder-decoder architectures with co-attention.
Conversely, knowledge-base (KB) oriented QA systems retrieve relevant facts using structured queries or a neural representation of the question. Formulating the retrieved factoid answer into a full-length natural sentence is, hence, a natural extension and post-processing step of any QA system.

System Input:
Question: When were the normans in normandy?
Factoid Answer: 10th and 11th centuries
System Output: During the 10th and 11th centuries, the normans were in normandy.

Table 1: Full-length natural answer generation from the question and the factoid answer
A simple approach to this task might be to use hand-crafted rules to restructure the question into a declarative statement, as described in (Jurafsky and Martin, 2018). However, such rule-based approaches fail when the extracted answer span contains words from the question, or when there are multiple independent clauses and the system has to choose words specific to the question to formulate the answer. This leads to unnatural repetition of words in the full-length answer or grammatically incorrect sentence formulation.
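To make the failure mode concrete, a minimal rule-based restructuring along these lines can be sketched as follows. This is a hypothetical illustration, not the approach described in (Jurafsky and Martin, 2018); the wh-word and auxiliary lists are illustrative.

```python
def rule_based_answer(question: str, answer: str) -> str:
    """Naively restructure a question into a declarative answer by
    dropping a leading wh-word and auxiliary, then appending the factoid."""
    WH_WORDS = {"who", "what", "when", "where", "which", "how", "why"}
    AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did"}
    tokens = question.rstrip("?").split()
    if tokens and tokens[0].lower() in WH_WORDS:
        tokens = tokens[1:]                  # drop the wh-word
    if tokens and tokens[0].lower() in AUXILIARIES:
        tokens = tokens[1:]                  # drop the auxiliary verb
    return " ".join(tokens + ["is", answer]) + "."
```

Running this on the example from Table 1 produces "the normans in normandy is 10th and 11th centuries.", which illustrates exactly the grammatical failures described above: the template always inserts "is", so tense agreement is lost.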
On the other hand, neural-network based approaches in modern dialogue systems use end-to-end encoder-decoder architectures to convert an abstract dialogue action into natural language utterances. Such modern task-oriented dialogue systems usually learn to map dialogue histories to system responses. Non-task-oriented dialogue systems, such as generative systems, can formulate responses not present in the training data but lack the capability to incorporate factual information without external knowledge bases.
Unlike conversational chat-bots, which are designed to mimic human conversation without the need to be factually correct, or task-oriented dialogue systems, which place the retrieved answer in a predefined template, our system automatically generates accurate full-length answers, thereby enhancing the system's usability in these situations. Table 1 shows a sample of our system's input and output. Our system can be used in any such task-specific scenario where natural answers are desired, without being restricted to a limited set of templates.
Our overall research contributions are as follows: • We introduce a system which generates factually correct full-length answers from questions and factoid answers. Our system can be used as a post-processing plugin to any QA system, be it a KB-based or a machine-comprehension-based system, thereby improving the readability of the system output and promoting fluency and variation in natural answer generation.
• We have also released a dataset comprising tuples of questions, factoid answers and full-length answers, which can be further augmented from any other QA dataset using the techniques we describe in section 3.1.

Related Work
There has been a lot of interest recently in QA and task-oriented dialogue systems. End-to-end memory networks (Sukhbaatar et al., 2015) use a language-modelling architecture which learns query embeddings, in addition to input and output memory representations from source sequences, and predicts an answer. Rule-based systems set up a variety of tasks for inferring and answering questions. (Bordes and Weston, 2016) improves on memory networks and handles out-of-vocabulary (OOV) words by inserting special words into the vocabulary for each knowledge-base entity type. These systems are dependent on templates or special heuristics to reproduce facts. We demonstrate through our baseline model that generating template-like sentences from factual input can be achieved only with limited success. Recent works on KB-based end-to-end QA systems such as (Yin et al., 2015; He et al., 2017a; Liu et al., 2018a) generate full-length answers with neural pointer networks (Gülçehre et al., 2016; Vinyals et al., 2015; He et al., 2017b) after retrieving facts from a knowledge base (KB). Dialogue systems such as (Liu et al., 2018b; Lian et al., 2019) extract information from knowledge bases to formulate a response. Systems such as (Fu and Feng, 2018) use a KB-based key-value memory after extracting information from documents or external KBs. However, these systems are restricted to the information modeled by the KB or slot-value memory. Our system is generic and can be used with any knowledge source, structured, such as a knowledge base, or free-form, such as a machine-comprehension dataset. Since our system doesn't use any additional relational information as modelled in a KB, it is invariant to the type of dataset. The pointer-generator network, introduced in (See et al., 2017), is a generative summarization model that can copy out-of-vocabulary (OOV) words from a source sequence. Our work is inspired by the ability of this network to accurately reproduce information from the source.
To the best of our knowledge, there is no existing QA dataset which addresses this task directly. However, a knowledge-based QA dataset such as (Yin et al., 2015) creates a knowledge base from Chinese websites and extracts question-answer pairs from Chinese community-QA webpages. The system built over this dataset is able to generate natural answers to simple questions. The recently released CoQA dataset (Reddy et al., 2018) is an abstractive conversational question answering dataset through which the system generates free-form answers from the whole conversational history using the aforementioned pointer-generator network. While the CoQA challenge extracts free-form text from the passages, our system incorporates the structure of the question to give a full-length sentence as the answer to the given query.

Data
Since there is no available dataset for the task, we used the standard machine-comprehension datasets SQuAD (Rajpurkar et al., 2016) and HarvestingQA (Du and Cardie, 2018) to create auto-annotated data. This provides us with questions and factoid answers, which we use as input to our system. For the ground truth, we automatically extract full-length answers from the passages of these datasets by applying certain heuristics (explained in section 3.1). We extract ∼300,000 samples (question, factoid answer, full-length answer) from SQuAD and HarvestingQA. Additionally, we have manually annotated 15,000 samples from SQuAD, of which 2,500 are used for development, 2,500 for testing, and the remaining 10,000 are added to the auto-annotated data.

Automatic Data Generation
Creating datasets for any new task is a challenge, since modern systems based on neural architectures require a large amount of data to train. To make data creation scalable, most of our training data is automatically generated from SQuAD and HarvestingQA. For each question-answer pair, we automatically extract the target full-length answer from the corresponding passage. We iterate over the sentences in the context passage that contain the factoid answer and select the one that has the highest BLEU score with the question, provided the BLEU score is ≥ 35%. Given the question-answer pair (Q, A) and the passage P, the full-length answer T is the sentence S in the passage:

T = argmax_{S ∈ P, A ∈ S} BLEU(S, Q), subject to BLEU(S, Q) ≥ 35%

Target sentences with a low BLEU score (between 35% and 50%) may not be completely aligned with the question, but they provide sufficient information to train the system to generate full-length sentences containing the factoid answer. As the whole sentence is extracted from the corresponding passage, these samples may also contain additional information from the passage which is not related to the question.
Our method of automatically extracting samples from existing QA datasets is scalable and can be reproduced with any modern QA dataset to generate more samples, augmenting the auto-generated samples extracted from HarvestingQA and SQuAD. Table 2 shows some auto-generated samples from the dataset. Our auto-generated data samples follow a question distribution similar to SQuAD's and are biased towards "what" and "who" questions, as shown in the trigram distribution of the questions in Figure 1.
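The extraction heuristic can be sketched as follows. This is a minimal illustration: a simple unigram-precision score stands in for the sentence-level BLEU used in the paper, and the function names are illustrative.

```python
def overlap_score(candidate: str, reference: str) -> float:
    """Unigram precision (%) of candidate against reference; a crude
    stand-in for the sentence-level BLEU score used in the paper."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return 100.0 * sum(tok in ref for tok in cand) / len(cand) if cand else 0.0

def extract_full_length_answer(sentences, question, factoid, threshold=35.0):
    """Select the passage sentence that contains the factoid answer and
    best matches the question, provided its score clears the threshold."""
    candidates = [s for s in sentences if factoid.lower() in s.lower()]
    if not candidates:
        return None
    best = max(candidates, key=lambda s: overlap_score(s, question))
    return best if overlap_score(best, question) >= threshold else None
```

The threshold filters out sentences that merely mention the factoid answer without addressing the question, which is the role the ≥ 35% BLEU cutoff plays above.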

Manual Data Generation
Question: what is the name of the term that is used in the united states?
Factoid: great plains
Target: the term great plains is used in the united states to describe a sub-section of the even more vast interior plains physiographic division

Question: who is the only country among the united nations security council?
Factoid: germany
Target: germany is the only country among the top five arms exporters that is not a permanent member of the united nations security council.

Question: what lake is now connected to the sea?
Factoid: lake voulismeni
Target: lake voulismeni at the coast, at aghios nikolaos, was formerly a sweetwater lake but is now connected to the sea.

Question: what is a bus driving on this route?
Factoid: the capacity of the lane will be more and will be more and will increase when the traffic level increases
Target: when there is a bus driving on this route, the capacity of the lane will be more and will increase when the traffic level increases.

Table 2: Auto-generated samples from the dataset

The auto-generated samples contain extra information in the ground-truth full-length sentences which is not aligned with the question or the factoid answer. To refine our dataset to be more attuned to the questions, and also to capture the variability humans bring when generating new sentences, we manually annotated 15,000 QA pairs from the SQuAD dataset. We used multiple ways to answer the same question, such as active and passive voice, to incorporate more variation into the target sentences. Apart from generating samples with full-length answers well aligned with the question, we have also chosen complex samples from SQuAD which have long phrasal factoid answers, to add more complexity to the data samples. These samples have sentential factoid answers containing more than one independent clause, which are not present in the ground-truth full-length natural answer.
The inclusion of such examples is to aid the system in learning to choose only the words required to form a syntactically correct answer and to omit other synonymous or superfluous words. Table 3 shows some manually generated samples. The manual samples contain questions more evenly distributed than the auto-generated ones, as shown in Figure 2, which displays the trigram distribution of the questions.

System Architecture
We framed the problem of generating a full-length answer from the question and the factoid answer as a Neural Machine Translation (NMT) task using two approaches. We built a model based on the pointer-generator architecture described in (See et al., 2017), except that we use two encoders on the source side to encode the question and the factoid answer separately, as shown in Figure 3.
Question: How much more were her earnings that the year before?
Factoid: more than double her earnings
Target 1: Her earnings were more than double than that of the year before.
Target 2: She earned more than double her earnings than that of the year before.

Question: How many digital copies of her fifth album did Beyoncé sell in six days?
Factoid 1: one million
Factoid 2: one million digital copies
Target: Beyoncé sold one million digital copies of her fifth album in six days.

Question: How well did Kanye do in high school?
Factoid: A's and B's
Target: Kanye did well in high school by scoring A's and B's.

Question: What do scholars recognize about the life of the Buddha?
Factoid: Most accept that he lived, taught and founded a monastic order
Target: Most scholars recognize and accept that Buddha lived, taught and founded a monastic order.

Question: Where did english and scotch irish descent move to florida from?
Factoid: English descent and americans of scots-irish descent began moving into northern florida from the backwoods of georgia and south carolina
Target: English and Scotch Irish descent moved to Florida from the backwoods of Georgia and South Carolina.

Table 3: Manually generated samples

Let the question be represented by the words Q = {q_1, q_2, ..., q_n} and the factoid answer by the words A = {a_1, a_2, a_3, ..., a_m}.
We encode the question and answer sequences using two 3-layered bidirectional LSTMs which share weights. This produces two sequences of hidden states, h^Q = {h_1^Q, ..., h_n^Q} for the question and h^A = {h_1^A, ..., h_m^A} for the answer. We choose to encode the source sequences separately, since there is no syntactic connection between the question and the factoid answer.

Following the global attention mechanism described in (Luong et al., 2015), a context vector C_t is generated. For each decoder state h_t^T at time t, the alignment score a(h_t^T, h_i^S) with each encoder state h_i^S is calculated as follows:

a(h_t^T, h_i^S) = softmax_i(score(h_t^T, h_i^S)),    C_t = Σ_i a(h_t^T, h_i^S) · h_i^S

The challenge of correctly reproducing factual information in the full-length answer led us to use copy attention from the pointer-generator network, as described in (See et al., 2017). The copy distribution, defined over an extended vocabulary comprising the source words, captures the probability of replicating words from either the question or the answer, whereas the global attention distribution has the ability to generate new words from the vocabulary. The final probability of predicting a word is as follows:

P(W_final) = p_g · P_gen + (1 − p_g) · P_copy        (6)

The gate p_g is learned as

p_g = σ(W_c · C_t + W_x · X_t + b)

where C_t is the context vector and X_t is the input to the decoder. We calculate the copy distribution, a distribution over the source words w ∈ Q ∪ A, as

P_copy(w) = Σ_{i : w_i = w} a(h_t^T, h_i^S)

The final probability of generating a word is as shown in equation 6. For out-of-vocabulary words which are present only in the source (w ∈ (Q ∪ A) and w ∉ V), only P_copy is used to predict the word. These words are usually factual information from the question or answer, such as dates and named entities, and hence need to be copied exactly as they appear in the source sequences. Prepositions, conjunctions and other connectives, such as at, between and in, which help in combining the question and answer sequences, are usually in-vocabulary words not present in the source (w ∉ (Q ∪ A) and w ∈ V), and are predicted with P_gen. For in-vocabulary words which are present in the source (w ∈ (Q ∪ A) and w ∈ V), the final probability of predicting the word uses both terms of equation 6.
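The mixing of the generation and copy distributions at a single decoding step can be sketched as follows. This is a minimal numerical illustration, not the model code; the shapes and ids are illustrative, with OOV source words occupying ids beyond the fixed vocabulary in the extended vocabulary.

```python
def final_distribution(p_vocab, attn, src_ids, p_g, extended_size):
    """Combine generation and copy distributions, pointer-generator style:
    P(w) = p_g * P_gen(w) + (1 - p_g) * P_copy(w).
    p_vocab: softmax over the fixed vocabulary (length V, sums to 1)
    attn:    attention weights over the source positions (sums to 1)
    src_ids: extended-vocabulary id of each source token
    p_g:     scalar generation gate in [0, 1]
    """
    p_final = [0.0] * extended_size
    for wid, p in enumerate(p_vocab):       # generation term over fixed vocab
        p_final[wid] += p_g * p
    for pos, wid in enumerate(src_ids):     # copy term: scatter-add attention
        p_final[wid] += (1.0 - p_g) * attn[pos]
    return p_final
```

Note that a source-only OOV word (an id ≥ V) receives probability mass only from the copy term, while an in-vocabulary word that also appears in the source accumulates mass from both terms, mirroring the three cases described above.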

Experiments
For all our experiments, we used a 6GB Nvidia GTX 1060 GPU. We trained the system with a batch size of 32, a dropout rate of 0.5, an RNN size of 512 and 10,000 decay steps. Since our dataset is small, we shared the vocabulary between source and target. We used pre-trained GloVe embeddings (300 dimensions) to initialize both the encoder and decoder word embeddings. Since our manually created samples are few, we oversampled the manually annotated data 3 times to mitigate any bias introduced by the synthetic dataset. We built our system over the OpenNMT-pytorch code base (Klein et al., 2017). We tested our models independently on both the manual dataset and the auto-created dataset, using 2,500 samples of the manually annotated SQuAD data and 3,284 samples of the auto-generated dataset to evaluate the models' performance. These samples were selected randomly from the respective datasets. To evaluate the effectiveness of the manual data samples, we compared the performance of our 2-encoder pointer-generator network trained on the auto-generated data alone against training on the whole augmented dataset, containing both the manual and auto-generated data. For this comparison, training on the whole augmented data instead of only the manual data is required due to the limited number of samples (15,000) in the manually annotated data.
We have compared our system with a Seq2Seq model with attention, where only the question and the full-length answer are given to the model as source and target, respectively. We mask the factoid answer in the target full-length answer with the string a-n-s-w-e-r. The mask, which acts as a placeholder for the factoid answer, is replaced with the actual factoid answer in a post-processing step. The masking helps the model cope with named entities and other OOV words in the dataset.
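The masking and post-processing steps for this baseline can be sketched as follows (a minimal sketch; function names are illustrative, and it assumes the factoid appears verbatim in the target):

```python
MASK = "a-n-s-w-e-r"

def mask_target(full_answer: str, factoid: str) -> str:
    """Replace the factoid span in the target with the mask token
    before training the Seq2Seq baseline."""
    return full_answer.replace(factoid, MASK)

def unmask_prediction(prediction: str, factoid: str) -> str:
    """Post-processing step: substitute the factoid answer back in
    for the mask in the model's prediction."""
    return prediction.replace(MASK, factoid)
```

Because the factoid never has to be generated token by token, the baseline avoids OOV errors inside the answer span, at the cost of never learning to interleave non-contiguous factoid answers into the sentence.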
We have also performed cross-dataset evaluation on a knowledge-base dataset (Freebase) and a machine-comprehension dataset (NewsQA) to test the generalization capability of our system. We randomly selected 900 samples, comprising questions and object names (factoid answers), from the test samples provided by SimpleQA (Golub and He, 2016), which were extracted from the KB dataset Freebase (Bollacker et al., 2008). We also randomly extracted 500 test samples (questions and factoid answers) from the machine-comprehension dataset NewsQA (Trischler et al., 2017). The system predictions were compared with manually annotated ground-truth full-length answers for these samples.

Model              Training Dataset   Acc
2-Enc Pointer-Gen  Synthetic-only     83.4
2-Enc Pointer-Gen  Augmented          92.8

Table 6: Accuracy scores (in the range 0-100) for the various models

Results
As shown in Tables 4, 5, 6 and 7, augmenting the manually annotated data with the auto-generated data for training leads to significant improvements for the 2-encoder pointer-generator network. We hypothesize that this is due not only to the cleaner samples in the manually annotated data, which do not contain extra unnecessary information, but also to the samples with variations in the factoid and full-length ground truth. The manual data also has long phrasal factoid answers, from which the system has to learn to copy and generate words as needed. Table 7 shows that the pointer-generator system handles tense agreement and the generation of new words. The Seq2Seq model struggles to capture contextual information, resolve anaphora, reproduce factual information and handle out-of-vocabulary words. As shown in Table 7, non-contiguous factoid answers are not interleaved in its full-length sentence predictions as expected. The pointer-generator network is able to handle these issues. The BLEU and ROUGE scores are better on the auto-generated test data, as it lacks the variation and complexity of the full-length answers in the manually created dataset. The validation accuracy of the 2-encoder pointer-generator network on the development dataset also shows significant improvement from the start of training, with the augmented dataset providing a significant increase in accuracy, as shown in Figure 4. The performance of our models on a KB dataset such as SimpleQA and a machine-comprehension dataset such as NewsQA is shown in Table 5. As observed from the BLEU and ROUGE scores, the augmented dataset improves performance across these datasets and provides better generalization capability to the system. Some failure cases of the system can be observed in Table 8.

Conclusion
In this work, we have introduced the task of generating full-length natural answers given the question and the factoid answer. We framed the problem as an NMT task using two different approaches. Our approach uses a 2-encoder pointer-generator model, where the factoid answer along with the question are inputs to the system and the full-length answers are the training targets, and it outperforms the baseline model on both BLEU and ROUGE scores. Additionally, as there were no datasets which directly address this task, we released a new dataset containing tuples of questions, factoid answers and full-length answers, of which 300,000 samples were automatically extracted and 15,000 samples were manually annotated. Our automatic dataset creation approach is scalable and can be used over any other QA dataset to retrieve more samples. We have provided the additional manually annotated clean samples to introduce complexity and variation in the training data. We have performed cross-dataset evaluation by testing on a KB dataset (Freebase) and a machine-comprehension dataset (NewsQA) to test the generalization capability of our system.

Question: what kind of metal is on handful of rain?
Factoid Answer: heavy metal
Target: on handful of rain is heavy metal.
Modified PointerGen: heavy metal is on handful of rain.

Question: Name an actor.
Factoid Answer: Collien Ulmen-Fernandes
Target: collien ulmen-fernandes is an actor.
Modified PointerGen: collien ulmenfernandes.

Question: Will the 10 be punished?
Factoid Answer: no one should
Target: no one should be punished.
Modified PointerGen: the 10 be punished no one should punished.

Question: in which country the construction of the mosque is
Factoid Answer: turkey
Target: the construction of the mosque is in turkey.
Modified PointerGen: in turkey.

Table 8: Failure cases. Example 1 is from the Freebase dataset, where the system confuses the subject and the object. Example 2 is from Freebase and is not present in the training and validation data. Example 3 is from the NewsQA dataset, where the system fails to understand the semantics. Example 4 is from the NewsQA dataset, where the system fails to generate the complete full-length answer.

Future Work
For a deep learning model to generalize well with greater accuracy, a larger dataset comprising a bigger vocabulary and sample size is required. Due to the limited data, even though our system handles tense agreement, there are instances where it fails to predict the correct tense for the verb. We plan to add more variation to the data by annotating additional QA and machine-comprehension datasets. Additionally, there is no explicit co-reference resolution module in our model; further work needs to be done using state-of-the-art architectures which can handle such cases and improve results. Augmenting our full-length natural answer generation system with a question answering module or a knowledge base would provide insights into how the system performs with noisy and incorrect factoid answers. This needs to be explored further.