Asking Clarification Questions in Knowledge-Based Question Answering

The ability to ask clarification questions is essential for knowledge-based question answering (KBQA) systems, especially for handling ambiguous phenomena. Despite its importance, clarification has not been well explored in current KBQA systems. Further progress requires supervised resources for training and evaluation, and powerful models for clarification-related text understanding and generation. In this paper, we construct a new clarification dataset, CLAQUA, with nearly 40K open-domain examples. The dataset supports three serial tasks: given a question, identify whether clarification is needed; if yes, generate a clarification question; then predict answers based on external user feedback. We provide representative baselines for these tasks and further introduce a coarse-to-fine model for clarification question generation. Experiments show that the proposed model achieves better performance than strong baselines. Further analysis demonstrates that our dataset brings new challenges and that several problems remain unsolved, such as reasonable automatic evaluation metrics for clarification question generation and powerful models for handling entity sparsity.


Introduction
Clarification is an essential ability for knowledge-based question answering, especially when handling ambiguous questions (Demetras et al., 1986). In real-world scenarios, ambiguity is a common phenomenon as many questions are not clearly articulated, e.g., "What are the languages used to create the source code of Midori?" in Figure 1. There are two entities named "Midori" that use different programming languages, which confuses the system. For such ambiguous or confusing questions, it is hard to directly give satisfactory responses unless the system can ask clarification questions to confirm the participant's intention. Therefore, this work explores how to use clarification to improve current KBQA systems.
We introduce an open-domain clarification corpus, CLAQUA, for KBQA. Unlike previous clarification-related datasets with limited annotated examples (De Boni and Manandhar, 2003; Stoyanchev et al., 2014) or restricted to specific domains (Rao and Daumé III, 2018), our dataset covers various domains and supports three tasks. The comparison of our dataset with relevant datasets is shown in Table 1. Our dataset considers two kinds of ambiguity for single-turn and multi-turn questions. In the single-turn case, an entity name refers to multiple possible entities in a knowledge base while the current utterance lacks necessary identifying information. In the multi-turn case, ambiguity mainly comes from omission, where a pronoun refers to multiple possible entities from the previous conversation turn.

Dataset | Domain | Size | Task
De Boni and Manandhar (2003) | Open domain | 253 | Clarification question generation.
Stoyanchev et al. (2014) | Open domain | 794 | Clarification question generation.
— | Movie | 180K | Learning to generate responses based on previous clarification questions and user feedback in dialogue.
Guo et al. (2017) | Synthetic | 100K | Learning to ask clarification questions in reading comprehension.
Rao and Daumé III (2018) | Operating system | 77K | Ranking clarification questions in an online QA forum, StackExchange.
CLAQUA (our dataset) | Open domain | 40K | Clarification in KBQA, supporting clarification identification, clarification question generation, and clarification-based question answering.

Table 1: Comparison of our dataset with relevant datasets.

Unlike the CSQA dataset (Saha et al., 2018), which constructs clarification questions based on predicate-independent templates, our clarification questions are predicate-aware and more diverse. Based on the clarification-raising pipeline, we formulate three tasks in our dataset: clarification identification, clarification question generation, and clarification-based question answering. These tasks can naturally be integrated into existing KBQA systems. Since asking too many clarification questions significantly lowers the conversation quality, we first design the task of identifying whether clarification is needed. If no clarification is needed, systems can respond directly. Otherwise, systems need to generate a clarification question to request more information from the participant. Then, external user feedback is used to predict answers.
Our main contribution is constructing a new clarification dataset for KBQA. To build high-quality resources, we elaborately design a data annotation pipeline, which can be divided into three steps: sub-graph extraction, ambiguous question annotation, and clarification question annotation. We design different annotation interfaces for single-turn and multi-turn cases. We first extract "ambiguous sub-graphs" from a knowledge base as raw materials. To enable systems to perform robustly across domains, the sub-graphs are extracted from an open-domain KB, covering domains like music, TV, book, film, etc. Based on the sub-graphs, annotators are required to write ambiguous questions and clarification questions.
We also contribute by implementing representative neural networks for the three tasks and further developing a new coarse-to-fine model for clarification question generation. Taking multi-turn as an example: in the task of clarification identification, the best-performing system obtains an accuracy of 86.6%. For clarification question generation, our proposed coarse-to-fine model achieves a BLEU score of 45.02, better than strong baseline models. For clarification-based question answering, the best accuracy is 74.7%. We also conduct a detailed analysis of the three tasks and find that our dataset brings new challenges that need to be further explored, like reasonable automatic evaluation metrics and powerful models for handling the sparsity of entities.

Data Collection
The construction process consists of three steps, sub-graph extraction, ambiguous question annotation, and clarification question annotation. We design different annotation interfaces for single-turn and multi-turn cases.

Single-Turn Annotation
Sub-graph Extraction. As shown in Figure 2, we extract ambiguous sub-graphs from an open-domain knowledge base, like Freebase. For simplification, we set the maximum number of ambiguous entities to 2. In single-turn cases, we focus on shared-name ambiguity, where two entities have the same name and there is a lack of necessary distinguishing information. To construct such cases, we extract two entities sharing the same entity name and the same predicate. A predicate represents the relation between two entities. The sub-graphs provide a reference for human annotators to write diverse ambiguous questions based on the shared predicates.

Figure 2: An extracted ambiguous sub-graph in the single-turn case. Two entities share the same name "Midori" and the same predicate "computer.software.language_used".

Shared Name | Midori
Predicate #1 | computer.software.language_used
Predicate Description | A property relating an instance of computer.software to a programming language that was used to create the source code for the software.
Qa | What are the languages used to create the source code of Midori?

Table 2: An ambiguous question annotation example in the single-turn case.

Ambiguous Question Annotation. In this step, the input is a table listing the shared entity name, predicate name, and predicate description. Based on this table, annotators need to write ambiguous questions, e.g., "What are the languages used to create the source code of Midori?". An annotation example is shown in Table 2. For diversity, annotators are encouraged to paraphrase the intention words in the predicate description.
Clarification Question Annotation. Based on the entities and the annotated ambiguous question, annotators are required to summarize distinguishing information and write a multi-choice clarification question with a special marker separating entity information and pattern information, e.g., "When you say the source code language used in the program Midori, are you referring to [web browser Midori] or [operating system Midori]?". Annotators are asked to write predicate-aware clarification questions instead of general questions, like "Which one do you mean, A or B?" or "Do you mean A?" as adopted by Saha et al. (2018). An example is shown in Table 3.
We use multi-choice as our basic type of clarification. There are three possible clarification question types: zero-choice (e.g., "I don't understand, can you provide more details?"), single-choice (e.g., "Do you mean A?"), and multi-choice (e.g., "Which one do you mean, A or B?"). The zero-choice type means that the system does not understand the question and expects more details from the participant. Though simple, it costs more user effort as it pushes the participant to figure out what confuses the system. In comparison, the advantage of the single-choice type lies in the control of conversation direction. However, if the system cannot provide the user-expected choice, the conversation may become longer. As for the multi-choice type, its advantage is that fewer conversation turns are needed to ask for more valuable information, while it requires more annotation work.

Multi-Turn Annotation
Sub-graph Extraction. In multi-turn cases, ambiguity comes from the omission of the target entity name from the previous conversation turn. We first extract two connected entities. If they also share another predicate, we extract all related information as an ambiguous sub-graph, as shown in Figure 3. By asking about the shared predicate, we obtain an ambiguous question.
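The extraction step above can be sketched as a single scan over KB triples. This is a minimal illustration under assumed simplifications (an entity's predicates are taken only from triples where it appears as the subject, and only directly connected pairs are considered); the function name and the toy KB are illustrative, not the paper's actual pipeline or data.

```python
from collections import defaultdict

def ambiguous_subgraphs(triples):
    """Return (e1, e2, connecting_p, shared_p) tuples: e1 and e2 are linked
    by connecting_p and also share shared_p, so asking about shared_p with
    the entity name omitted is ambiguous."""
    subj_preds = defaultdict(set)   # entity -> predicates it has as subject
    links = {}                      # (subject, object) -> connecting predicate
    for s, p, o in triples:
        subj_preds[s].add(p)
        links[(s, o)] = p
    out = []
    for (e1, e2), link_p in links.items():
        # predicates the two entities share, excluding the connecting one
        for p in sorted((subj_preds[e1] & subj_preds[e2]) - {link_p}):
            out.append((e1, e2, link_p, p))
    return out

# Toy KB mirroring the Lineodes example (object values are illustrative).
kb = [
    ("Lineodes caracasia", "biology.higher_classification", "Lineodes"),
    ("Lineodes caracasia", "biology.organism_classification", "Species"),
    ("Lineodes", "biology.organism_classification", "Genus"),
]
pairs = ambiguous_subgraphs(kb)
```

On this toy KB, the two Lineodes entities are linked by the higher-classification predicate and share the organism-classification predicate, so they form one ambiguous sub-graph.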

E1 Name | Lineodes caracasia
E2 Name | Lineodes
Predicate #1 | biology.higher_classification
Predicate Description | A property relating a biology.organism_classification to its higher or parent biology.organism_classification.
Predicate #2 | biology.organism_classification
Predicate Description | A property relating a biology.organism to its biology.organism_classification, the formal or informal taxonomic grouping for the organism.
Qh | What is the higher classification for Lineodes caracasia?
Rh | Lineodes.
Qa | What is the biological classification?

Table 4: An ambiguous question annotation example in the multi-turn case.

Qa | What is the biological classification?
Qc | Are you referring to [Lineodes caracasia] or [Lineodes], when you ask the biological classification?

Table 5: A clarification question annotation example in the multi-turn case.

Ambiguous Question Annotation. An annotation example is shown in Table 4. Based on the two entities in the extracted sub-graph, annotators construct a conversation turn and then write an ambiguous question where the entity name is omitted, e.g., "What is the biological classification?".
Clarification Question Annotation. The annotation guideline of this step is the same as in single-turn annotation. The input includes two candidate entities and the annotated ambiguous question. The output is a clarification question. An annotation example is shown in Table 5.

Tasks
Our dataset aims to enable systems to ask clarification questions in open-domain question answering. Three tasks are designed in this work, including clarification identification, clarification question generation, and clarification-based question answering. Figure 4 shows the distribution of the top-10 domains in our dataset.

Clarification Identification
Clarification identification can be regarded as a classification problem. It takes conversation context and candidate entity information as input. The output is a label identifying whether clarification is needed. Specifically, the input is {Q a , E 1 , E 2 } for single-turn cases or {Q p , R p , Q a , E 1 , E 2 } for multi-turn cases. Q a represents the current question. E 1 and E 2 represent the candidate entities. Q p and R p are the question and response from the previous conversation turn. The output is a binary label from the set Y = {0, 1}, where 1 indicates that the input question is ambiguous and 0 indicates the opposite. As Figure 5 shows, negative examples are annotated in the same way as the positive (ambiguous) examples, but without the clarification-related steps. In their sub-graphs, one of the entities has its own unique predicate. The unique predicates are included in the user questions, like "Where is its mouth place?". As only the river "Nile" has mouth places, this question is unambiguous.
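The input/output contract above can be made concrete with a small sketch. The container class and its field names are hypothetical (not the dataset's actual schema); the [S] and [SEL] separator symbols follow the baseline-model description in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class IdentificationExample:
    q_a: str                       # current, possibly ambiguous question
    entities: List[str]            # textual info for candidates E1 and E2
    q_p: Optional[str] = None      # previous-turn question (multi-turn only)
    r_p: Optional[str] = None      # previous-turn response (multi-turn only)
    label: int = 0                 # 1 = clarification needed, 0 = unambiguous

def build_input(ex: IdentificationExample) -> str:
    """Serialize conversation context and entity information into one token
    sequence: [S] joins entity descriptions, [SEL] separates the two sources."""
    context = " ".join(t for t in (ex.q_p, ex.r_p, ex.q_a) if t)
    return context + " [SEL] " + " [S] ".join(ex.entities)

ex = IdentificationExample(
    q_a="What are the languages used to create the source code of Midori?",
    entities=["Midori: web browser", "Midori: operating system"],
    label=1,
)
serialized = build_input(ex)
```

A single-turn example simply leaves `q_p` and `r_p` unset, so the same serialization covers both cases.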

         Single-Turn            Multi-Turn
         Positive  Negative    Positive  Negative
Train    8,139     6,841       12,173    8,289
Dev      487       422         372       601
Test     637       673         384       444

Table 6: Statistics of the dataset.
The statistics of our dataset are shown in Table 6. It is important to note that three tasks share the same split. Since clarification question generation and clarification-based question answering are built upon ambiguous situations, they only use the positive (ambiguous) data in their training, development, and test sets. For generalization, we add some examples with unseen entities and predicates into the development and test sets.

Clarification Question Generation
This text generation task takes ambiguous context and entity information as input, and then outputs a clarification question. For single-turn cases, the input is {Q a , E 1 , E 2 }. For multi-turn cases, the input is {Q p , R p , Q a , E 1 , E 2 }. In both cases, the output is a clarification question Q c . We use BLEU as the automatic evaluation metric.
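To make the evaluation concrete, here is a deliberately simplified sentence-level BLEU with a brevity penalty. This is a sketch for intuition only, not the paper's actual evaluation script; real evaluations typically use a standard smoothed corpus-level BLEU implementation.

```python
import math
from collections import Counter

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence BLEU: geometric mean of clipped n-gram precisions
    (add-one smoothed), multiplied by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    n_max = min(max_n, len(cand), len(ref))   # skip orders longer than the text
    if n_max == 0:
        return 0.0
    log_prec = 0.0
    for n in range(1, n_max + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clipped overlap: a candidate n-gram counts at most as often as in the reference
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        log_prec += math.log((overlap + 1) / (sum(c_ngrams.values()) + 1)) / n_max
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # penalize short candidates
    return bp * math.exp(log_prec)
```

An exact match scores 1.0, while a candidate sharing no n-grams with the reference scores close to 0, which is the behavior the later discussion of BLEU's limitations relies on.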

Clarification-Based Question Answering
As our dataset focuses on simple questions, this task can be simplified as the combination of entity identification and predicate identification. Entity identification extracts the entity e from the candidate entities based on the current context. Predicate identification chooses the predicate p expressed in the ambiguous question. The extracted entity e and predicate p are combined to query the knowledge triple (e, p, o). The model is correct only when both tasks are successfully completed.
Additional user responses toward clarification questions are also necessary for this task. Due to the strong pattern in responses, we use a template-based method to generate the feedback. We first design four templates: "I mean the first one", "I mean the second one", "I mean [entity name]", and "I mean [entity type] [entity name]". Then, for each clarification example, we randomly select a candidate entity and fill its information into a template as the user response, e.g., "I mean web browser Midori." for the clarification question in Table 3.
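The feedback-generation step can be sketched as follows. This is an assumed reconstruction of the procedure, not the dataset's actual generation script; the function name and entity tuples are illustrative.

```python
import random

# The four response templates described above; {name}/{type} are filled
# with the selected entity's surface name and type.
ORDINAL = ["I mean the first one", "I mean the second one"]
NAMED = ["I mean {name}", "I mean {type} {name}"]

def simulate_feedback(entities, rng=None):
    """entities: list of (type, name) pairs for the two candidates.
    Returns (chosen entity index, simulated user response)."""
    rng = rng or random.Random()
    idx = rng.randrange(len(entities))
    etype, name = entities[idx]
    # ordinal templates only make sense for the matching list position
    if idx < len(ORDINAL) and rng.random() < 0.5:
        return idx, ORDINAL[idx]
    return idx, rng.choice(NAMED).format(name=name, type=etype)

idx, resp = simulate_feedback(
    [("web browser", "Midori"), ("operating system", "Midori")],
    rng=random.Random(0),
)
```

The chosen index doubles as the gold label for the entity-identification sub-task described next.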
For entity identification, we use the selected entity as the gold label. For predicate prediction, we use the predicate in the current ambiguous question as the gold label. In single-turn cases, the predicate label is Predicate #1 as shown in Table 2. In multi-turn cases, the predicate label is Predicate #2 as shown in Table 4.
Entity identification and predicate identification are both classification tasks. They take the same information as input, including ambiguous context, clarification question, and user response.
For single-turn cases, the input is {Q a , E 1 , E 2 , Q c , R c } where R c is the feedback of the participant toward the clarification question Q c . For multi-turn cases, the input is {Q p , R p , Q a , E 1 , E 2 , Q c , R c }. For entity identification, the output is a label from the set Y E = {E 1 , E 2 }. For predicate identification, we randomly choose one negative predicate from the training set and combine it with the gold predicate as the candidate set Y P . The output of predicate identification is the index of the gold predicate in Y P .
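Once the two sub-tasks have produced an entity and a predicate, the answering step reduces to a triple lookup. The sketch below uses a toy dictionary KB with made-up contents; only the (e, p, ?o) lookup pattern reflects the task definition.

```python
# Toy triple store: (entity, predicate) -> object. Values are illustrative.
KB = {
    ("Midori (web browser)", "computer.software.language_used"): "C",
    ("Midori (operating system)", "computer.software.language_used"): "M#",
}

def answer(entity: str, predicate: str):
    """Combine the two classification outputs into a KB lookup (e, p, ?o)."""
    return KB.get((entity, predicate))

def is_correct(pred_e, pred_p, gold_e, gold_p) -> bool:
    """The model is credited only when both sub-tasks succeed."""
    return pred_e == gold_e and pred_p == gold_p
```

This also makes the evaluation criterion explicit: a correct entity with a wrong predicate (or vice versa) counts as a failure.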

Approach
For three tasks, we implement several representative neural networks as baselines. Here we introduce these models in detail.

Clarification Identification Models
The input of the classification models is the ambiguous context and entity information. For simplification, we use a special symbol, [S], to concatenate all the entity information together. These two parts can be regarded as different sources of information. Based on whether the inter-relation between the different sources is explicitly modeled, the classification models can be divided into two categories: unstructured models and structured models.
Unstructured models concatenate the different source inputs into one long sequence with a special separation symbol [SEL]. We implement several widely used sequence classification models, including Convolutional Neural Networks (CNN) (Kim, 2014), Long Short-Term Memory networks (LSTM) (Schuster and Paliwal, 1997), Recurrent Convolutional Neural Networks (RCNN) (Lai et al., 2015), and Transformer. The details of the models are shown in the Supplementary Materials.

Structured models use separate structures to encode the different source information and adopt an additional structure to model the inter-relation of the sources. Specifically, we use two representative neural networks, Hierarchical Attention Network (HAN) (Yang et al., 2016) and Dynamic Memory Network (DMN) (Kumar et al., 2016), as our structured baselines. The details of the models are shown in the Supplementary Materials.

Clarification Question Generation Models
The input of the generation model is the ambiguous context and entity information. In single-turn cases, the ambiguous context is the current question Q a . In multi-turn cases, the input is the current question and the previous conversation turn {Q p , R p , Q a }.
We use [S] to concatenate all entity information together, and use [SEL] to concatenate entity information and context information into a long sequence. We adopt Seq2Seq (Bahdanau et al., 2015) and Transformer (Vaswani et al., 2017) as our baselines and further develop a new generation model. The details of baselines can be found in Supplementary Materials.
Coarse-to-fine Model. Generally speaking, a clarification question contains two parts: entity phrases (e.g., "web browser Midori") and pattern phrases (e.g., "When you say the source code language used in the program Midori, are you referring to [A] or [B]?"). The entity phrase is summarized from the given entity information to distinguish between the two entities. The pattern phrase is used to locate the position of ambiguity, which is closely related to the context. In summary, the two kinds of phrases refer to different source information. Based on this observation, we propose a new coarse-to-fine model, as shown in Figure 6. Similar ideas have been successfully applied to semantic parsing (Dong and Lapata, 2018). The proposed model consists of a template generating module T θ and an entity rendering module R φ . T θ first generates a template containing the pattern phrases and symbolic representations of the entity phrases. The symbolized entities contained in the generated template are then rendered by the entity rendering module R φ to reconstruct the complete entity information. Since the annotated clarification questions explicitly separate entity phrases and pattern phrases, we can easily build training data for these two modules. For clarity, the template is constructed by replacing entity phrases in a clarification question with the special symbols [A] and [B], which represent the positions of the two entities.
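Because the annotated clarification questions bracket the entity phrases, deriving (template, entity phrases) training pairs is a simple string operation. A minimal sketch, assuming the bracketed-span annotation format shown in the earlier examples:

```python
import re

def to_template(clarification_q: str):
    """Split an annotated clarification question into a pattern template and
    its entity phrases: bracketed spans are the entity phrases, and they are
    replaced in order by the placeholders [A] and [B]."""
    entities = re.findall(r"\[([^\]]+)\]", clarification_q)
    template = clarification_q
    for placeholder, phrase in zip(("[A]", "[B]"), entities):
        template = template.replace("[" + phrase + "]", placeholder, 1)
    return template, entities

q = ("When you say the source code language used in the program Midori, "
     "are you referring to [web browser Midori] or [operating system Midori]?")
tmpl, ents = to_template(q)
```

The template string becomes the target of T θ, while the extracted phrases supervise the rendering module R φ.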
The template generating module T θ uses Transformer (Vaswani et al., 2017) as implementation. The input is the ambiguous context and the output is a clarification template.
Similar to T θ , the entity rendering module R φ is also implemented with a Transformer structure. More specifically, for each symbolized entity in the template, we add the hidden state of the decoder of T θ corresponding to this symbolized entity into the decoder input.

Clarification-Based Question Answering Models
This task contains two sub-tasks, entity identification and predicate identification. They share the same input. In single-turn cases, the input is {Q a , Q c , R c , E 1 , E 2 }. In multi-turn cases, the input is {Q p , R p , Q a , E 1 , E 2 , Q c , R c }. Considering that these two sub-tasks are both classification tasks, we use the same models as in the task of clarification identification for implementation.

Experiments
We report the experiment results on the three formulated tasks and give a detailed analysis. More details of the hyper-parameters are in the Supplementary Materials.

Table 7: Results of clarification identification models.

Experiment Results
Clarification Identification. As shown in Table 7, the structured models have the best performance, with an 80.5% accuracy on single-turn cases and an 86.6% accuracy on multi-turn cases. Compared to the best-performing unstructured model, the structured architecture brings obvious performance improvements, with 2.6% and 6.7% accuracy increases. This indicates that the structured architectures have learned the inter-relation between the context and entity details, which is a vital part of reasoning about whether a question is ambiguous. We also find that the models all achieve better results on multi-turn cases. Multi-turn cases additionally contain the previous conversation turn, which can help the models capture more key phrases from the entity information.
Clarification Question Generation. Table 8 shows that Seq2Seq achieves low BLEU scores, which indicates its tendency to generate irrelevant text. Transformer achieves higher performance than Seq2Seq. Our proposed coarse-to-fine model establishes a new state of the art, improving the best baseline results by 3.35 and 0.60 BLEU, respectively. We also find that multi-turn cases generally obtain higher BLEU scores than single-turn cases. In single-turn cases, the same-name characteristic makes it hard to summarize key distinguishing information; the flexible structure of ambiguous questions also makes it hard to define universal rules to identify the asked-about objects. In comparison, in multi-turn cases, the two candidate entities usually have different names and it is easier to generate clarification questions. Table 9 shows an example generated by the proposed model.
Clarification-Based Question Answering. As Table 10 shows, the unstructured models perform better than the structured models, indicating that this task tends to over-fit and that smaller models have better generalization ability.

Qh | What is the higher classification of Lineodes interrupta?
Qa | Biologically speaking, what is the classification?
Output | Are you referring to [Lineodes interrupta] or [Lineodes], when you ask the biological classification?

Table 9: An example of the generated clarification questions.

Discussion
Automatic Evaluation. In the task of clarification question generation, we use BLEU, a widely used metric for language generation tasks, as the automatic evaluation metric. BLEU evaluates the generated text by computing its similarity with the gold text. However, we find that this metric is not suitable for our task. A good clarification question can ask about different aspects as long as it distinguishes the entities. The reference in our dataset is one of several possible answers, not the only correct one, so a good answer may get a low BLEU score. Therefore, a more reasonable evaluation metric needs to be explored in the future, like paraphrase-based BLEU.
Error Analysis. In clarification question generation, although our proposed model achieves the best performance, it still generates some low-quality questions. We conduct a human evaluation on 100 generated clarification questions and summarize two error categories. The first is the entity error, where the generated clarification question has a correct multi-choice structure but contains irrelevant entity information. We present one example in Table 11. In this example, our model generates entity phrases irrelevant to the input ambiguous question. This failure is mainly due to the unseen entity name "Come Back, Little Sheba", which leads the model to wrong generation decisions. Since there are many entities with different names and descriptions, handling such sparse information is a non-trivial problem. The second is the grammar error, where models sometimes generate low-quality phrases after generating a low-frequency entity word. Both kinds of errors can be attributed, to some extent, to the sparsity of entities. How to deal with the sparsity of entity information still needs further exploration.

Related Work
Our work is related to clarification question and question generation.

Clarification Question in Other Tasks
There are several studies on asking clarification questions. Stoyanchev et al. (2014) randomly drop one phrase from a question and require annotators to ask a clarification question toward the dropped information, e.g., "Do you know the birth date of XXX?". However, the small dataset size makes it hard to help downstream tasks. Following this work, Guo et al. (2017) provide a larger synthetic dataset, QRAQ, by replacing some entities with variables. Another line of work replaces entities in questions with misspelled words to build ambiguous questions. Although these are good pioneering studies, the synthetic construction method makes them unnatural and far from real-world questions.
Different from these studies, Braslavski et al. (2017) investigate a community question answering (CQA) dataset and study how to predict the specific subject of a clarification question. Similarly, Rao and Daumé III (2018) focus on learning to rank human-written clarification questions in an online QA forum, StackExchange. Our work differs from these two studies in that we have a knowledge graph at the backend, and the clarification-related components are able to facilitate KBQA.

Question Generation
For different purposes, there are various question generation tasks. Hu et al. (2018) aim to ask questions to play the 20 question game. Dhingra et al. (2017) teach models to ask questions to limit the number of answer candidates in task-oriented dialogues. Ke et al. (2018) train models to ask questions in open-domain conversational systems to better interact with people. Guo et al. (2018) develop a sequence-to-sequence model to generate natural language questions.

Conclusion and Future Work
In this work, we construct a clarification question dataset for KBQA. The dataset supports three tasks: clarification identification, clarification question generation, and clarification-based question answering. We implement representative neural networks as baselines for the three tasks and propose a new generation model. The detailed analysis shows that our dataset brings new challenges; more powerful models and more reasonable evaluation metrics need to be further explored.
In the future, we plan to improve our dataset by including more complex ambiguous situations for both single-turn and multi-turn questions, such as multi-hop questions, aggregation questions, etc. We also plan to integrate the clarification-based models into existing KBQA systems and study how to iteratively improve the models based on human feedback.

A.1 Clarification Identification Models

Table 12 shows the detailed hyper-parameter settings of the classification models. Here we introduce the classification models used in our experiments.
CNN. It first feeds the input embedding sequence into a convolutional layer. The convolutional layer contains K filters f ∈ R^{s×d}, where s is the filter size and d is the dimension of the word embeddings. We use different filter sizes to get rich features. Then, a max-pooling layer takes the K vectors as input and generates a distributed input representation h. Finally, the vector h is projected to the probability of labels.
RNN. It is built on a traditional bi-directional gated recurrent unit (GRU) structure, which is used to capture global and local dependencies inside the input sequence. The last output of the GRU is then fed into the output layer to generate the probability of labels.

Transformer. It encodes the input with a special attention mechanism called multi-head attention. The model is composed of a stack of 2 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention layer, and the second is a position-wise fully connected feed-forward layer. Since the input nodes are not sequentially ordered, the positional embedding layer is removed from the model. The decoder is also composed of a stack of 2 identical layers.
HAN. It is built upon two self-attention layers. The first layer is responsible for encoding context information and entity information. Then, the second layer uses a self-attention mechanism to learn the inter-relations between them. Finally, the output of the second layer is fed into the output layer for predicting labels.
DMN. This model regards the context information as the query and the entity information as the input to learn their inter-relations. It consists of four modules: an input module, a question module, an episodic memory module, and an answer module. The input module and the question module encode the query and input into distributed vector representations. The episodic memory module decides which parts of the entity information to focus on through the attention mechanism. Finally, the answer module generates an answer based on the final memory vector of the memory module.

A.2 Clarification Question Generation Models
Here is the detailed introduction about the generation baselines.
Seq2Seq (Bahdanau et al., 2015). This model is based on a traditional encoder-decoder framework which encodes the input sequence into a dense vector and then decodes the target sequence word by word. Attention is used to capture global dependencies between input and output.
Transformer (Vaswani et al., 2017). It is based solely on attention mechanisms to capture global dependencies. Similar to Seq2Seq, it uses an encoder-decoder framework but with a different implementation. Table 13 shows the detailed hyper-parameter settings of the generation models.