On Generating Characteristic-rich Question Sets for QA Evaluation

We present a semi-automated framework for constructing factoid question answering (QA) datasets, in which an array of question characteristics is formalized, including structure complexity, function, commonness, answer cardinality, and paraphrasing. Instead of collecting questions and manually characterizing them, we employ a reverse procedure: we first generate graph-structured logical forms from a knowledge base, and then convert them into questions. Our work is the first to generate questions with explicitly specified characteristics for QA evaluation. We construct a new QA dataset with over 5,000 logical form-question pairs, associated with answers from the knowledge base, and show that datasets constructed in this way enable fine-grained analyses of QA systems. The dataset can be found at https://github.com/ysu1989/GraphQuestions.


Introduction
Factoid question answering (QA) has gained great attention recently, owing to the fast growth of large knowledge bases (KBs) such as DBpedia (Lehmann et al., 2014) and Freebase (Bollacker et al., 2008), which avail QA systems of comprehensive and precise knowledge of encyclopedic scope (Yahya et al., 2012; Berant et al., 2013; Cai and Yates, 2013; Kwiatkowski et al., 2013; Berant and Liang, 2014; Fader et al., 2014; Reddy et al., 2014; Bao et al., 2014; Zou et al., 2014; Sun et al., 2015; Dong et al., 2015; Yao, 2015). With the blossoming of QA systems, evaluation is becoming an increasingly important problem. QA datasets, consisting of a set of questions with ground-truth answers, are critical both for comparing existing systems and for gaining insights to develop new systems. Questions have rich characteristics, constituting dimensions along which question difficulty varies. Some questions are difficult due to their complex semantic structure ("Who was the coach when Michael Jordan stopped playing for the Chicago Bulls?"), while others may be difficult because they require a precise quantitative analysis over the answer space ("What is the best-selling smartphone in 2015?"). Many other characteristics should be considered too, e.g., what topic a question is about (questions about common topics may be easier to answer) and how many answers there are (it is harder to achieve high recall when there are multiple answers). Worse still, due to the flexibility of natural language, different people often describe the same question in different ways, i.e., paraphrasing. It is important for a QA system to be robust to paraphrasing.
A QA dataset explicitly specifying such question characteristics allows for fine-grained inspection of system performance. However, to the best of our knowledge, none of the existing QA datasets (Voorhees and Tice, 2000; Berant et al., 2013; Cai and Yates, 2013; Lopez et al., 2013; Bordes et al., 2015; Serban et al., 2016) provides question characteristics. In this work, we make the first attempt to generate questions with explicitly specified characteristics, and examine the impact of various question characteristics in QA.
We present a semi-automated framework (Figure 1) to construct QA datasets with characteristic specification from a knowledge base. The framework revolves around an intermediate graph query representation, which helps to formalize question characteristics and collect answers. We first automatically generate graph queries from a knowledge base, and then employ human annotators to convert graph queries into questions.
Automating graph query generation brings with it the challenge of assessing the quality of graph queries and filtering out bad ones. Our framework tackles this challenge by combining structured information in the knowledge base and statistical information from the Web. First, we identify redundant components in a graph query and develop techniques to remove them. Furthermore, based on the frequencies of entities, classes, and relations mined from the Web, we quantify the commonness of a graph query and filter out overly rare ones.
We employ a semi-automated approach for the conversion from graph query to natural language question, which provides two levels of paraphrasing: Common lexical forms of an entity (e.g., "Queen Elizabeth" and "Her Majesty the Queen" for ElizabethII) mined from the Web are used as entity paraphrases, and the remaining parts of a question are paraphrased by annotators. As a result, dozens of paraphrased questions can be produced for a single graph query.
To demonstrate the usefulness of question characteristics in QA evaluation, we construct a new dataset with over 5,000 questions based on Freebase using the proposed framework, and extensively evaluate several QA systems. A number of new findings about system performance and question difficulty are discussed. For example, in contrast to results based on previous QA datasets, we find that semantic parsing in general works better than information extraction on our dataset. Information extraction based QA systems have trouble dealing with questions requiring aggregation or having multiple answers. A holistic understanding of the whole question is often needed for hard questions. The experiments point out an array of issues that future QA systems may need to solve.

Related Work
Early QA research has extensively studied problems like question taxonomy, answer type, and knowledge sources (Burger et al., 2001;Hirschman and Gaizauskas, 2001;Voorhees and Tice, 2000). This work mainly targets factoid questions with one or more answers that are guaranteed to exist in a KB.
A few KB-based QA datasets have been proposed recently. QALD (Lopez et al., 2013) and FREE917 (Cai and Yates, 2013) contain hundreds of hand-crafted questions; QALD also indicates whether a question requires aggregation. Both SIMPLEQUESTIONS (Bordes et al., 2015) and the dataset of Serban et al. (2016) are based on single Freebase triples: the former employs human annotators to formulate questions, while the latter uses a recurrent neural network to formulate them automatically. These datasets feature a large size, but the questions only concern single triples, while our framework can generate questions involving multiple triples and various functions. Wang et al. (2015) generate question-answer pairs for closed domains like basketball. They also first generate logical forms (λ-DCS formulae (Liang, 2013) in their case), and then convert logical forms into questions via crowdsourcing; logical forms are first converted into canonical questions to help crowdsourcing workers. Different from previous works, we put a particular focus on generating questions with diversified characteristics in a systematic way, and on examining the impact of different question characteristics in QA.
Another attractive way to construct QA datasets is to collect questions from search engine logs (Bendersky and Croft, 2009). For example, WEBQUESTIONS (Berant et al., 2013) contains thousands of popular questions from Google search, and Yih et al. (2016) have manually annotated these questions with logical forms. However, automatic characterization of questions is hard, while manual characterization is costly and requires expertise. Moreover, users' search behavior is shaped by search engines (Aula et al., 2010). Due to the inadequacy of current search engines in answering advanced questions, users may adapt themselves accordingly and mostly ask simple questions. Questions collected in this way may therefore still not well reflect the true distribution of user information needs, nor do they fully exploit the potential of KB-based QA. Collecting answers is yet another challenge for this approach: Yih et al. (2016) show that only 66% of the WEBQUESTIONS answers, which were collected via crowdsourcing, are completely correct. On the other hand, although questions generated from a KB may not follow the distribution of user information needs, this approach has the advantage of explicit question characteristics, and enables programmatic configuration of question generation. Also, answer collection is automated, without involving human labor and errors.

Knowledge Base
In this work, we are mainly concerned with knowledge bases storing knowledge about entities and relations in the form of triples (simply knowledge bases hereafter). Suppose E is a set of entities, L a set of literals (I = E ∪ L is also called individuals), C a set of classes, and R a set of directed relations; a knowledge base K then consists of two parts: an ontology O ⊆ C × R × C and a model M ⊆ E × R × (C ∪ E ∪ L). In other words, an ontology specifies classes and relations between classes, and a model consists of facts about individuals. Such knowledge bases can be naturally represented as a directed graph, e.g., Figure 1(a), where literal classes such as Datetime are represented as diamonds, other classes as rounded rectangles, and individuals are shaded. We assume relations are typed, i.e., each relation is associated with a set of domain and range classes; facts of a relation must be compatible with its domain and range constraints. Without loss of generality, we use Freebase (June 2013 version) in this work for compatibility with the to-be-tested QA systems. It has 24K classes, 65K relations, 41M entities, and 596M facts.
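As a minimal illustration, the ontology/model split and the typing constraint above can be sketched as sets of triples (toy data; the class and relation names are illustrative, not actual Freebase identifiers):

```python
# Ontology: (domain class, relation, range class); Model: facts about individuals.
ontology = {
    ("DeceasedPerson", "causeOfDeath", "CauseOfDeath"),
    ("DeceasedPerson", "dateOfDeath", "Datetime"),
}
model = {
    ("GeorgeOrwell", "causeOfDeath", "Tuberculosis"),
    ("GeorgeOrwell", "dateOfDeath", "1950"),
}

domain_range = {"causeOfDeath": ("DeceasedPerson", "CauseOfDeath"),
                "dateOfDeath": ("DeceasedPerson", "Datetime")}
instance_of = {"GeorgeOrwell": "DeceasedPerson",
               "Tuberculosis": "CauseOfDeath",
               "1950": "Datetime"}

def is_typed(fact, domain_range, instance_of):
    """Check a fact against its relation's domain/range constraints."""
    s, r, o = fact
    dom, rng = domain_range[r]
    return instance_of.get(s) == dom and instance_of.get(o) == rng
```

Every fact in the toy model above satisfies its relation's typing constraint, while a fact like (GeorgeOrwell, dateOfDeath, Tuberculosis) would be rejected.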

Graph Query
Motivated by the graph-structured nature of knowledge bases, we adopt a graph-centric approach. We hinge on a formal representation named graph query (e.g., Figure 1(c)), developed on the basis of Yih et al. (2015) and influenced by λ-DCS (Liang, 2013). Syntax. A graph query q is a connected directed graph built on a given knowledge base K. It comprises three kinds of nodes: (1) Question node (double rounded rectangle), a free variable. (2) Ungrounded node (rounded rectangle or diamond), an existentially quantified variable. (3) Grounded node (shaded rounded rectangle or diamond), an individual. In addition, there are functions (shaded circle) such as < and count applied on a node. Nodes are typed, each associated with a class. Nodes are connected by directed edges representing relations. Entities on the grounded nodes are called topic entities. Semantics. Graph query is a strict subset of λ-calculus. For example, the graph query in Figure 1(c) can be written in λ-calculus as follows (the < function introduces an existentially quantified variable):

λx. ∃y. causeOfDeath(x, LungCancer) ∧ dateOfDeath(x, y) ∧ y < 1960.

The answer to a graph query q, denoted q_K, can be easily obtained from K. For example, if K is stored in an RDF triplestore, then q can be automatically converted into a SPARQL query and run against K to get the answer. Unlike tree-structured representations, graph queries are not constrained to be trees, which grants us higher expressivity. For example, linguistic phenomena like anaphora (e.g., Figure 1(d)) become easier to model.
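The conversion from graph query to SPARQL mentioned above can be sketched as a straightforward serialization of edges into triple patterns (a hypothetical helper written for illustration; the relation names and variable conventions are ours, not the authors'):

```python
def graph_query_to_sparql(edges, question_var):
    """Serialize a graph query into a SPARQL SELECT.
    `edges` is a list of (subject, relation, object) triples in which
    variables start with '?'; the question node is the projected variable."""
    patterns = " . ".join(f"{s} <{r}> {o}" for s, r, o in edges)
    return f"SELECT DISTINCT {question_var} WHERE {{ {patterns} }}"

# The query of Figure 1(c), minus the < function, as two triple patterns.
q = graph_query_to_sparql(
    [("?x", "causeOfDeath", "LungCancer"),
     ("?x", "dateOfDeath", "?d")],
    "?x")
```

A real implementation would additionally emit FILTER clauses for functions such as < and count, and fully qualified Freebase URIs instead of bare relation names.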

Automatic Graph Query Generation
Our framework proceeds in two stages: first, generate query templates from a knowledge base, ground the templates to generate graph queries, and collect answers (this section); second, convert the graph queries into natural language questions (next section).
We now describe an algorithm to generate the query template shown in Figure 1(b) (excluding the function for now). For simplicity, we will focus on the case of a single question node. Nevertheless, the proposed framework can be extended to generate graph queries with multiple question nodes. The algorithm takes as input an ontology (Figure 1(a)) and the desired number of edges. All the operations are conducted in a random manner to avoid systematic biases in query generation. The DeceasedPerson class is first selected as the question node. We then iteratively grow it by adding neighboring nodes and edges in the ontology. In each iteration, an existing node is selected, and a new edge, which might introduce a new node, is appended to it. For example, the relation causeOfDeath, whose domain includes DeceasedPerson, is first appended to the question node, and then one of its range classes, CauseOfDeath, is added as a new node. When a node with the class CauseOfDeath already exists, it is possible to add an edge without introducing a new node. The same relation or class can be added multiple times, e.g., "parent of parent".
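The random growth procedure above can be sketched as follows. This is a simplified reimplementation under our own assumptions, not the authors' code; the toy ontology maps each class to its outgoing (relation, range class) pairs:

```python
import random

def grow_template(ontology, question_class, num_edges, seed=0):
    """Grow a query template: start from the question node and append one
    randomly chosen edge per iteration, optionally reusing an existing node
    of the matching class instead of introducing a new one."""
    rng = random.Random(seed)
    nodes = [question_class]            # node 0 is the question node
    edges = []                          # (src_index, relation, dst_index)
    for _ in range(num_edges):
        # Only nodes whose class has outgoing relations can be grown.
        growable = [i for i, c in enumerate(nodes) if ontology.get(c)]
        src = rng.choice(growable)
        rel, range_cls = rng.choice(sorted(ontology[nodes[src]]))
        reusable = [i for i, c in enumerate(nodes) if c == range_cls and i != src]
        if reusable and rng.random() < 0.5:
            dst = rng.choice(reusable)  # add an edge between existing nodes
        else:
            nodes.append(range_cls)     # introduce a new node
            dst = len(nodes) - 1
        edges.append((src, rel, dst))
    return nodes, edges
```

With the DeceasedPerson example, repeated calls with different seeds yield templates such as one causeOfDeath edge plus one dateOfDeath edge, or two edges sharing a reused node.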
Topic entities like LungCancer play an important role in a question. A query template contains some template nodes that can be grounded with different topic entities to generate different graph queries. We randomly choose a few nodes as template nodes. This may cause problems: for example, grounding one node may make some others redundant. We conduct a formal study of this in Section 5.1.
Functions such as counting and comparatives are pervasive in real-life questions, e.g., "how many", "the most recent", and "people older than", but are scarce in existing QA datasets. We incorporate functions as an important question characteristic, and consider nine common functions, grouped into three categories: counting (count), superlative (max, min, argmax, argmin), and comparative (>, ≥, <, ≤). More functions can be incorporated in the future. See Appendix A for examples. We randomly add functions to compatible nodes in query templates. In the running example, the < function imposes the constraint that only people who passed away before a certain date should be considered. Each query will have at most one function.
We then ground the template nodes with individuals to generate graph queries. A grounding is valid if the individuals conform with the classes of the corresponding template nodes, and the resulting answer is not empty. For example, by grounding CauseOfDeath with LungCancer and Datetime with 1960, we get the graph query in Figure 1(c). A query template can yield multiple groundings.
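Grounding can be sketched as enumerating class-conforming assignments and keeping those whose answer is non-empty (a simplified sketch; in the real pipeline the non-emptiness check executes a SPARQL query against the knowledge base):

```python
from itertools import product

def ground_template(template_classes, instances_by_class, has_answer):
    """Enumerate valid groundings of the template nodes.
    `template_classes`: class of each template node to ground;
    `instances_by_class`: candidate individuals per class;
    `has_answer(grounding)`: True iff the grounded query's answer is
    non-empty (stands in for running the query against the KB)."""
    pools = [instances_by_class[c] for c in template_classes]
    return [g for g in product(*pools) if has_answer(g)]
```

For instance, grounding a single CauseOfDeath template node keeps only the individuals that actually yield answers.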
Finally, we convert a graph query into a SPARQL query and execute it using Virtuoso Open-Source 7 to collect answers. We further impose mutual exclusivity in SPARQL queries, that is, the entities on any two nodes in a graph query should be different. Consider the example in Figure 2, which asks for the siblings of Natasha Obama. Without mutual exclusivity, Natasha Obama herself would also be included as an answer, which is not desired.

Query Redundancy and Minimization
Some components (nodes and edges) in a graph query may not effectively impose any constraint on the answer. The query in Figure 3(a) is to "find the US president whose child is Natasha Obama, and Natasha Obama was born on 2001-06-10". Intuitively, the bold-faced clause does not change the answer of the question; correspondingly, the dateOfBirth edge and the date node are redundant. In comparison, removing any component from the query in Figure 3(b) will change the answer. Formally, given a knowledge base K, a component in a graph query q is redundant iff removing it does not change the answer q_K.
Redundancy can be desirable or not. In a question, redundant information may be inserted to reduce ambiguity. In Figure 3(a), if one uses "Natasha" to refer to NatashaObama, ambiguity arises since it may be matched with many other entities; the additional information "who was born on 2001-06-10" then helps. Next we describe an algorithm to remove redundancy from queries. One can choose either to only generate queries with no redundant component, or to intentionally generate redundant queries and test QA systems in the presence of redundancy.
We generate minimal queries, i.e., queries for which there exists no sub-query with the same answer. An important theorem, which we prove in Appendix B, is the equivalence of minimality and non-redundancy: a query is minimal iff it has no redundant component. This yields a simple algorithm for query minimization, which directly detects and removes the redundant components in a query. We first examine every edge (in an arbitrary order), and remove an edge if it is redundant. Redundant nodes then become disconnected from the question node and are thus eliminated. It is easy to prove that the produced query (e.g., Figure 3(b)) is minimal and has the same answer as the original query.
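The minimization algorithm can be sketched as follows, abstracting answer computation behind a callback (a hypothetical helper; in the real pipeline `answer` would execute a SPARQL query, and here the demo uses simple integer constraints in its place):

```python
def minimize(query_edges, answer):
    """Repeatedly drop any edge whose removal leaves the answer unchanged.
    `answer(edges)` evaluates a set of edges against the knowledge base."""
    edges = list(query_edges)
    target = answer(edges)
    i = 0
    while i < len(edges):
        trial = edges[:i] + edges[i + 1:]
        if answer(trial) == target:
            edges = trial          # edge was redundant; drop it
        else:
            i += 1                 # edge is necessary; keep it
    return edges

# Demo: constraints over a toy domain; the third constraint is redundant
# because every candidate already satisfies it.
constraints = [lambda x: x % 2 == 0, lambda x: x < 5, lambda x: x >= 0]

def eval_answer(cs):
    return tuple(x for x in range(10) if all(c(x) for c in cs))

minimal = minimize(constraints, eval_answer)
```

By the equivalence of minimality and non-redundancy stated above, a single pass over the edges suffices to reach a minimal query with an unchanged answer.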

Commonness Checking
We now quantify the commonness of graph queries. The benefits of this study are two-fold. First, it provides a refinement mechanism to filter out overly rare queries. Second, commonness is itself an important question characteristic, and it is interesting to examine its impact on question difficulty. Consider the example in Figure 4, which asks for "the great-great-grandparents of Ernest Solvay". It is minimal and logically plausible. Few users, however, are likely to come up with it: Ernest Solvay is famous for the Solvay Conferences, but few people outside the science community may know him. Although Person and parents are common, asking for great-great-grandparents is quite uncommon.
A query is more common if users would more likely come up with it. We define the commonness of a query q as its probability p(q) of being picked among all possible queries from a knowledge base. The problem then boils down to estimating p(q). It is hard, if not impossible, to exhaust the whole query space. We thus make the following simplification: we break down query commonness by components, assuming mutual independence between components, and omit functions:

p(q) = ∏_{i ∈ I_q} p(i) × ∏_{c ∈ C_q} p(c) × ∏_{r ∈ R_q} p(r),   (1)

where I_q, C_q, R_q are the multisets of individuals, classes, and relations in q, respectively. Repeated components are thus accumulated (cf. Figure 4).
We propose a data-driven method, using statistical information from the Web, to estimate p(i), p(c), and p(r). Other methods like domain-knowledge based estimation are also applicable if available. We start with entity probability p(e) (excluding literals for now). If users mention an entity more frequently, its probability of being observed in a question should be higher. We use a large entity linking dataset, FACC1 (Gabrilovich et al., 2013), which identifies around 10 billion mentions of Freebase entities in over 1 billion web documents; the estimated linking precision and recall are 80-85% and 70-85%, respectively. Let n(e) be the number of mentions of entity e; we define p(e) = n(e) / ∑_{e′∈E} n(e′), and the probability of a class as the aggregate over its entities, p(c) = ∑_{e∈c} n(e) / ∑_{c′∈C} ∑_{e∈c′} n(e). Estimating p(r) requires relation extraction from texts, which is hard. We make the following simplification: if (e1, r, e2) is a fact in the knowledge base, we increase n(r) by 1 whenever e1 and e2 co-occur in a document. This suffices to distinguish common relations from uncommon ones. We then define p(r) = n(r) / ∑_{r′∈R} n(r′). Finally, we use frequency information from the knowledge base to smooth the probabilities, e.g., to avoid zero probabilities. The probabilities of literals are solely determined by frequency information from the knowledge base. Refer to Appendix C for the resulting probability distributions.
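Under the independence assumption of Eq. 1, commonness estimation can be sketched as follows (the counts are made-up toy numbers, purely illustrative):

```python
from math import log10, prod

def commonness(query, counts):
    """Return log10 p(q) under the component-independence assumption:
    p(q) is the product of the normalized frequencies of the query's
    individuals, classes, and relations (functions are omitted).
    `query` lists each component kind as a multiset; `counts` maps each
    component to its (Web-mined) frequency."""
    totals = {kind: sum(counts[kind].values()) for kind in counts}
    p = prod(counts[kind][x] / totals[kind]
             for kind in ("individuals", "classes", "relations")
             for x in query[kind])
    return log10(p)
```

A query over rare components gets a much lower score, which is exactly what the threshold-based filtering above relies on.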

Natural Language Conversion
In order to ensure naturalness and diversity, we employ human annotators to manually convert graph queries into natural language questions. We manage to provide two levels of paraphrasing (Figure 5). Each query is sent to multiple annotators for sentence-level paraphrasing. In addition, we use different lexical forms of an entity mined from FACC1 for entity-level paraphrasing. We provide a ranked list of common lexical forms and the corresponding frequency for each topic entity. For example, the lexical form list for UnitedStatesOfAmerica is "us" (108M), "united states" (44M), "usa" (22M), etc. Finally, graph queries are automatically translated into SPARQL queries to collect answers.
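Entity-level paraphrasing then amounts to substituting the top mined lexical forms for the canonical entity mention (a sketch; the frequency numbers echo the UnitedStatesOfAmerica example above):

```python
def entity_paraphrases(question, entity, lexical_forms, k=3):
    """Generate entity-level paraphrases by swapping in the top-k
    lexical forms mined for a topic entity (ranked by frequency)."""
    return [question.replace(entity, form) for form, _freq in lexical_forms[:k]]

# Ranked (lexical form, Web frequency) list for UnitedStatesOfAmerica.
forms = [("us", 108_000_000), ("united states", 44_000_000), ("usa", 22_000_000)]
```

Combined with the sentence-level paraphrases from annotators, this is how a single graph query fans out into dozens of questions.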
Natural language generation (NLG) (Serban et al., 2016; Dušek and Jurčíček, 2015) would be a good complement to our framework; the combination can lead to a fully-automated pipeline for generating QA datasets. For example, Serban et al. (2016) automatically convert Freebase triples into questions with a neural network. More sophisticated NLG techniques able to handle graph queries involving multiple relations and various functions are an interesting future direction.

Experiments
We have constructed a new QA dataset, named GRAPHQUESTIONS, using the proposed framework, and tested several QA systems to show that it enables fine-grained inspection of QA systems.

Dataset Construction
We first randomly generated a set of minimal graph queries, and removed the ones whose commonness is below a certain threshold. The remaining graph queries were then screened by graduate students, and a canonical question was generated for each query, with each being verified by at least two students. We recruited 160 crowdsourcing workers from Amazon MTurk to generate sentence-level paraphrases of the canonical questions. Trivial paraphrases (e.g., "which city" vs. "what city") were manually removed to retain a high diversity in paraphrasing. At most 3 entity-level paraphrases were used for each sentence-level paraphrase.

Dataset Analysis
GRAPHQUESTIONS contains 500 graph queries, 2,460 sentence-level paraphrases, and 5,166 questions. The dataset presents a high diversity and covers a wide range of domains including People, Astronomy, Medicine, etc. Specifically, it contains 148, 506, 596, 376, and 3,026 distinct domains, classes, relations, topic entities, and words, respectively. We evenly split GRAPHQUESTIONS into a training set and a testing set, with all the paraphrases of the same graph query in the same set.
While there are other question characteristics derivable from graph queries, we will focus on the following ones: structure complexity, function, commonness, paraphrasing, and answer cardinality. We use the number of edges to quantify structure complexity, and limit it to at most 3. Commonness is limited to log10(p(q)) ≥ −40 (cf. Eq. 1). As shown in Section 7.4.2, such questions are already very hard for existing QA systems. Nevertheless, the proposed framework can be used to generate questions with different characteristic distributions. Some statistics are shown in Table 1, and more fine-grained statistics can be found in Appendix D.
Several example questions are shown in Table 2. Sentence-level paraphrasing requires handling both commands (the first example) and "Wh" questions, light verbs ("Who did nine eleven?"), and changes of syntactic structure ("The September 11 attacks were carried out with the involvement of what terrorist organizations?"). Entity-level paraphrasing tests the capability of QA systems on abbreviations ("NYC" for New York City), world knowledge ("Her Majesty the Queen" for ElizabethII), and even common typos ("Shakspeare" for WilliamShakespeare). Numbers and dates are also common, e.g., "Which computer operating system was released on Sept. the 20th, 2008?" We compare several QA datasets constructed from Freebase in Table 3. Datasets focusing on single-relation questions are of a larger scale, but significantly lack question characteristics. Overall, GRAPHQUESTIONS presents the highest diversity in question characteristics.

Setup
We evaluate three QA systems whose source code is publicly available: SEMPRE (Berant et al., 2013), PARASEMPRE (Berant and Liang, 2014), and JACANA. SEMPRE and PARASEMPRE follow the semantic parsing paradigm. SEMPRE conducts bottom-up beam-based parsing on questions to find the best logical form. PARASEMPRE, in a reverse manner, enumerates a set of logical forms, generates a canonical utterance for each logical form, and ranks logical forms according to how well the canonical utterance paraphrases the input question. In contrast, JACANA follows the information extraction paradigm, and builds a classifier to directly predict whether an individual is the answer. They all use Freebase.
The main metric for answer quality is the average F1 score, following Berant and Liang (2014). Because a question can have more than one answer, precision, recall, and F1 scores are first computed on each question and then averaged. When a system generates no response for a question, precision is 1, recall is 0, and F1 is 0. Average runtime is used for efficiency. Results are shown in percentage. Systems are trained on the training set using the suggested configurations (Appendix E). We use Student's t-test at p = 0.05 for significance testing.
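The averaging scheme described above (per-question precision/recall/F1, then macro-average, with the no-response convention) can be sketched as:

```python
def macro_prf(per_question):
    """Average precision/recall/F1 over (gold, predicted) answer-set pairs.
    Scores are computed per question and then averaged; a question with no
    predicted answers gets P=1, R=0, F1=0, per the convention above.
    A sketch of the metric, not the authors' evaluation script."""
    ps, rs, fs = [], [], []
    for gold, pred in per_question:
        gold, pred = set(gold), set(pred)
        if not pred:
            p, r, f = 1.0, 0.0, 0.0
        else:
            p = len(gold & pred) / len(pred)
            r = len(gold & pred) / len(gold)
            f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    n = len(per_question)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```

Note that because F1 is averaged per question, the average F1 is not the harmonic mean of the average precision and recall, a point that matters when reading Table 5.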

Overall Evaluation
Compared with the scores on WEBQUESTIONS (30%-40%), the scores on GRAPHQUESTIONS are lower (Table 4). This is because GRAPHQUESTIONS contains questions over a broader range of difficulty levels. For example, it is more diverse in topics (Appendix D); also, the scores become much closer when excluding paraphrasing (Section 7.4.2).
JACANA achieves a comparable F1 score with SEMPRE and PARASEMPRE on WEBQUESTIONS. On GRAPHQUESTIONS, however, SEMPRE and PARASEMPRE significantly outperform JACANA (both p < 0.0001). The following experiments will give more insight into where the performance difference comes from. On the other hand, JACANA is much faster, showing an advantage of information extraction. The semantic parsing systems spend a lot of time on executing SPARQL queries. Bypassing SPARQL and directly working on the knowledge base may be a promising way to speed up semantic parsing on large knowledge bases.
2 WEBQUESTIONSSP is WEBQUESTIONS with manually annotated logical forms; only questions with a full logical form are included (4,737 of 5,810).

Fine-grained Evaluation
With explicitly specified question characteristics, we are able to further inspect QA systems.
Structure Complexity. We first break down system performance by structure. Answer quality is in general sensitive to the complexity of question structure: as the number of edges increases, the F1 score decreases (Figure 6(a)). The tested systems often fail to take into account auxiliary constraints in a question. For example, for "How many children of Ned Stark were born in Winterfell?", SEMPRE fails to identify the constraint "born in Winterfell", so it also considers Ned Stark's bastard son, Jon Snow, who was not born in Winterfell, as an answer. Answering questions involving multiple relations using large knowledge bases remains an open problem. The large size of knowledge bases prohibits exhaustive search, so smarter algorithms are needed to efficiently prune the answer space. Agenda-based parsing with imitation learning points to an interesting direction for efficient search in the answer space.
Function. In terms of functions, while SEMPRE and PARASEMPRE perform well on count questions, all the tested systems perform poorly on questions with superlatives or comparatives (Figure 6(b)). JACANA has trouble dealing with functions because it does not conduct quantitative analysis over the answer space. SEMPRE and PARASEMPRE do not generate logical forms with superlatives and comparatives, so they cannot answer such questions well.
Commonness. Not surprisingly, more common questions are in general easier to answer (Figure 6(c), where a bin at x = −5 indicates the commonness range −10 ≤ log10(p(q)) < 0). An interesting observation is that SEMPRE's performance gets worse on the most common questions. The cause is likely rooted in how the QA systems construct their candidate answer sets. PARASEMPRE and JACANA exhaustively construct candidate sets, while SEMPRE employs a bottom-up beam search, making it more sensitive to the size of
the candidate answer space. Common entities like UnitedStatesOfAmerica often have a high degree in knowledge bases (e.g., 1 million neighboring entities), which dramatically increases the size of the candidate answer space. During SEMPRE's iterative beam search, many correct logical forms may have fallen off the beam before getting into the final candidate set. We checked the percentage of questions for which the correct logical form is in the final candidate set, and found that it decreased from 19.8% to 16.7% when commonness increased from −15 to −5, providing evidence for this intuition.
Paraphrasing. It is critical for a system to tolerate the wording varieties of users. We make the first effort to evaluate QA systems on paraphrasing. For each system, we rank, in descending order, all the paraphrases derived from the same graph query by the F1 score the system achieves on them, and then compute the average F1 score at each rank. In Figure 6(d), the rate at which a curve decreases thus describes a system's robustness to paraphrasing: the faster the drop, the less robust the system. All the systems achieve a reasonable score on the top-1 paraphrases, i.e., when a system can choose the paraphrase it can best answer, but the F1 scores drop quickly in general. On the fourth-ranked paraphrases, the F1 scores of SEMPRE, PARASEMPRE, and JACANA are respectively only 37.65%, 53.2%, and 36.2% of their scores on the top-1 paraphrases. Leveraging paraphrasing in its model, PARASEMPRE does seem to be more robust. The results show that handling paraphrased questions is still a challenging problem.
Answer Cardinality. SEMPRE and JACANA get a significantly lower F1 score (both p < 0.0001) on multi-answer questions (Table 5), mainly due to a decrease in recall. The decrease of PARASEMPRE is not significant (p = 0.29). The particularly significant decrease of JACANA demonstrates the difficulty of training a classifier that can predict all of the answers correctly; semantic parsing is more robust in this case. The precision of SEMPRE is high because it generates no response for many questions. Note that under the current definition, the average F1 score is not the harmonic mean of the average precision and recall (cf. Section 7.3).

Conclusion
We proposed a framework to generate characteristic-rich questions for question answering (QA) evaluation. Using the proposed framework, we constructed a new and challenging QA dataset, and extensively evaluated several QA systems. The findings point out an array of issues that future QA research may need to solve.