NAMER: A Node-Based Multitasking Framework for Multi-Hop Knowledge Base Question Answering

We present NAMER, an open-domain Chinese knowledge base question answering system based on a novel node-based framework that better grasps the structural mapping between questions and KB queries by aligning the nodes in a query with their corresponding mentions in the question. Equipped with techniques including data augmentation and multitasking, the proposed framework outperforms the previous SoTA on the CCKS CKBQA dataset. Moreover, we develop a novel data annotation strategy that facilitates the node-to-mention alignment, and we publish a dataset (https://github.com/ridiculouz/CKBQA) annotated with this strategy to promote further research. An online demo of NAMER (http://kbqademo.gstore.cn) is provided to visualize our framework and supply extra information for users; a video illustration (https://youtu.be/yetnVye_hg4) of NAMER is also available.


Introduction
With the rapid popularization of knowledge bases (KB), knowledge base question answering (KBQA) (Unger et al., 2014) has attracted much research effort toward robust systems that simplify users' access to KBs. For any given factoid question in natural language, a KBQA system uses its background KB to find answers. Recently, many SoTA KBQA systems adopt a semantic parsing (Kwiatkowski et al., 2013; Yih et al., 2014) framework, in which they convert the question to a KB query (e.g. SPARQL, Prud'hommeaux, 2008) to get answers.
Since queries are highly structured, a robust KBQA system needs to grasp the structural mapping (Figure 1) between a question and its query. However, most previous works either adopted an end-to-end model that failed to directly use such mappings (Ge et al., 2019; Ji et al.) or devised a template or rule-based pipeline (Hu et al., 2018; Cui et al., 2017) that may lose generality in real-world applications. To preserve generality, Shen et al. (2019) incorporated a pointer generator (See et al., 2017) into the pipeline to learn the mapping between an entity and its mention. Nevertheless, the system failed to utilize the mappings of variables, literals, and types in a query.
Figure 1: Structural mapping between a question and its corresponding SPARQL query.
In this paper, we argue that learning the complete question-query mapping (i.e. the alignments of all nodes to their mentions, as in Figure 1) helps the system achieve better performance. Hence, we supplement an open-domain complex Chinese KBQA dataset with annotations of all node mentions. Based on the additional data, we propose a novel node-based multi-hop KBQA framework that fully grasps the mappings of entities, variables, literals, and types. Unlike prior works, we generate pointers from all query nodes to their mentions to represent the mapping and exploit such mappings in the downstream relation extraction task. We also explore techniques including multitasking to further improve model performance. Based on the framework, we implement a publicly available Chinese KBQA system, NAMER, which lets users query KBs in natural language; this makes KBs more convenient for non-expert users and is thus useful in practice. In short, the contributions of this work are: 1) we propose a novel KBQA framework with a strong ability to grasp structural mapping, and the approach reaches SoTA results on a Chinese KBQA dataset; 2) we present a new data annotation format to better train KBQA models and publish a supplementary dataset in this format to promote future research; and 3) we implement an online demonstration of NAMER that visualizes our framework and helps users explore KBs.
[Figure 2: architecture of the framework. Recoverable panel labels: Node Extraction, Mention Detection, Entity Linking, Relation Extraction, Candidate Relation Generator, Relation Ranking.]

System Overview
This section explains the overall architecture of the proposed framework and the UI of the system. Figure 2 illustrates the architecture of our framework. The framework can be divided into three modules, namely node extraction (NE), query generation (QG) and relation extraction (RE). Given a natural language question, NE extracts the mentions of entities, variables, literals and types and performs entity linking. Meanwhile, QG generates a node sequence (i.e. the vertices in the KB query, see Section 3.1 and 3.2 for more details) corresponding to the given question. Each node generated in QG consists of its type and a pointer to its mention in the input question; when the NE and QG results are fused, this pointer is replaced by the node extracted in NE. At this point, we can generate the vertices of a SPARQL query, i.e. the head and tail of all its triples; to form a complete SPARQL output, RE (Section 3.3) is introduced to decide the edges (i.e. the relations of all triples). For each pair of nodes given by NE+QG, RE takes the raw question and the mentions of the head and tail nodes as inputs to decide the relation between them. Combining all three modules, a SPARQL query is finally composed and sent to a knowledge base to get answers.
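As a rough illustration of the final composition step, the sketch below assembles a SPARQL string from a node sequence (NE+QG output) and one relation per node pair (RE output), using the running Yao Ming example from the Model section. The function name `compose_sparql` is hypothetical, and the sketch assumes every relation points from head to tail; the actual system may also have to handle reversed relations.

```python
from typing import List


def compose_sparql(node_seq: List[str], relations: List[str]) -> str:
    """Assemble a SPARQL string from a node sequence and one relation per pair.

    node_seq[0] is the selected variable; the remaining nodes are read as
    consecutive (head, tail) pairs, one pair per triple.
    Assumption: each relation is directed from head to tail.
    """
    select_var, rest = node_seq[0], node_seq[1:]
    if len(rest) != 2 * len(relations):
        raise ValueError("expected exactly one relation per (head, tail) pair")
    triples = [
        f"{rest[2 * i]} {rel} {rest[2 * i + 1]}."
        for i, rel in enumerate(relations)
    ]
    return f"select {select_var} where {{ {' '.join(triples)} }}"


query = compose_sparql(
    ["?y", "<Yao_Ming>", "?x", "?x", "?y"],
    ["<daughter>", "<place_of_birth>"],
)
print(query)
```

Keeping the vertices and the edges in two separate lists mirrors the division of labour described above: NE and QG fix the vertices, and RE fills in the edges.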

User Interface
An example of the interaction between users and our system is illustrated in Figure 3. With this UI, users can not only consult NAMER to answer their questions but also acquire more information about entities of interest and understand how NAMER composes the generated query.

Model
Consider the question "Where was Yao Ming's daughter born?". The following sections elaborate how each module processes the question to compose the correct SPARQL "select ?y where { <Yao_Ming> <daughter> ?x. ?x <place_of_birth> ?y. }".

Node Extraction (NE)
Figure 3: The user interface of NAMER. By entering a question and setting a few parameters, a user receives the output SPARQL and answer together with intermediate results that visualize our framework. For instance, a user can check "Triples in the SPARQL" for the structure of the generated triples. Clicking the "Show Candidate Relations" button of a triple displays its top-scored candidate relations below; clicking "Show Candidate Entities" displays the scores of the candidate entities in entity linking.
We define nodes as the entities, variables, literals and types in a SPARQL query, namely the entity <Yao_Ming> and the variables ?x and ?y in the case above. The NE module aims to detect the mentions of all nodes in a question, i.e. "Yao Ming", "daughter", and "where" respectively. To achieve this, we use a transformer (Vaswani et al., 2017) encoder with a sequence-tagging head over the tag space {O, Eb, Ei, Vb, Vi, VTb, VTi, Tb, Ti, VLb, VLi, Lb, Li} to tag the question (VL/VT denotes variable-literal/variable-type, since mentions of multiple nodes may overlap). Afterward, NE performs entity linking on the extracted entities via a mention-to-entity dictionary corresponding to the KB. For each entity mention, we select its longest substring that appears in the dictionary and view the entities linked to that substring as candidate entities.
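The longest-substring lookup used for entity linking can be sketched as follows. This is a minimal illustration, not the system's implementation: the function name `link_mention`, the toy dictionary entries, and the tie-breaking rule (leftmost substring wins among equally long matches) are all assumptions.

```python
from typing import Dict, List, Tuple


def link_mention(
    mention: str, mention_to_entities: Dict[str, List[str]]
) -> Tuple[str, List[str]]:
    """Return the longest substring of `mention` found in the dictionary,
    together with its candidate entities, or ("", []) if nothing matches."""
    n = len(mention)
    for length in range(n, 0, -1):  # try longer substrings first
        for start in range(0, n - length + 1):  # assumption: leftmost wins ties
            sub = mention[start:start + length]
            if sub in mention_to_entities:
                return sub, mention_to_entities[sub]
    return "", []


# Hypothetical toy dictionary; a real one would be derived from the KB.
toy_dict = {
    "Yao Ming": ["<Yao_Ming>"],
    "Yao": ["<Yao_(surname)>"],
}

print(link_mention("Mr. Yao Ming", toy_dict))
```

Scanning from the longest length downward means the first hit is already the longest match, so the search can stop immediately.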

Query Generation (QG)
In QG, we want to generate the node sequence of the expected SPARQL, i.e. [?y, <Yao_Ming>, ?x, ?x, ?y] for the instance above (the first node is the selected variable). One direct method is to adopt a decoder that generates this sequence directly. However, as mentioned before, such an approach makes it difficult for models to grasp the query-question mapping. Hence, we adopt a pointer network (See et al., 2017) to generate a sequence of <type (entity, variable, etc.), pointer> pairs that represents the node sequence.
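To make the notion of a node sequence concrete, the sketch below derives it from a SPARQL string of the simple form used in this paper: the selected variable first, then the head and tail of each triple. The function name `node_sequence` is hypothetical, and the parser is deliberately naive; it assumes whitespace-separated triples terminated by a period and no periods inside node names.

```python
import re
from typing import List


def node_sequence(sparql: str) -> List[str]:
    """Extract [select variable, head_1, tail_1, head_2, tail_2, ...].

    Assumption: the query looks like "select ?v where { h r t. h r t. }".
    """
    select_var = re.search(r"select\s+(\?\w+)", sparql, flags=re.IGNORECASE).group(1)
    body = re.search(r"\{(.*)\}", sparql, flags=re.DOTALL).group(1)
    nodes = [select_var]
    for triple in body.split("."):
        parts = triple.split()
        if len(parts) == 3:  # skip the empty remainder after the last period
            head, _relation, tail = parts
            nodes += [head, tail]
    return nodes


print(node_sequence(
    "select ?y where { <Yao_Ming> <daughter> ?x. ?x <place_of_birth> ?y. }"
))
```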
More specifically, the QG model is based on a transformer encoder and decoder. Let $H_E \in \mathbb{R}^{n \times d_h}$ be the encoder output given the question as input, and let $T \in \mathbb{N}^q$ be the node types previously generated by the decoder ($n$ is the question length, $d_h$ is the hidden dimension, and $q$ is the length of the node sequence). At each decoding step, the hidden vector $h_q$ of the current node is generated and fed to an FFN to predict the type of the next node, $T_{next} \in \{E, V, L, T, Start, End\}$. We append $T_{next}$ to $T$ for the next decoding step.
An attention matrix $W_{att} \in \mathbb{R}^{d_h \times d_h}$ is trained to calculate the attention score of each input token, and the pointer $Ptr_{cur}$ is the input token with the maximum score.
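The text does not spell out the exact scoring function, so the sketch below assumes a plain bilinear form, score_i = h_q^T W_att H_E[i], and returns the argmax as the pointer. The function name `pointer` and the toy numbers are hypothetical; pure Python lists stand in for tensors to keep the example dependency-free.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def pointer(h_q: Vector, W_att: Matrix, H_E: Matrix) -> int:
    """Index of the question token with the highest bilinear score.

    Assumption: score_i = h_q^T W_att H_E[i]; the paper only states that a
    trained attention matrix scores each input token.
    """
    d = len(h_q)
    # projected[j] = sum_k h_q[k] * W_att[k][j]
    projected = [sum(h_q[k] * W_att[k][j] for k in range(d)) for j in range(d)]
    scores = [sum(p * x for p, x in zip(projected, token)) for token in H_E]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example with d_h = 2 and n = 3 question tokens.
h_q = [1.0, 0.0]
W_att = [[1.0, 0.0], [0.0, 1.0]]
H_E = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
print(pointer(h_q, W_att, H_E))
```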
When combining NE and QG results, we can replace each pointer with the node it points to given by NE, e.g. replacing <var, 5> with a variable node "?daughter" with mention "daughter". Consequently, the expected node sequence [?where, <Yao_Ming>, ?daughter, ?daughter, ?where] can now be formed.
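The fusion step can be sketched as a simple lookup. The tokenization of the English question, the dictionary layout, and the error handling below are assumptions made for illustration; the source only states that each pointer is replaced by the node that NE extracted at that position.

```python
from typing import Dict, List, Tuple

# Hypothetical NE output for the tokens
# ["Where", "was", "Yao", "Ming", "'s", "daughter", "born", "?"]:
# start-token index of each mention -> (node type, node).
NE_NODES: Dict[int, Tuple[str, str]] = {
    0: ("V", "?where"),
    2: ("E", "<Yao_Ming>"),
    5: ("V", "?daughter"),
}

# Hypothetical QG output: one (node type, pointer) pair per node.
QG_OUTPUT: List[Tuple[str, int]] = [("V", 0), ("E", 2), ("V", 5), ("V", 5), ("V", 0)]


def fuse(
    qg_output: List[Tuple[str, int]], ne_nodes: Dict[int, Tuple[str, str]]
) -> List[str]:
    """Replace every pointer with the NE node whose mention starts there."""
    fused = []
    for node_type, ptr in qg_output:
        if ptr not in ne_nodes:
            raise KeyError(f"pointer {ptr} does not hit any extracted mention")
        ne_type, node = ne_nodes[ptr]
        if ne_type != node_type:
            raise ValueError(f"type mismatch at pointer {ptr}: {node_type} vs {ne_type}")
        fused.append(node)
    return fused


print(fuse(QG_OUTPUT, NE_NODES))
```

How the real system resolves a pointer that misses every mention, or a type disagreement between NE and QG, is not described in the text; raising an error here is only a placeholder.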

Relation Extraction (RE)
The RE module aims to determine the relation of each node pair generated in QG, i.e. the relation <daughter> between <Yao_Ming> and ?x and the relation <place_of_birth> between ?x and ?y in the aforementioned case. We do this by ranking: we first generate candidate relations for each node pair $n_1$ and $n_2$ (see the next paragraph), then concatenate each candidate with the raw question and the mentions of the head and tail nodes to form the model input. This input is encoded by a transformer encoder and converted to a score $S \in [0, 1]$ for the candidate relation. The RE module selects the top-scored candidates of each node pair to form the output SPARQL. More specifically, since relations are directional in the KB, we obtain candidates in both the positive (from $n_1$ to $n_2$) and reversed (from $n_2$ to $n_1$) directions, denoted $R_{pos}$ and $R_{rev}$ respectively. Suppose a positive relation $r^*$ is the correct choice and $q$ is the question. For each $r_1$/$r_2$ in $R_{pos}$/$R_{rev}$ excluding $r^*$, we construct $(q, n_1, n_2, r_1)$/$(q, n_2, n_1, r_2)$ as negative samples and $(q, n_1, n_2, r^*)$ as a positive sample to train our model.
For each node pair, we query the KB to obtain candidate relations. For pairs containing an entity, literal or type (deterministic) node, we view the relations around that node in the KB as candidates; for pairs that consist merely of variables, we trace back the route from these variables to any deterministic node and view the relations k hops away from the deterministic node as candidates. For instance, if three pairs (<Yao_Ming>, ?x), (?x, ?y) and (?y, ?z) are generated, their candidates are 1, 2 and 3 hops away from the entity <Yao_Ming> in the KB respectively.
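The k-hop candidate lookup can be illustrated on a toy in-memory KB. Everything below is an assumption made for the sketch: the triple list, the function name `relations_at_hop`, and the choice to ignore edge direction when walking the graph (the system itself keeps positive and reversed candidates apart and queries a real KB).

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Toy KB for illustration only.
TOY_KB: List[Triple] = [
    ("<Yao_Ming>", "<daughter>", "<Yao_Qinlei>"),
    ("<Yao_Qinlei>", "<place_of_birth>", "<Houston>"),
    ("<Yao_Ming>", "<spouse>", "<Ye_Li>"),
    ("<Houston>", "<located_in>", "<Texas>"),
]


def relations_at_hop(kb: List[Triple], start: str, k: int) -> Set[str]:
    """Relations on edges exactly k hops from `start`, ignoring direction."""
    frontier, seen_nodes = {start}, {start}
    seen_edges: Set[int] = set()
    relations: Set[str] = set()
    for _ in range(k):
        relations, next_frontier = set(), set()
        for idx, (head, rel, tail) in enumerate(kb):
            if idx in seen_edges:
                continue  # edge already used at an earlier hop
            if head in frontier or tail in frontier:
                relations.add(rel)
                seen_edges.add(idx)
                next_frontier.update({head, tail} - seen_nodes)
        seen_nodes |= next_frontier
        frontier = next_frontier
    return relations


for hop in (1, 2, 3):
    print(hop, sorted(relations_at_hop(TOY_KB, "<Yao_Ming>", hop)))
```

With this toy KB, the pairs (<Yao_Ming>, ?x), (?x, ?y) and (?y, ?z) would draw their candidates from hops 1, 2 and 3 respectively.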
Additionally, we propose an augmentation method for training the RE model. Returning to the case above, we also add $(q, n_2, n_1, r_1)$/$(q, n_1, n_2, r_2)$ and $(q, n_2, n_1, r^*)$ to the negative samples during training. Consequently, the model learns the effect of mention order on the prediction, through which it may learn a better scoring policy. See further analysis in Section 4.4.
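The sample construction for RE, with and without the augmentation, can be written down directly from the description. The function name `build_samples`, the tuple layout with a trailing 0/1 label, and the example relations are hypothetical; the gold relation is assumed to hold in the positive direction, as in the text.

```python
from typing import List, Tuple

# (question, first mention, second mention, relation, label)
Sample = Tuple[str, str, str, str, int]


def build_samples(
    q: str, n1: str, n2: str, gold: str,
    r_pos: List[str], r_rev: List[str], augment: bool = True,
) -> List[Sample]:
    """Build RE training samples for one node pair (n1, n2)."""
    samples: List[Sample] = [(q, n1, n2, gold, 1)]  # the only positive sample
    samples += [(q, n1, n2, r, 0) for r in r_pos if r != gold]
    samples += [(q, n2, n1, r, 0) for r in r_rev if r != gold]
    if augment:
        # Same relations with the mention order swapped, all negative.
        samples.append((q, n2, n1, gold, 0))
        samples += [(q, n2, n1, r, 0) for r in r_pos if r != gold]
        samples += [(q, n1, n2, r, 0) for r in r_rev if r != gold]
    return samples


question = "Where was Yao Ming's daughter born?"
plain = build_samples(question, "Yao Ming", "daughter", "<daughter>",
                      ["<daughter>", "<spouse>"], ["<father>"], augment=False)
augmented = build_samples(question, "Yao Ming", "daughter", "<daughter>",
                          ["<daughter>", "<spouse>"], ["<father>"])
print(len(plain), len(augmented))
```

The key augmented sample is the gold relation paired with the swapped mention order and labelled negative, which forces the model to attend to direction rather than to topic alone.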

Multitasking
Since all modules above have an encoder, we can share it across the models in the hope of better comprehension and less error propagation. Let $\mathrm{loss}_{NE}$, $\mathrm{loss}_{type}$, $\mathrm{loss}_{ptr}$ and $\mathrm{loss}_{RE}$ be the losses of NE, QG-type, QG-pointer and RE respectively; we co-train the models by minimizing the weighted sum of all losses:
$$\mathrm{loss} = \gamma \cdot \mathrm{loss}_{NE} + \alpha \cdot \mathrm{loss}_{type} + \beta \cdot \mathrm{loss}_{ptr} + \theta \cdot \mathrm{loss}_{RE}$$
We can also multitask on a subset of modules by setting some of the hyperparameters ($\gamma$, $\alpha$, $\beta$, $\theta$) to zero.
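A minimal sketch of the weighted sum follows, with plain floats standing in for the per-module loss tensors. The function name `total_loss` is hypothetical; the default weights are the NEQG values reported in the Setup paragraph (γ = 1, α = β = 2.5, θ = 0), and the loss values in the example are made up.

```python
def total_loss(
    loss_ne: float, loss_type: float, loss_ptr: float, loss_re: float,
    gamma: float = 1.0, alpha: float = 2.5, beta: float = 2.5, theta: float = 0.0,
) -> float:
    """Weighted sum of the four module losses; a zero weight drops a module."""
    return gamma * loss_ne + alpha * loss_type + beta * loss_ptr + theta * loss_re


# NE+QG co-training (RE excluded because theta = 0).
print(total_loss(1.0, 2.0, 3.0, 4.0))
# RE trained on its own.
print(total_loss(1.0, 2.0, 3.0, 4.0, gamma=0.0, alpha=0.0, beta=0.0, theta=1.0))
```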

Experimental Setup
Dataset. We use the dataset published in the CCKS Chinese KBQA Contest for evaluation. The dataset consists of various Chinese open-domain complex (multi-hop) questions that require deep comprehension of questions and strong generalization ability; its background KB is PKUBASE, a Chinese KB based on Baidu Baike. We follow the original split of 2.2k/0.76k/0.76k train/dev/test data. Note that no information in the dev or test set is used during training.
Annotations. We manually label the mentions of all SPARQL nodes in the question, as required by our framework, in the train and dev sets. When multiple mentions co-refer to a node, all mentions are accepted, but we recommend that annotators choose the more informative one, e.g. for the question "Who is Yao Ming's daughter?" and the SPARQL "select ?x where {<Yao_Ming> <daughter> ?x.}", both "daughter" and "who" refer to ?x, but the former is preferred. When no mention refers to a node, annotators leave the mention as "None". We perform a brief double-check on 420 randomly selected questions, more than 93% of which are annotated correctly. See more details of the annotation process in Ethical Considerations.
Baselines. We compare our results with the top-ranking team "jchl" and with gAnswer (Hu et al., 2018), which reached first place in QALD-9 (Ngomo, 2018). Since the NE and RE modules in gAnswer do not officially support Chinese, we replace them with those in our system. Hence, the evaluated gAnswer can be partly viewed as our system with a rule-based QG module, and its comparison with our system indicates the effectiveness of our generative QG module.
Table 2: Performance details of different multitasking strategies. "Methods" refers to the co-trained modules; "Separate" means no multitasking. The NE metrics are the P/R/F1 of the extracted node list. In QG, EM (exact-match) Acc. and Actual Acc. measure the accuracy of the generated node sequence (the output of NE and QG combined); the former counts a prediction when the generated sequence is identical to the gold sequence, while the latter counts it when the two sequences are semantically equivalent. E.g. when the gold and generated sequences are [?daughter, <Yao_Ming>, ?daughter] and [?who, <Yao_Ming>, ?who], EM Acc. does not count the prediction because the pointer of the variable is wrong, but the two are semantically equal since the name of a variable does not affect query results. For RE, Hit@5 denotes the ratio of node pairs whose gold relation scores among the top 5 of all candidates; MRR is defined in Craswell (2009). Overall F1 is explained in Table 1. We evaluate all NE, QG, and RE-related metrics on the dev set.
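The metrics named in the Table 2 caption can be sketched as below. The function names and the example ranks are hypothetical. Renaming variables in order of first appearance is one simple way to test the equivalence described for Actual Acc.; the evaluation script used in the paper may apply a broader notion of semantic equivalence.

```python
from typing import Dict, List


def canonicalize(node_seq: List[str]) -> List[str]:
    """Rename variables to ?v0, ?v1, ... in order of first appearance."""
    names: Dict[str, str] = {}
    out = []
    for node in node_seq:
        if node.startswith("?"):
            names.setdefault(node, f"?v{len(names)}")
            out.append(names[node])
        else:
            out.append(node)
    return out


def hit_at_k(gold_ranks: List[int], k: int = 5) -> float:
    """Share of node pairs whose gold relation is ranked within the top k."""
    return sum(rank <= k for rank in gold_ranks) / len(gold_ranks)


def mean_reciprocal_rank(gold_ranks: List[int]) -> float:
    """Average of 1 / rank of the gold relation (ranks are 1-based)."""
    return sum(1.0 / rank for rank in gold_ranks) / len(gold_ranks)


gold = ["?daughter", "<Yao_Ming>", "?daughter"]
pred = ["?who", "<Yao_Ming>", "?who"]
print(gold == pred, canonicalize(gold) == canonicalize(pred))

ranks = [1, 2, 6, 4]  # made-up ranks of the gold relation for four node pairs
print(hit_at_k(ranks), mean_reciprocal_rank(ranks))
```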
Setup. We adopt Chinese RoBERTa-large (Cui et al., 2020), released by HFL and available in the transformers library (Wolf et al., 2020), as the encoder and a 6-layer, 8-head transformer as the decoder. For our best results, we co-train the NE and QG models and keep RE as a separate model. For NEQG, we train the encoder and decoder with learning rates of 1e-6 and 4e-6 respectively using an Adam (Kingma and Ba, 2015) optimizer, setting the hyperparameters to γ = 1, α = β = 2.5, θ = 0 and the batch size to 40. For the RE model, we set the learning rate and batch size to 1e-5 and 96 respectively, with γ = α = β = 0, θ = 1. Both models are trained until validation accuracy shows no progress, for at most 10k steps.
Table 1 compares the performance of our system and the baselines on the official F1 metric as well as precision and recall. As shown, our system consistently outperforms the contest winner "jchl" on the dev and test sets and significantly surpasses the modified version of gAnswer, setting a new state-of-the-art performance on the evaluated dataset. We attribute the improvement to the effectiveness of the proposed framework. With the cooperation of NE and QG, NAMER learns the direct mapping between question and query, which lets the models make full use of their supervision signals even with complex questions and insufficient training data, which is exactly the case for the current dataset. Since the evaluated gAnswer can be viewed as replacing QG with a rule-based subgraph matching module, our advantage over it also implies the superiority of a trainable generative module in KBQA which, we speculate, generalizes better to highly diversified questions. Finally, based on NEQG, our RE module can naturally deal with complex multi-hop questions by processing one triple (instead of one question) at a time, resulting in accurate relation scoring for every node pair.

Analysis of Multitasking
In this section, we discuss the impact of different multitasking strategies (Section 3.4) on framework performance. The results of each module and the overall metrics are given in Table 2. Evidently, multitasking NE and QG consistently improves performance over no multitasking; this is probably because the supervision signals shared across NE and QG offer extra information that helps the models better comprehend their tasks. E.g., the supervision in NE tells the QG model the semantics of a pointer (since it provides the node mention of a pointer), which helps QG predict pointers. However, when multitasking all three modules, the performance fails to improve. In detail, although the NE and QG metrics resemble our best results, RE suffers a considerable drop on both metrics. We speculate that the different input formats of NEQG and RE result in different semantic spaces for the tasks, which harms performance when we forcibly co-train them. Nevertheless, multitasking notably reduces the storage cost of our system by sharing one encoder across tasks, which matters for a system in practice.

Analysis of Data Augmentation
A data augmentation technique is introduced in Section 3.3; we inspect its effect in Table 3 and provide further discussion in this section. As illustrated, removing augmentation from RE results in a drop on both the RE metrics and overall performance, indicating the positive effect of augmentation. To explain, we perform a case study on the question "What's the nickname of Tom's elder brother?" (we translate the raw Chinese input to English in this case). Consider the node pair $n_1$ = Tom, $n_2$ = elder brother; we compare the top-scored relation from $n_1$ to $n_2$ and from $n_2$ to $n_1$ given by the model with and without augmentation. As shown in Table 4, the augmented model outputs two antonymous relations in the two directions, while its counterpart makes the same prediction twice. Hence, we argue that the augmented training data help the model to concurrently learn 1) the topic-level relationship between a relation and a node pair in the question and 2) the effect of node direction on the relation (i.e. $r^*$ is the correct choice only from $n_1$ to $n_2$, not conversely). The extra supervision signal enables a deeper comprehension of RE which, in turn, improves model performance. Interestingly, we find a similar discussion in Lan et al. (2019) on the advantage of SOP over NSP, since sentence order provides additional supervision on discourse-level coherence (which largely resembles the coherence between node direction and relation in our case). Thus, we speculate that similar augmentation methods may work in more scenarios in future research.

Related Work
Semantic parsing-based KBQA. A semantic parser in KBQA converts a question to a KB query. Previously, some works (Petrochuk and Zettlemoyer, 2018; Mohammed et al., 2018) focused only on answering one-hop questions. To process multi-hop questions, Cui et al. (2017) proposed a template-based pipeline in which a question is converted to a template to further decide its predicate. Hu et al. (2018)  To help models directly comprehend the structural mapping, Wang et al. adopted a query template generator as well as an entity and relation extractor to represent the mentions of entities and relations; however, they failed to utilize the mentions of variables and literals. Similar to our approach, Shen et al. (2019) used a pointer-generator and entity extractor to grasp the mapping between an entity and its mention, but the mappings of other types of nodes are omitted in their work; also, unlike ours, their mappings do not directly assist the downstream RE task. Different from the above, we propose a framework that grasps the mappings of all node types and uses them to aid downstream tasks.
Public KBQA systems. Prior to our work, several online KBQA systems were available to the public. However, most systems focused on domain-specific KBs, e.g. E-commerce  and food (Haussmann et al., 2019

Conclusion
We present a robust Chinese KBQA system, NAMER, based on a novel node-based multitasking framework. With three cooperative modules, our system grasps the structural mapping between a question and its corresponding query. Hence, NAMER reaches superior performance compared to previous SoTA on an open-domain Chinese complex KBQA dataset. Further experiments also demonstrate the effectiveness of the architecture and the techniques adopted in NAMER. As a system intended for easier access to KB for all users, the UI of NAMER provides not only the answers to a given question but also the query structure accompanied by a series of intermediate results (e.g. candidates & scores), assisting users to visualize our system pipeline and explore more KB information to their interest.
For the future, we will incorporate more visualization functions into NAMER to further reduce the barrier to KB for nonspecialist users. Since extra data annotations are required to support NAMER, we also plan to study the effects of the scale of annotated data on system performance. Moreover, we expect to implement and optimize NAMER in multilingual scenarios.

Ethical Considerations

Data Collection
Annotation Guideline SPARQL queries usually include several triples, restricting the range of target answers. Nodes are defined as the entities, variables, literals, and types in a SPARQL (including the select variable). For instance, the SPARQL select ?x where {<Yao_Ming> <daughter> ?x.} corresponds to the node sequence [?x, <Yao_Ming>, ?x]. Given a natural language question "Who is Yao Ming's daughter?" and its corresponding SPARQL, annotators are asked to annotate the mention span of every node in the question, i.e. "Yao Ming" for <Yao_Ming> and "daughter" for ?x.
Annotation Details The questions were distributed evenly to seven annotators with substantial knowledge of NLP. To ensure that the annotators were comfortable with the task, annotation guidance was given before the task began. After the primary annotation, two annotators double-checked the annotation to ensure consistency. All annotators worked part-time on the task.

System Output
We provide an online Chinese KBQA system as shown in Figure 3. The system uses PKUBASE as its supporting KB and accepts Chinese questions as input. Despite our efforts to eliminate biased and offensive output, NAMER retains the potential to generate answers that may be wrong or offensive. Such failures may be induced by the deficiencies of PKUBASE, the implicit bias of the pretrained model and the limitations of the training data. These are known issues in current state-of-the-art neural network-based language models and automatically constructed knowledge bases. In no case should inappropriate answers generated by NAMER be construed to reflect the views or values of the authors.