Lifelong and Interactive Learning of Factual Knowledge in Dialogues

Dialogue systems increasingly use knowledge bases (KBs) storing real-world facts to help generate quality responses. However, because KBs are inherently incomplete and remain fixed during conversation, dialogue systems are limited in their ability to answer questions, in particular questions involving entities or relations that are not in the KB. In this paper, we propose an engine for Continuous and Interactive Learning of Knowledge (CILK) that gives dialogue systems the ability to continuously and interactively learn and infer new knowledge during conversations. With more knowledge accumulated over time, they will be able to learn better and answer more questions. Our empirical evaluation shows that CILK is promising.


Introduction
Dialogue systems, including question-answering (QA) systems, are now commonly used in practice.
Early systems of this kind were built mainly on rules and information retrieval techniques (Banchs and Li, 2012; Ameixa et al., 2014). Recent deep learning models (Vinyals and Le, 2015; Xing et al., 2017; Li et al., 2017c) learn from large corpora.
However, since they do not use explicit knowledge bases (KBs), they often suffer from generic and dull responses (Xing et al., 2017). KBs have been used to deal with the problem (Ghazvininejad et al., 2018; Le et al., 2016; Long et al., 2017). Many task-oriented dialogue systems (Eric and Manning, 2017; Madotto et al., 2018) also use KBs to support information-seeking conversations.
One major shortcoming of existing systems that use KBs is that the KBs are fixed once the dialogue systems are deployed. However, it is almost impossible for the initial KBs to contain all the knowledge that users may ask about, not to mention that new knowledge appears constantly. It is thus highly desirable for dialogue systems to learn by themselves while in use, i.e., learning on the job in lifelong learning. Clearly, the system can (1) extract more knowledge from the Web or other sources, and (2) learn directly from users during conversations. This paper focuses on the latter and proposes an engine for Continuous and Interactive Learning of Knowledge (CILK) to give the dialogue system the ability to acquire/learn new knowledge from the user during conversation. Specifically, it focuses on learning new knowledge interactively from the user when the system is unable to answer a user's WH-question. The acquired knowledge makes the system better able to answer future user questions, so that it is no longer limited by the fixed knowledge provided by the human developers.
The type of knowledge that the CILK engine focuses on is facts that can be expressed as triples (h, r, t), meaning that the head entity h and the tail entity t are linked by the relation r. An example of a fact is (Boston, LocatedInCountry, USA), meaning that Boston is located in the USA. This paper only develops the core engine. It does not study other dialogue functions like response generation, semantic parsing, fact extraction from user utterances, entity linking, etc., which have been studied extensively before and are assumed to be available for use. Thus, this paper works only with structured queries (h, r, ?), e.g., (Boston, LocatedInCountry, ?) meaning "In what country is Boston located?", or (?, r, t), e.g., (?, PresidentOf, USA) meaning "Who is the President of the USA?" It assumes that a semantic parser is available that can convert natural language queries from users into query triples. Similarly, it assumes that an information extraction tool like OpenIE (Angeli et al., 2015) is employed to extract facts as triples (h, r, t) from the user's utterances during conversation. Building a full-fledged dialogue system that can also learn during conversation is a huge undertaking and is out of the scope of this paper. We thus only investigate the core knowledge learning engine here. We also assume that the user has good intentions (i.e., the user answers questions with 100% conformity about the veracity of his/her facts)¹ but is not omniscient (as opposed to the teacher-student learning setup).
Problem Definition: Given a user query / question (h, r, ?) [or (?, r, t)], where r and h (or t) may not be in the KB (i.e., unknown), our goal is two-fold: (i) answering the user query, or rejecting it (leaving it unanswered) when the correct answer is believed not to exist in the KB, and (ii) learning / acquiring some knowledge (facts) from the user to help the answering task. We only focus on the setting where the query cannot be answered directly with the current KB and needs inference over existing facts, since for a structured query it is trivial to retrieve the answer if the answer triple is already in the KB. We further distinguish two types of queries: (1) closed-world queries, where h (or t) and r are known to the KB, and (2) open-world queries, where either one or both of h (or t) and r are unknown to the KB.
It is easy to see that the problem is essentially a lifelong learning problem, where each query to be processed is a task and the knowledge gained is retained in the KB. To process a new query/task, the knowledge learned and accumulated from past queries can be leveraged.
For each new open-world query, the proposed approach works in two steps:
Step 1 - Interact with the user: It converts open-world queries (2) to closed-world queries (1) by asking the user questions related to h (or t) and r to make them known to the KB (i.e., added to the KB). The reason for the conversion will become clear below. The user's answers, called supporting facts (SFs), are the new knowledge to be added to the KB. This step is also called interactive knowledge learning. Note that closed-world queries (1) do not need this step.
Step 2 - Infer the query answer: It solves closed-world queries (1) by inferring the query answer. The main idea is to use each entity e in the KB to form a candidate triple (h, r, e) (or (e, r, t)), which is then scored. The entity e with the highest score is predicted as the answer to the query.
¹ We envision that the proposed engine is incorporated into a dialogue system in a multi-user environment. The system can perform cross-verification with other users by asking them whether the knowledge (facts) from a user is correct.
Scoring each candidate is modeled as a knowledge base completion (KBC) problem (Lao and Cohen, 2010; Bordes et al., 2011). KBC aims to infer new facts (knowledge) from existing facts in a KB and is defined as a link prediction problem: Given a query triple (e, r, ?) [or (?, r, e)], it predicts a tail entity t_true [head entity h_true] which makes the query triple true and thus should be added to the KB. KBC makes the closed-world assumption that h, r and t are all known to exist in the KB (Lao et al., 2011; Bordes et al., 2011, 2013; Nickel et al., 2015). This is not suitable for knowledge learning in conversations because in a conversation, the user can ask or say anything, which may contain entities and relations that are not in the KB. CILK removes the closed-world assumption and allows h (or t) and/or r to be unknown (not in the KB).
Step 1 above basically asks the user questions to make h (or t) and/or r known to the KB. Then, an existing KBC model can be applied as a query inference model to retrieve an answer entity from the KB. Figure 1 shows an example. CILK acquires supporting facts SF1 and SF2 to accomplish the goal of knowledge learning and utilizes these pieces of knowledge along with existing KB facts to answer the user query (i.e., to infer over the query relation "LocatedInCountry"). CILK aims to achieve these two sub-goals. The new knowledge (SFs) is added to the KB for future use². We evaluate CILK using two real-world KBs, Nell and WordNet, and obtain promising results.

Related Work
To the best of our knowledge, no existing system can perform the proposed task. We reported preliminary research in (Mazumder et al., 2018).
CILK is related to interactive language learning (Wang et al., 2016), which is mainly about language grounding, not about knowledge learning. Li et al. (2017a,b) and Zhang et al. (2017) train chatbots using human teachers who can ask and answer the chatbot questions. Ono et al. (2017), Otsuka et al. (2013) and Ono et al. (2016) allow a system to ask the user whether its prediction of the category of a term is correct. Compared to these works, CILK performs interactive knowledge learning and inference (over existing and acquired knowledge) while conversing with users after the dialogue system has been deployed (i.e., learning on the job) without any teacher supervision or help.
NELL (Mitchell et al., 2015) updates its KB using facts extracted from the Web, which is complementary to our work. We do not do Web fact extraction.
KB completion (KBC) has been studied in recent years (Lao et al., 2011; Bordes et al., 2011, 2015; Mazumder and Liu, 2017), but these methods mainly handle facts with known entities and relations. Neelakantan et al. (2015) work on fixed unknown relations with known embeddings, but do not allow unknown entities. Xiong et al. (2018) also deal with queries involving unknown relations, but with entities known to the KB. Shi and Weninger (2018) handle unknown entities by exploiting an external text corpus. None of the KBC methods perform conversational knowledge learning like CILK.

Proposed Technique
As discussed in Sec. 1, given a query (e, r, ?) [or (?, r, e)]³ from the user, CILK interacts with the user to acquire supporting facts to answer the query. Such an interactive knowledge learning and inference task is realized by the cooperation of three primary components of CILK: the knowledge base (KB) K, the Interaction Module I and the Inference Model M. The interaction module I decides whether to ask or not and formulates questions to ask the user for supporting facts. The acquired supporting facts are added to the KB K and used in training the Inference Model M, which then performs inference over the query (i.e., answers the query).
In the following subsections, we formalize the interactive knowledge learning problem (Sec. 3.1), describe the Inference Model M (Sec. 3.2) and discuss how CILK interacts and processes a query from the user (Sec. 3.3).
² The inferred query answer is not added to the KB as it may be incorrect. But it can be added in a multi-user environment through cross-verification (see footnote 1 and Sec. 4).
³ Either e or r or both may not exist in the KB.

Problem Formulation
Let the KB K be a triple store {(h, r, t)} ⊆ E × R × E, where E is the entity set and R is the relation set. Let q be a query of the form (e, r, ?) [or (?, r, e)] issued to CILK, where e is termed the query entity and r the query relation. If e ∉ E and/or r ∉ R (we also say e, r ∉ K), we call q an open-world query. Otherwise, q is referred to as a closed-world query, i.e., both e and r exist in K. Given K and a query q, the query inference task is defined as follows: If q is of the form (e, r, ?), the goal is to predict a tail entity t_true ∈ E such that (e, r, t_true) holds. We call such q a tail query. If q is of the form (?, r, e), the goal is to predict a head entity h_true ∈ E such that (h_true, r, e) holds. We call such q a head query. In the open-world setting, it is quite possible that the answer entity t_true (for a tail query) or h_true (for a head query) does not exist in the KB (in E). In such cases, the inference model M cannot find the true answer. We thus further extend the goal of the query inference task to either finding the answer entity t_true (h_true) for q or rejecting q to indicate that the answer does not exist in E.
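To make the open-world / closed-world distinction concrete, the following is a minimal sketch in Python. The `Query` type, the helper names and the toy KB are illustrative assumptions, not part of CILK's implementation; the KB is simply modeled as a set of (h, r, t) tuples.

```python
from typing import NamedTuple, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class Query(NamedTuple):
    """A structured query; exactly one of head/tail is None (the '?' slot)."""
    head: Optional[str]
    relation: str
    tail: Optional[str]


def query_entity(q: Query) -> str:
    """Return the query entity e, i.e. the slot that is not '?'."""
    if (q.head is None) == (q.tail is None):
        raise ValueError("exactly one of head/tail must be unknown")
    return q.tail if q.head is None else q.head


def is_open_world(q: Query, kb: Set[Triple]) -> bool:
    """True if the query entity or the query relation does not occur in the KB."""
    entities = {h for h, _, _ in kb} | {t for _, _, t in kb}
    relations = {r for _, r, _ in kb}
    return query_entity(q) not in entities or q.relation not in relations


# Toy KB (hypothetical contents, for illustration only).
kb = {("London", "LocatedInCountry", "UK"),
      ("Harvard University", "UniversityLocatedIn", "Boston")}
tail_q = Query("Boston", "LocatedInCountry", None)   # (Boston, LocatedInCountry, ?)
head_q = Query(None, "PresidentOf", "USA")           # (?, PresidentOf, USA)
```

With this toy KB, `tail_q` is a closed-world tail query (both "Boston" and "LocatedInCountry" occur in the KB), while `head_q` is an open-world head query because neither "USA" nor "PresidentOf" is known.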
Given an open-world (head / tail) query q from user u, CILK interacts with u to acquire a set of supporting facts (SFs) [i.e., a set of clue triples C_r involving query relation r and/or a set of entity fact triples F_e involving query entity e] for learning r and e (discussed in Sec. 3.3). In Figure 1, (London, LocatedInCountry, UK) is a clue of query relation "LocatedInCountry" and (Harvard University, UniversityLocatedIn, Boston) is an entity fact involving query entity "Boston". In this interaction process, CILK decides which questions to ask and asks them to the user for knowledge acquisition over multiple dialogue turns (see Figure 1). This is step 1 as discussed in Sec. 1 and will be further discussed in Sec. 3.3.
Once SFs are gathered, it uses (K ∪ C_r ∪ F_e) to infer q, which is step 2 in Sec. 1 and will be detailed in Sec. 3.2. We refer to the whole interaction process, involving multi-turn knowledge acquisition followed by the query inference step, as a dialogue session. In summary, CILK is assumed to operate in multiple dialogue sessions with different users, acquiring knowledge in each session and thereby continuously learning new knowledge over time.

Inference Model
Given a query q, the Inference Model M attempts to infer q by predicting the answer entity from E. In particular, it pairs each entity e_i ∈ E with the query to form |E| candidate triples {d_1, ..., d_|E|}, where d_i is of the form (e, r, e_i) for a tail query [or (e_i, r, e) for a head query], and then scores each d_i to quantify how relevant e_i is as an answer to q. The top-ranked entity e_i is returned as the predicted answer of q. We deal with the case of query rejection by M later.
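The candidate-ranking step can be sketched as follows. This is only an illustration under stated assumptions: `rank_candidates` and the toy score table are hypothetical names, and the `score` callable stands in for whatever trained scoring function S(.) is plugged in.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]


def rank_candidates(
    head: Optional[str],
    relation: str,
    tail: Optional[str],
    entities: Iterable[str],
    score: Callable[[Triple], float],
) -> List[Tuple[str, float]]:
    """Score one candidate triple per KB entity and sort best-first."""
    scored = []
    for e_i in entities:
        # Tail query (e, r, ?) -> (e, r, e_i); head query (?, r, e) -> (e_i, r, e).
        candidate = (head, relation, e_i) if tail is None else (e_i, relation, tail)
        scored.append((e_i, score(candidate)))
    # Descending score; ties broken by entity name to keep the order deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy score table standing in for a trained model (hypothetical values).
toy_scores = {("Boston", "LocatedInCountry", "USA"): 2.5,
              ("Boston", "LocatedInCountry", "UK"): 0.3}
ranking = rank_candidates("Boston", "LocatedInCountry", None,
                          ["UK", "USA", "London"],
                          lambda d: toy_scores.get(d, -1.0))
```

Here `ranking[0]` holds the top-ranked entity and its score; a full linear scan over E is the simplest choice and matches the description above, although it costs one scoring call per entity.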
We use the neural knowledge base embedding (KBE) approach (Bordes et al., 2011, 2013; Yang et al., 2014) to design M. Given a KB represented as a triple store, a neural KBE method learns to encode relational information in the KB using low-dimensional representations (embeddings) of entities and relations and uses the learned representations to predict the correctness of unseen triples. In particular, the goal is to learn representations for entities and relations such that valid triples receive high scores (or low energies) and invalid triples receive low scores (or high energies), as defined by a scoring function S(.). The embeddings can be learned via a neural network. In a typical (linear) KBE model, given a triple (h, r, t), the input entities h, t and relation r correspond to high-dimensional vectors (either "one-hot" index vectors or "n-hot" feature vectors) x_h, x_t and x_r respectively, which are then projected into low-dimensional vectors v_h, v_t and v_r using an entity embedding matrix W_E and a relation embedding matrix W_R, as given by

$$v_h = W_E^{\top} x_h, \qquad v_t = W_E^{\top} x_t, \qquad v_r = W_R^{\top} x_r.$$

The scoring function S(.) is then used to compute a validity score S(h, r, t) of the triple.
Any KBE model can be used for learning M. For evaluation, we adopt DistMult (Yang et al., 2014) for its state-of-the-art performance over many other KBE models (Kadlec et al., 2017). The scoring function of DistMult is defined as follows:

$$S(h, r, t) = v_h^{\top} \, \mathrm{diag}(v_r) \, v_t = \sum_{i} v_h[i] \, v_r[i] \, v_t[i].$$

The parameters of M, i.e., W_E and W_R, are learned by minimizing a margin-based ranking objective L, which encourages the scores of positive triples to be higher than those of negative triples:

$$L = \sum_{d^+ \in D^+} \sum_{d^- \in D^-} \max\{S(d^-) - S(d^+) + 1, \; 0\},$$

where D^+ is a set of triples observed in K, treated as positive triples, and D^- is a set of negative triples obtained by corrupting either the head entity or the tail entity of each +ve triple (h, r, t) in D^+, replacing it with a randomly chosen entity h′ or t′ respectively from K such that the corrupted triples (h′, r, t), (h, r, t′) ∉ K. Note that M is trained continuously, by sampling a set of +ve triples and correspondingly constructing a set of -ve triples, as the KB expands with acquired supporting facts; this improves its inference capability over new queries (involving new query relations and entities). Thus, the embedding matrices W_E and W_R also grow linearly over time.
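As a rough illustration of the pieces described above, the sketch below implements the DistMult trilinear score, head/tail corruption for negative sampling, and a pairwise margin ranking loss in plain Python. It is not the authors' implementation: the function names are hypothetical, the margin is exposed as a parameter (default 1.0), and gradient-based training of the embeddings is omitted.

```python
import random
from typing import List, Sequence, Set, Tuple

Triple = Tuple[str, str, str]


def distmult_score(v_h: Sequence[float], v_r: Sequence[float],
                   v_t: Sequence[float]) -> float:
    """DistMult score: sum over dimensions of v_h[i] * v_r[i] * v_t[i]."""
    return sum(a * b * c for a, b, c in zip(v_h, v_r, v_t))


def corrupt(triple: Triple, entities: List[str], kb: Set[Triple],
            rng: random.Random, max_tries: int = 100) -> Triple:
    """Replace the head or the tail by a random entity so the result is not in the KB."""
    h, r, t = triple
    for _ in range(max_tries):
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if candidate not in kb:
            return candidate
    raise RuntimeError("could not sample a negative triple")


def margin_loss(pos_scores: Sequence[float], neg_scores: Sequence[float],
                margin: float = 1.0) -> float:
    """Sum of max(0, margin + S(d-) - S(d+)) over all positive/negative pairs."""
    return sum(max(0.0, margin + s_neg - s_pos)
               for s_pos in pos_scores for s_neg in neg_scores)


# Tiny worked example with hand-picked 2-dimensional embeddings.
example_score = distmult_score([1.0, 2.0], [0.5, 1.0], [2.0, 3.0])  # 1*0.5*2 + 2*1*3
example_loss = margin_loss([2.0], [0.5, 3.0])                        # 0 + (1 + 3 - 2)
```

Because the score is a sum of element-wise products, DistMult is symmetric in h and t; this is a known property of the model rather than something specific to this sketch.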
Rejection in KB Inference. For a query with no answer entity existing in K, CILK attempts to reject the query from being answered. To decide whether to reject the query or not, CILK maintains a threshold buffer T that stores entity and relation specific prediction thresholds and updates it continuously over time, as described below.
Algorithm 1 CILK Knowledge Learning and Inference
Input: query q_j = (e, r, ?) or (?, r, e) issued by the user at session-j; K_j: CILK's KB at session-j; P_j: Performance Buffer at session-j; T_j: Threshold Buffer at session-j; M_j: trained Inference Model at session-j; α: probability of treating an acquired supporting fact as a training triple; ρ: % of entities or relations in K_j that belong to the diffident set.
Output: e: predicted entity as answer of query q_j in session-j.

Besides the dataset for training M, CILK also creates a validation dataset D_vd, consisting of a set of validation query tuples of the form (q, E^+, E^-). Here, q is either a head or tail query involving query entity e and relation r, E^+ = {e^+_1, ..., e^+_p} is the set of p positive (true answer) entities in K and E^- = {e^-_1, ..., e^-_n} is the set of n negative entities randomly sampled from K such that E^+ ∩ E^- = ∅. Let D^e_vd = {(q, E^+, E^-) | (q, E^+, E^-) ∈ D_vd, e ∈ q} be the validation query tuple set involving entity e and D^r_vd = {(q, E^+, E^-) | (q, E^+, E^-) ∈ D_vd, r ∈ q} be the validation query tuple set involving relation r. Then, we compute T[z] (i.e., the prediction threshold for z, where z is either e or r) as the average of the mean scores of triples involving +ve entities and the mean scores of triples involving -ve entities, computed over all q in D^z_vd, given by

$$T[z] = \frac{1}{|D^z_{vd}|} \sum_{(q, E^+, E^-) \in D^z_{vd}} \frac{\mu^+_E + \mu^-_E}{2},$$

where $\mu^+_E = \frac{1}{|E^+|} \sum_{e^+_i \in E^+} S(q, e^+_i)$ and $\mu^-_E = \frac{1}{|E^-|} \sum_{e^-_i \in E^-} S(q, e^-_i)$. Here, S(q, e^+_i) = S(e, r, e^+_i) if q is a tail query and S(e^+_i, r, e) if q is a head query. S(q, e^-_i) is defined analogously.
Given a head or tail query q involving query entity e and relation r, we compute the prediction threshold µ_q for q as µ_q = max{T[e], T[r], 0}.
Inference Decision Making. If e ∈ E is the answer entity predicted by M for query q and S(q, e) > µ_q, CILK responds to the user with answer e. Otherwise, q gets rejected.
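The threshold computation and the answer-or-reject decision can be sketched as below. This is a simplified illustration: each validation tuple is assumed to be already reduced to the model scores of its positive and negative entities, and the names `threshold` and `decide` are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Sequence, Tuple

# One validation tuple reduced to model scores: (scores of E+, scores of E-).
ScoredTuple = Tuple[Sequence[float], Sequence[float]]


def threshold(scored_tuples: List[ScoredTuple]) -> float:
    """T[z]: average over tuples of (mean +ve score + mean -ve score) / 2."""
    return mean((mean(pos) + mean(neg)) / 2.0 for pos, neg in scored_tuples)


def decide(best_entity: str, best_score: float,
           t_e: float, t_r: float) -> Optional[str]:
    """Answer with best_entity if its score exceeds mu_q = max(T[e], T[r], 0)."""
    mu_q = max(t_e, t_r, 0.0)
    return best_entity if best_score > mu_q else None  # None means "reject"


# Two toy validation tuples for some relation r (hypothetical scores).
t_rel = threshold([([4.0, 2.0], [0.0, -2.0]), ([1.0], [1.0])])
```

Taking the maximum of the entity threshold, the relation threshold and zero makes the decision conservative: a prediction has to clear the stricter of the two learned thresholds and must also have a positive score.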

Working of CILK
Given a query q involving an unknown query entity e and/or relation r, CILK has to ask the user to provide supporting facts to learn embeddings of e and r in order to infer q. However, the user in a given session can only provide very few supporting facts, which may not be sufficient for learning good embeddings of e and r. Moreover, to accumulate a sufficiently good validation dataset for learning T[e] and T[r], CILK needs to gather more triples from users involving e and r. But asking for SFs for every entity and/or relation can be annoying to the user, and is also unnecessary if CILK has already learned good embeddings of that entity and/or relation (i.e., CILK has performed well in predicting the true answer entity for queries involving that entity and/or relation in past dialogue sessions with other users). Thus, besides the unknown entities and relations, it is more reasonable to ask for SFs only for those known entities and/or relations for which CILK is not confident about performing inference accurately.
To minimize the rate of user interaction and justify the knowledge acquisition process, CILK uses a performance buffer P to store the performance statistics of CILK in past dialogue sessions. We use Mean Reciprocal Rank (MRR) to measure the performance of M (discussed in Sec. 4.1). In particular, P[e] and P[r] denote the average MRR achieved by M while answering queries involving e and r respectively, evaluated on the validation dataset D_vd. At the end of each dialogue session, CILK detects the set of bottom ρ% query relations and entities in P based on MRR scores evaluated on the validation dataset. We call these sets the diffident relation and entity sets respectively for the next dialogue session. If the query relation and/or entity issued in the next session belongs to the diffident relation or entity set, CILK asks the user for supporting facts⁴. Otherwise, it proceeds with inference, answering or rejecting the query.
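A minimal sketch of how the diffident sets could be derived from the performance buffer is given below. The buffer is modeled as a plain dictionary from relation (or entity) name to average validation MRR; the function names, the toy values and the choice to round the cut-off up are assumptions made for illustration.

```python
import math
from typing import Dict, Set


def diffident_set(perf: Dict[str, float], rho: float) -> Set[str]:
    """Return the bottom rho fraction of keys in the performance buffer, by MRR."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    k = math.ceil(rho * len(perf))                      # rounding up is an assumption
    ranked = sorted(perf, key=lambda z: (perf[z], z))   # lowest MRR first
    return set(ranked[:k])


def should_ask(z: str, known: Set[str], diffident: Set[str]) -> bool:
    """Ask for supporting facts if z is unknown to the KB or currently diffident."""
    return z not in known or z in diffident


# Toy performance buffer for relations (hypothetical MRR values).
perf_rel = {"LocatedInCountry": 0.9, "PresidentOf": 0.1, "BornIn": 0.5,
            "WorksFor": 0.7, "CapitalOf": 0.3}
diffident_rel = diffident_set(perf_rel, rho=0.2)
```

With ρ = 20% and five relations, exactly one relation (the one with the lowest MRR) is diffident, so the user is asked for clues only for that relation and for relations that are not yet in the KB.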
Algorithm 1 shows the interactive knowledge learning and inference process of CILK on a query q_j = (e, r, ?) or (?, r, e) in a given dialogue session-j. Let K_j, P_j, T_j and M_j be the current versions of the KB, performance buffer, threshold buffer and inference model of CILK at the point when session-j starts. Then, the interactive knowledge learning and inference proceeds as follows:
• If r ∉ K_j or r is diffident in P_j, the interaction module I of CILK asks the user to provide clue(s) C_r involving r [Line 1-3]. Similarly, if e ∉ K_j or e is diffident in P_j, I asks the user to provide entity fact(s) F_e involving e [Line 4-6].
• If the user provides C_r and/or F_e, I augments K_j with triples from C_r and F_e respectively and K_j expands to K_{j+1} [Line 7-12]. In this process, a fraction α of the triples in C_r and F_e are randomly marked as training triples and the remaining fraction (1 − α) are marked as validation triples while storing them in K_j.
• Next, sets of training triples D^r_tr, D^e_tr and sets of validation triples D^r_vd, D^e_vd are sampled randomly from K_{j+1}, involving r and e respectively [Line 13-14], for training and evaluating M_j. While sampling, we set the ratio of the number of training triples to that of validation triples as α to maintain a fixed training and validation set distribution. The size of (D^r_tr ∪ D^e_tr) is set to at most N_tr (tuned based on real-time training requirements).
• Next, M_j is trained with (D^r_tr ∪ D^e_tr) and gets updated to M_{j+1} [Line 15]. Note that training M_j with (D^r_tr ∪ D^e_tr) encourages M_j to learn the embeddings of both r and e before inferring q_j. Then, we evaluate M_{j+1} with (D^r_vd ∪ D^e_vd) in order to update the performance buffer P_j into P_{j+1} and the threshold buffer T_j into T_{j+1} [Line 16]. Finally, M_{j+1} is invoked by CILK to either infer q_j, predicting an answer entity e from K_{j+1} [Line 17], or reject q_j to indicate that the true answer does not exist in K_{j+1}. Note that CILK trains M_j and infers q_j [Line 13-17] only if e, r ∈ K_{j+1}.
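The control flow of one dialogue session, as described in the steps above, can be sketched as follows. This is a skeleton under explicit assumptions rather than Algorithm 1 itself: `ask_user`, `train_and_evaluate` and `infer` are hypothetical callbacks standing in for the interaction module, the training/buffer-update step and the inference model, and the sampling of D_tr and D_vd from the KB is left to the `train_and_evaluate` callback.

```python
import random
from typing import Callable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def split_supporting_facts(facts: List[Triple], alpha: float,
                           rng: random.Random) -> Tuple[List[Triple], List[Triple]]:
    """Mark each acquired fact as training (probability alpha) or validation."""
    train: List[Triple] = []
    valid: List[Triple] = []
    for fact in facts:
        (train if rng.random() < alpha else valid).append(fact)
    return train, valid


def known_symbols(kb: Set[Triple]) -> Tuple[Set[str], Set[str]]:
    """Return the (entities, relations) that occur in the KB."""
    entities = {h for h, _, _ in kb} | {t for _, _, t in kb}
    relations = {r for _, r, _ in kb}
    return entities, relations


def run_session(e: str, r: str, kb: Set[Triple], diffident: Set[str],
                ask_user: Callable[[str, str], List[Triple]],
                train_and_evaluate: Callable[[Set[Triple], List[Triple], List[Triple]], None],
                infer: Callable[[str, str, Set[Triple]], Optional[str]],
                alpha: float = 0.9, seed: int = 0) -> Tuple[Set[Triple], Optional[str]]:
    """One session: acquire facts if needed, expand the KB, train, then infer or reject."""
    rng = random.Random(seed)
    entities, relations = known_symbols(kb)
    acquired: List[Triple] = []
    if r not in relations or r in diffident:       # ask for clues C_r
        acquired += ask_user("relation", r)
    if e not in entities or e in diffident:        # ask for entity facts F_e
        acquired += ask_user("entity", e)
    new_train, new_valid = split_supporting_facts(acquired, alpha, rng)
    kb_next = kb | set(acquired)                   # K_j expands to K_{j+1}
    entities, relations = known_symbols(kb_next)
    if e not in entities or r not in relations:
        return kb_next, None                       # cannot train or infer on unknown e / r
    train_and_evaluate(kb_next, new_train, new_valid)  # update M, P and T
    return kb_next, infer(e, r, kb_next)           # answer entity, or None if rejected


def demo_user(kind: str, name: str) -> List[Triple]:
    """Toy simulated user that knows a single entity fact about Boston."""
    facts = {("entity", "Boston"): [("Harvard University", "UniversityLocatedIn", "Boston")]}
    return facts.get((kind, name), [])
```

Passing the three components as callbacks keeps the session logic independent of any particular embedding model, which mirrors the statement in Sec. 3.2 that any KBE model can be used for M.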

Experiments
As indicated earlier, the proposed CILK system is best used in a multi-user environment, so it naturally observes many more query triples (hence, accumulates more facts) from different users over time. Presently CILK fulfills its knowledge learning requirement by only adding the supporting facts into the KB. The predicted query triples are not added as they are unverified knowledge. However, in practice, CILK can store these predicted triples in the KB as well after checking their correctness through cross-verification while conversing with other users in some future related conversations by smartly asking them. Note that CILK may not verify its prediction with the same user who asked the question/query q because he/she may not know the answer(s) for q. However, there is no problem that it acquires the correct answer(s) of q when it asks q to some other user u ′ in a future related conversation and u ′ answers q. At this point, CILK can incorporate q into its KB and also, train itself using triple q. We do not address the is-

Evaluation Setup
Evaluation of CILK with real users in a crowdsourcing-based setup would be very difficult to conduct and prohibitively time-consuming (and expensive), as it needs a large number of real-time and continuous user interactions. Thus, we design a simulated interactive environment for the evaluation.
We create a simulated user (a program) to interact with CILK, where the simulated user issues a query to CILK and CILK answers the query. The (simulated) user has (1) a knowledge base (K_u) for answering questions from CILK, and (2) a query dataset (D_q) from which the user issues queries to CILK.⁵ Here, D_q consists of a set of structured query triples q of the form (e, r, ?) and (?, r, e) readable by CILK. In practice, the user only issues queries to CILK, but cannot evaluate the performance of the system unless the user knows the answer. To evaluate the performance of CILK on D_q in the simulated setting, we also collect the answer set for each query q ∈ D_q (discussed shortly).
As CILK is supposed to perform continuous online knowledge acquisition and learning, we evaluate its performance on the streaming query dataset. We assume that CILK has been deployed with an initial knowledge base (K_b) and that the inference model M has been trained over all triples in K_b for a given number of epochs N_init. We call K_b the base KB of CILK, which serves as its knowledge base at the time point (t_eval) when our evaluation starts. The training process of M using triples in K_b is referred to as the initial training phase of CILK from here on. In the initial training phase, we randomly split the K_b triples into a set of training triples D_tr and a set of validation triples D_vd with a 9:1 ratio (we use α = 0.9) and train M with D_tr. D_vd is used to tune model hyperparameters and to populate the initial performance and threshold buffers P and T respectively. D_tr, D_vd, P and T get updated continuously after t_eval in the online training and evaluation phase (with newly acquired triples) during interaction with the simulated user.
The relations and entities in K_b are regarded as known relations and known entities to CILK until t_eval. Thus, the initial inference model M is trained and validated with triples involving only known relations and known entities (in K_b). During the online training and evaluation phase, CILK faces queries (from D_q) involving both known and unknown relations and entities. More specifically, if a relation (entity) appearing in a query q ∈ D_q exists in K_b, we consider that query relation (entity) a known query relation (entity). Otherwise, it is referred to as an unknown query relation (entity).
We create the simulated user's KB K_u, the base KB (K_b) and the query dataset D_q from two standard KB datasets: (1) WordNet (Bordes et al., 2013) and (2) Nell (Gardner et al., 2014). From each KB dataset, we first build a fairly large triple store and use it as the original KB (K_org), and then create K_u of the user, the base KB (K_b) of CILK and D_q from K_org, as discussed below (Table 1 shows the results).
Simulated User, Base KB Creation and Query Dataset Generation. In Nell, we found 150 relations with ≥ 300 triples, and we randomly selected 25 relations for D_q. We shuffle the list of 25 relations, select 34% of them as unknown relations and consider the rest (66%) as known relations.
For each known relation r, we randomly shuffle the list of distinct triples for r, choose (at most) 250 triples and randomly select 20% of them as test triples. Of the remaining triples, a randomly chosen subset, together with the leftovers (those not in the list of 250), is added to K_b, and the other subset is added to K_u (to provide supporting facts involving poorly learned known relations and/or entities, if asked [see Sec. 3.3]).
For each unknown relation r, we remove all triples of r from K_org, randomly choose 20% of them and reserve them as query triples for unknown r. The remaining 80% of the triples of unknown r are added to K_u (for providing clues). In this process, we also make sure that the query instances involving unknown r are excluded from K_u. This ensures that the user cannot provide the query triple itself as a clue to CILK (during inference) and simulates the case that the user does not know the answer to the query it issues. Note that if the user cannot provide a clue for an unknown query relation or a fact for an unknown query entity (not likely), CILK will not be able to correctly answer the query.
At this point, D_q consists of query triples involving both known and unknown relations, but only known entities. To create queries in D_q having unknown entities, we randomly choose 20% of the entities in D_q triples, remove all triples involving those entities from K_org and add them to K_u. Now, K_org gets reduced to K_b (the base KB). Next, for each query triple (h, r, t) ∈ D_q, we convert the triple into a head query q = (?, r, t) [or a tail query q = (h, r, ?)] by randomly deleting the head or tail entity. We also collect the answer set for each q ∈ D_q based on observed triples in K_org for CILK evaluation. Note that the generated query triples (with answer entity) in D_q are not directly in K_b or K_u.
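The last two steps of the dataset construction, turning a held-out triple into a head or tail query and collecting its answer set from the original KB, can be sketched as follows. The function names and the toy triples are hypothetical; a query is represented as a triple with `None` in the '?' slot.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Query = Tuple[Optional[str], str, Optional[str]]


def to_query(triple: Triple, rng: random.Random) -> Query:
    """Blank out the head or the tail (chosen at random) to form a query."""
    h, r, t = triple
    return (None, r, t) if rng.random() < 0.5 else (h, r, None)


def collect_answers(queries: Iterable[Query],
                    kb_org: Iterable[Triple]) -> Dict[Query, Set[str]]:
    """Map each query to all entities that complete it in the original KB."""
    tails = defaultdict(set)   # (h, r) -> {t}
    heads = defaultdict(set)   # (r, t) -> {h}
    for h, r, t in kb_org:
        tails[(h, r)].add(t)
        heads[(r, t)].add(h)
    answers: Dict[Query, Set[str]] = {}
    for q in queries:
        h, r, t = q
        answers[q] = set(heads[(r, t)]) if h is None else set(tails[(h, r)])
    return answers


# Toy original KB (hypothetical triples).
kb_org = [("a", "r", "b"), ("c", "r", "b"), ("a", "r", "d")]
answer_sets = collect_answers([(None, "r", "b"), ("a", "r", None)], kb_org)
```

Collecting a set of answers rather than a single entity matters because a query such as (?, r, b) can have several correct heads in the original KB, and any of them should count as a true answer during evaluation.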
The WordNet dataset being small, we use all its 18 relations for creating D_q, K_u and K_b, following the same procedure as for Nell. As mentioned earlier, the triples in K_b are randomly split into 90% training and 10% validation datasets for simulating the initial training phase of CILK.
Hyper-parameter Settings. The embedding dimensions of entities and relations are empirically set to 250 for both WordNet and Nell, the initial training epochs N_init to 100 for WordNet (140 for Nell), the training batch size to 128, N_tr to 500, |D^r_vd ∪ D^e_vd| to 50, α = 0.9, ρ = 20%, the random seed to 1000, the number of negative triples generated per positive triple to 4, the online training epochs to 5 (2) for each closed-world (open-world) query, and the learning rate to 0.001 for both KB datasets. The L2-regularization parameter is set to 0.001. The Adam optimizer is used for optimization.
Compared Models. Since there is no existing work that solves our proposed problem, we compare various versions of CILK, constructed based on different types of prediction threshold µ_q for query rejection (Sec. 3.2) and various sampling strategies for the online training dataset D_tr = (D^r_tr ∪ D^e_tr) and validation dataset D_vd = (D^r_vd ∪ D^e_vd) [see Line 13-14 of Algorithm 1], as discussed below:
• CILK variants based on prediction threshold types, namely EntTh-BTr, RelTh-BTr, MinTh-BTr and MaxTh-BTr (see Table 2). For EntTh-BTr, …
• CILK variants based on dataset sampling strategies: MaxTh-BTr (as explained above), MaxTh-EntTr and MaxTh-RelTr (see Table 2). Given the query entity e and query relation r, MaxTh-EntTr samples only triples involving e and MaxTh-RelTr samples only triples involving r to build D_tr and D_vd.
Note that if the sampled dataset D_tr (D_vd) is ∅, CILK skips the online training (validation) steps for that session.
Table 2: The (threshold) variants are denoted as "X-BTr" and the last three (dataset sampling strategy) variants are denoted as "MaxTh-X". The highest H@1 and H@10 values (among each of the groups of four and three) are marked in bold; thus, some columns have at most two values marked bold (due to the two comparison groups). MaxTh-BTr in the table is the version of CILK proposed in Sec. 3.
Evaluation Metrics. We use two common KBE evaluation metrics: mean reciprocal rank (MRR) and Hits@k (H@k). MRR is the average inverse rank of the top-ranked true answer entity over all queries (Bordes et al., 2013). Hits@k is the proportion of test queries for which the true answer entity appears in the top-k (ranked) predictions. Higher MRR and Hits@k indicate better performance.
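For reference, the two metrics can be computed as in the sketch below. Each result is assumed to be a pair (ranked predictions, answer set); assigning a reciprocal rank of 0.0 when no true answer appears in the ranking is a convention chosen here for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Result = Tuple[Sequence[str], Set[str]]  # (ranked predictions, true answers)


def reciprocal_rank(ranked: Sequence[str], answers: Set[str]) -> float:
    """1 / rank of the best-ranked true answer; 0.0 if none appears (assumption)."""
    for rank, entity in enumerate(ranked, start=1):
        if entity in answers:
            return 1.0 / rank
    return 0.0


def hits_at_k(ranked: Sequence[str], answers: Set[str], k: int) -> bool:
    """True if any true answer appears among the top-k predictions."""
    return any(entity in answers for entity in ranked[:k])


def evaluate(results: List[Result], k: int) -> Tuple[float, float]:
    """Return (MRR, Hits@k) averaged over all results."""
    if not results:
        raise ValueError("no results to evaluate")
    mrr = sum(reciprocal_rank(r, a) for r, a in results) / len(results)
    hits = sum(hits_at_k(r, a, k) for r, a in results) / len(results)
    return mrr, hits


# Two toy queries: the answer is ranked 1st in one and 3rd in the other.
toy_results = [(["USA", "UK"], {"USA"}), (["UK", "France", "USA"], {"USA"})]
toy_mrr, toy_h1 = evaluate(toy_results, k=1)
```

In this toy case the reciprocal ranks are 1 and 1/3, so the MRR is their average, and Hits@1 is one half because only the first query has its answer at rank 1.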

Results and Analysis
For evaluation on a given KB (WordNet or Nell), we randomly generate a chronological ordering of all query instances in D q , which are fed to the trained CILK (after the initial training phase is over) in a streaming fashion, and then evaluate CILK on the overall query dataset. The avg. test query processing time of CILK is 1.25 sec (on a Nvidia Titan RTX GPU). While evaluating a query q j , if the true answer of q j does not exist in KB K j+1 and M j+1 rejects q j , we consider it as a correct prediction. For such q j , Reciprocal Rank (RR) cannot be computed. Thus, we exclude q j while computing MRR, but consider it in computing Hits. Table 2 shows the performance of CILK variants on the query dataset, evaluated in terms of MRR, H@1 and H@10 for both KBs. We present the overall result on the whole query dataset as well as results on subsets of query datasets, denoted as (Rel-X, Ent-Y), where X and Y can be either known ('K') or unknown ('UNK') and 'Rel' denotes query relation and 'Ent' denotes query entity. So, here, (Rel-K, Ent-UNK) denotes the subset of the query dataset that contains query triples involving only known query relations and unknown query entities (with respect to K b ). For all variants, we fix the maximum number of clue triples and entity fact triples provided by the simulated user for each query (when asked) as 1 and 3 respectively.
From Table 2, we see that, MaxTh-BTr (version of CILK in Sec. 3) achieves the overall best results compared to other variants for both KB datasets. Among different threshold versions, MaxTh-BTr and MinTh-BTr perform better than the rest. The relatively poor result of RelTh-BTr shows threshold strategy plays a vital role in performance improvement. Considering different dataset sampling strategies, again we see MaxTh-BTr performs better than other versions. As the triples involving both query entity and relation are selected for online training in MaxTh-BTr, CILK gets specifically trained on relevant (queryspecific) triples before the query is answered. For other variants, either triples involving query relation (for MaxTh-EntTr) or triples involving query entity (for MaxTh-RelTr) are discarded, causing a drop in performance.
In Table 3, we compare the CILK threshold variants based on how often each predicts an answer when the true answer exists in its current KB, given by Pr(pred | AE), and how often it rejects the query when the true answer does not exist, given by Pr(Reject | ¬AE). For both datasets, EntTh-BTr has a tendency to predict more and reject less, whereas RelTh-BTr is more cautious in prediction. MinTh-BTr is the least cautious in prediction among all. MaxTh-BTr adopts the best of both worlds (EntTh-BTr and RelTh-BTr), showing a moderate strategy in its prediction and rejection behavior.

Table 4 compares the performance of MaxTh-BTr when varying the maximum number of clue triples and entity fact triples provided by the user (when asked). Comparing (1, 1), (1, 2) and (1, 3), we see a clear performance improvement in MaxTh-BTr with the increase in (acquired) entity fact triples (especially for WordNet). This shows that if the user interacts more and provides more information for a given query, CILK can gradually improve its performance over time [i.e., with more accumulated triples in its KB]. For Nell, performance improves for both (1, 2) and (1, 3) compared to (1, 1), with the (1, 2) variant being the best overall. Comparing (1, 3) and (2, 2) for both KBs, we see that acquiring more entity facts contributes more to the overall performance improvement than acquiring more clues. This is because a past query relation is more likely to appear in a future query than a past query entity, so CILK can gradually learn the relation embedding with fewer clues per query, unlike for an entity. (1, 3)-U denotes the setup where CILK asks for clues or entity facts only if the query triple has an unknown entity and/or relation, i.e., we disable the use of the performance buffer P (see Sec 3.3). Due to the lack of sufficient training triples to learn an unknown query relation and entity, the overall performance degrades. This shows the importance and effectiveness of the performance buffer in improving the performance of CILK with limited user interactions.
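The two conditional rates used to compare threshold variants in Table 3, Pr(pred | AE) and Pr(Reject | ¬AE), can be estimated from per-query logs as simple conditional frequencies. The sketch below is illustrative only: the function name `prediction_rejection_rates` and the `(answer_exists, rejected)` record layout are assumptions, not part of CILK.

```python
from typing import Iterable, Tuple


def prediction_rejection_rates(
    records: Iterable[Tuple[bool, bool]]
) -> Tuple[float, float]:
    """Estimate (Pr(pred | AE), Pr(Reject | not AE)).

    Each record is a hypothetical (answer_exists, rejected) pair:
    answer_exists is True when the true answer is in the current KB,
    rejected is True when the model declined to answer.
    """
    ae_total = ae_pred = 0      # queries whose answer exists / predicted ones
    nae_total = nae_rej = 0     # queries whose answer is absent / rejected ones
    for answer_exists, rejected in records:
        if answer_exists:
            ae_total += 1
            ae_pred += int(not rejected)
        else:
            nae_total += 1
            nae_rej += int(rejected)
    # NaN marks a rate that is undefined because its condition never occurred.
    p_pred = ae_pred / ae_total if ae_total else float("nan")
    p_rej = nae_rej / nae_total if nae_total else float("nan")
    return p_pred, p_rej


records = [
    (True, False),   # answer exists, model predicts
    (True, False),   # answer exists, model predicts
    (True, True),    # answer exists, model rejects
    (False, True),   # answer absent, model rejects
    (False, False),  # answer absent, model predicts
]
p_pred, p_rej = prediction_rejection_rates(records)
```

A variant that predicts more and rejects less would show a high first rate and a low second rate on such logs, while a cautious variant would show the opposite pattern.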
In Table 5, we show the performance of MaxTh-BTr on (predicted) test queries over time. Considering overall performance, the improvement is marginal. However, for open-world queries, there is a substantial improvement in performance, as CILK acquires relatively more facts for open-world queries than for closed-world ones.

CILK: Use Cases in Dialogue Systems
There are many applications for CILK. Conversational QA systems (Kiyota et al., 2002; Bordes et al., 2014), conversational recommendation systems (Anelli et al., 2018), information-seeking conversational agents, etc., that deal with real-world facts, are all potential use cases for CILK.
Recent work showed that dialogue models augmented with commonsense facts improve dialogue generation performance. Continuous knowledge learning using CILK can help these models grow their KBs over time and thereby improve their response generation quality.
The proposed version of CILK has been designed based on a set of assumptions (see Sec. 1) to reduce the complexity of the modeling. For example, we do not handle the case of intentional or unintentional false knowledge injection by users, which could corrupt the system's KB. Also, we do not deal with fact extraction errors of the peripheral information extraction module or query parsing errors of the semantic parsing modules, which can affect the knowledge learning of CILK. We believe these are separate research problems and are out of the scope of this work. In the future, we plan to model an end-to-end approach to knowledge learning where all peripheral components of CILK can be jointly learned with CILK itself. We also plan to address the cold-start problem that arises when there is little training data for a new relation when it is first added to the KB.
Clearly, CILK does not learn all forms of knowledge. For example, it does not learn new concepts and topics, user traits and personality, or speaking styles. Learning these also forms part of our future work.

Conclusion
In this paper, we proposed a continuous (or lifelong) and interactive knowledge learning engine, CILK, for dialogue systems. It exploits the situation in which the system is unable to answer a WH-question from the user (given its existing KB) by asking the user for some knowledge and using it to infer the query answer. We evaluated the engine on two real-world factual KB datasets and observed promising results. This also shows the potential of CILK to serve as a factual knowledge learning engine for future conversational agents.