Bridging the Gap between Prior and Posterior Knowledge Selection for Knowledge-Grounded Dialogue Generation

Knowledge selection plays an important role in knowledge-grounded dialogue, a challenging task that aims to generate more informative responses by leveraging external knowledge. Recently, latent variable models have been proposed to deal with the diversity of knowledge selection by using both prior and posterior distributions over knowledge, and they achieve promising performance. However, these models suffer from a huge gap between prior and posterior knowledge selection. Firstly, the prior selection module may not learn to select knowledge properly because it lacks the necessary posterior information. Secondly, latent variable models suffer from the exposure bias that dialogue generation is based on the knowledge selected from the posterior distribution at training but from the prior distribution at inference. Here, we deal with these issues in two ways: (1) We enhance the prior selection module with the necessary posterior information obtained from the specially designed Posterior Information Prediction Module (PIPM); (2) We propose a Knowledge Distillation Based Training Strategy (KDBTS) to train the decoder with the knowledge selected from the prior distribution, removing the exposure bias of knowledge selection. Experimental results on two knowledge-grounded dialogue datasets show that both PIPM and KDBTS achieve performance improvement over the state-of-the-art latent variable model, and their combination shows further improvement.


Introduction
Knowledge-grounded dialogue (Ghazvininejad et al., 2018), which leverages external knowledge to generate more informative responses, has become a popular research topic in recent years. Many researchers have studied how to effectively leverage the given knowledge to enhance dialogue understanding and/or improve dialogue generation (Zhao et al., 2019b; Sun et al., 2020; Madotto et al., 2018; Yavuz et al., 2019; Tang and Hu, 2019; Li et al., 2019; Zheng and Zhou, 2019; Ye et al., 2020). However, these works usually use pre-identified knowledge (Zhang et al., 2018; Moghe et al., 2018), which is not available in some real-world scenarios, or leverage a retrieval system to obtain the knowledge, which may contain noisy and irrelevant data (Chaudhuri et al., 2018; Parthasarathi and Pineau, 2018; Zhou et al., 2018; Gopalakrishnan et al., 2019). Recently, Dinan et al. (2019) proposed to decompose this task into two subproblems: knowledge selection and response generation. This pipeline framework has been widely used in the open-domain setting (Chen et al., 2017; Min et al., 2018; Nie et al., 2019) and shows promising performance with explicit use of knowledge in knowledge-grounded dialogue (Dinan et al., 2019).

Context: I just got a husky puppy.

Knowledge Pool:
1. Husky is a general name for a sled type of dog used in northern regions, differentiated from other sled-dog types by their fast pulling style.
2. Huskies are also today kept as pets, and groups work to find new pet homes for retired racing and adventure trekking dogs.
3. Huskies are used in sled dog racing.
4. The use of "husk" is recorded from 1852 for dogs kept by Inuit people ...
...
L. Child of the Wolves is a children's novel, published in 1996, about a Siberian husky puppy that joins a wolf pack.

Response a: It sounds cute! Huskies are known amongst sled-dogs for their fast pulling style.

Response b: It sounds cute! I have read a novel about a husky puppy joining a wolf pack. Is your husky puppy wolf-like in appearance?

Table 1: An example showing the diversity of knowledge selection in knowledge-grounded dialogue. We show two different responses with two possible selection decisions. For the same context, there may be diverse knowledge sentences that lead to different responses, which in turn reflect the selection decisions. The prior knowledge selection depends only on the context, while the posterior knowledge selection uses both the context and the response (Lian et al., 2019).

Knowledge selection plays an important role in open-domain knowledge-grounded dialogue (Dinan et al., 2019), since inappropriate knowledge selection may prevent the model from leveraging the knowledge accurately (Lian et al., 2019), or even lead to an inappropriate response. The example in Table 1 shows two phenomena: (1) There may exist one-to-many relations between the dialogue context and the knowledge, resulting in the diversity of knowledge selection (Kim et al., 2020); (2) The posterior knowledge selection with context and response is much easier than the prior knowledge selection, which depends only on the context. It is rather intuitive that we can dramatically reduce the scope of knowledge selection when we know the knowledge contained in the response, while such posterior information is not available at inference.
Recently, latent variable models (Lian et al., 2019;Kim et al., 2020), using the posterior distribution to guide the prior distribution, have been proposed to deal with the diversity of knowledge selection. They can jointly model knowledge selection with response generation and achieve promising performance. Despite their success, latent variable models suffer from a huge gap between prior and posterior knowledge selection as discussed below.
We analyze the gap in latent variable models from two aspects: (1) Compared with the posterior selection module, the prior selection module has no access to the necessary posterior information. As a result, it is hard for the prior distribution to approximate the posterior distribution correctly at training, so the prior selection module may not select knowledge properly at inference. (2) Response generation in latent variable models is based on the knowledge selected from the posterior distribution at training but from the prior distribution at inference. This discrepancy, also named exposure bias, leads to a gap between training and inference (Ranzato et al., 2015; Zhang et al., 2019; Zhao et al., 2019a), and the decoder may therefore have to generate a response with inappropriate knowledge selected from an unfamiliar prior distribution.
In this paper, we propose to bridge the gap between prior and posterior knowledge selection for knowledge-grounded dialogue generation. Firstly, we enhance the prior selection module with the necessary posterior information which is obtained by the specially designed Posterior Information Prediction Module (PIPM). Secondly, inspired by knowledge distillation (Hinton et al., 2015), we design a Knowledge Distillation Based Training Strategy (KDBTS) to train the decoder with the knowledge selected by the prior module, removing the exposure bias of knowledge selection. Experimental results show that both PIPM and KDBTS bring performance improvement on two knowledge-grounded dialogue datasets, i.e., Wizard of Wikipedia (Dinan et al., 2019) and Holl-E (Moghe et al., 2018). And the combination of PIPM and KDBTS obtains the new state-of-the-art performance with further improvement.
Our contributions are summarized as follows:
• We clearly point out the gap between prior and posterior knowledge selection in latent variable models and propose to enhance the prior selection module with the necessary posterior information. Moreover, we explore several variants of posterior information.
• We focus on the exposure bias of knowledge selection for knowledge-grounded dialogue, and design a knowledge distillation based training strategy to deal with it.
• Experimental results show that both PIPM and KDBTS bring performance improvement, and their combination achieves the state-of-the-art performance with further improvement.

Task Formulation
For a dialogue with T turns, each turn is a pair of a message x_t and a response y_t. Besides, each turn is associated with a knowledge pool K_t = {k_t^1, ..., k_t^L} consisting of L sentences. The context ctx_t consists of the dialogue history hist_t = [x_1, y_1, ..., x_{t-1}, y_{t-1}] and the message x_t.
Given the context ctx_t, we first select a knowledge sentence k_t^sel ∈ K_t from the knowledge pool, then leverage the selected knowledge to generate an informative response y_t. The selection history at the t-th turn is kh_t = [k_1^sel, ..., k_{t-1}^sel].
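To make the notation above concrete, the turn structure and the two derived sequences (context and selection history) can be sketched in Python; the class and helper names here are our own illustration, not part of any released code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    """One dialogue turn: a message x_t, a response y_t, and a knowledge pool K_t."""
    message: str
    response: str
    knowledge_pool: List[str]
    selected_idx: int = -1  # index of k_t^sel once a selection has been made

def build_context(turns: List[Turn], t: int) -> List[str]:
    """ctx_t = dialogue history [x_1, y_1, ..., x_{t-1}, y_{t-1}] plus message x_t."""
    hist: List[str] = []
    for turn in turns[:t]:
        hist.extend([turn.message, turn.response])
    return hist + [turns[t].message]

def selection_history(turns: List[Turn], t: int) -> List[str]:
    """kh_t = [k_1^sel, ..., k_{t-1}^sel], skipping turns with no selection yet."""
    return [turn.knowledge_pool[turn.selected_idx]
            for turn in turns[:t] if turn.selected_idx >= 0]
```

Note that the context grows with the whole dialogue history, while the selection history only accumulates one knowledge sentence per past turn.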

Latent Knowledge Selection For Response Generation
To obtain the likelihood of the response y_t, latent variable models treat the knowledge k as a latent variable and marginalize over all possible knowledge sentences in K_t:

    p(y_t | ctx_t, kh_t) = E_{k ∼ π_θ(K_t)} [ p_θ(y_t | ctx_t, k) ],    (1)

where p_θ(y_t | ctx_t, k) is the decoder network, and π_θ(K_t), short for π_θ(K_t | ctx_t, kh_t), is the prior distribution over the knowledge pool K_t based on the context ctx_t and the selection history kh_t. The evidence lower bound (ELBO) is written as:

    L_ELBO = E_{k ∼ q_φ(K_t)} [ log p_θ(y_t | ctx_t, k) ] − KL( q_φ(K_t) || π_θ(K_t) ),    (2)

where q_φ(K_t), short for q_φ(K_t | ctx_t, y_t, kh_t), is an inference network used to approximate the true posterior distribution p(K_t | ctx_t, y_t, kh_t).
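As a toy illustration (not the authors' code), the ELBO for a categorical latent knowledge variable can be computed directly; by Jensen's inequality and the non-negativity of KL it always lower-bounds the exact marginal log-likelihood.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def elbo(prior_logits, post_logits, log_lik):
    """ELBO of Equation 2 for a categorical latent knowledge variable.
    log_lik[l] stands for log p_theta(y_t | ctx_t, k_t^l)."""
    pi = softmax(np.asarray(prior_logits, dtype=float))   # pi_theta(K_t)
    q = softmax(np.asarray(post_logits, dtype=float))     # q_phi(K_t)
    expected_ll = float(np.dot(q, log_lik))               # E_{k~q}[log p(y|ctx,k)]
    kl = float(np.sum(q * (np.log(q) - np.log(pi))))      # KL(q || pi)
    return expected_ll - kl

def marginal_log_lik(prior_logits, log_lik):
    """Exact marginal log-likelihood of Equation 1: log E_{k~pi}[p(y|ctx,k)]."""
    pi = softmax(np.asarray(prior_logits, dtype=float))
    return float(np.log(np.dot(pi, np.exp(log_lik))))
```

The bound is tight exactly when the posterior q_φ matches the true posterior over knowledge, which is what motivates training the inference network.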
The Gap between Prior and Posterior Knowledge Selection. Firstly, compared with the posterior selection module, the prior selection module has no access to the posterior information, as shown in Figure 1. As a result, it is hard for the prior distribution to approximate the posterior distribution correctly by minimizing the KL divergence in Equation 2 at training. Hence, the prior selection module may not select knowledge properly at inference. Secondly, comparing the two expectation terms in Equations 1 and 2, we see that the selected knowledge for response generation at training and inference is drawn from different distributions, i.e., the posterior distribution k ∼ q_φ(K_t) at training and the prior distribution k ∼ π_θ(K_t) at inference. Figure 1 clearly shows this discrepancy, which forces the decoder to generate with knowledge selected from the unfamiliar prior distribution. These issues lead to the gap between prior and posterior knowledge selection, which we address in this paper.

Sequential Knowledge Transformer
Recently, Kim et al. (2020) proposed the Sequential Knowledge Transformer (SKT), the state-of-the-art latent variable model for knowledge selection. Here, we briefly describe SKT, based on which we validate the effectiveness of our approach.
Sentence Encoding. For any sentence sent_t with N_w words at the t-th turn, SKT uses a shared BERT (Devlin et al., 2019) to obtain the d-dimensional context-aware word representations H_t^sent, and then converts them into the sentence representation h_t^sent by mean pooling (Cer et al., 2018). In this way, the representations of the message, the dialogue history, the response and each knowledge sentence are obtained. The summary s_{t-1}^kh of the selection history kh_t = [k_1^sel, ..., k_{t-1}^sel] and the summary s_{t-1}^hist of the dialogue history are then used to compute the prior attention a_t^prior and the posterior attention a_t^post over the knowledge pool. Finally, the knowledge k_t^sel is selected by sampling from the posterior distribution q_φ(K_t) = a_t^post(K_t) at training, and by taking the highest probability under the prior distribution π_θ(K_t) = a_t^prior(K_t) at inference.

Generation with Knowledge. SKT takes the concatenation of the message x_t and the selected knowledge sentence k_t^sel as input and generates the response with a Transformer decoder (Vaswani et al., 2017) with copy mechanism. Though various models study how to improve the generation quality given the knowledge, here we simply follow the decoder of SKT and mainly focus on the knowledge selection issue.
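The mean pooling step and the two selection rules (sampling from the posterior at training, argmax over the prior at inference) can be sketched as follows; both functions are our own minimal illustration, not the SKT implementation.

```python
import numpy as np

def mean_pool(word_reprs):
    """Convert context-aware word representations H (N_w x d) into a
    sentence representation h (d,) by mean pooling."""
    return np.asarray(word_reprs, dtype=float).mean(axis=0)

def select_knowledge(prior_probs, post_probs, training, rng=None):
    """Sample k ~ q_phi(K_t) at training; take the argmax of pi_theta(K_t)
    at inference."""
    if training:
        rng = rng or np.random.default_rng()
        return int(rng.choice(len(post_probs), p=post_probs))
    return int(np.argmax(prior_probs))
```

This asymmetry between the two branches of `select_knowledge` is exactly the train/inference discrepancy that the paper identifies as exposure bias.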

Approach
In this section, we show how to bridge the gap between prior and posterior knowledge selection in knowledge-grounded dialogue. Firstly, we design the Posterior Information Prediction Module (PIPM) to enhance the prior selection module with the necessary posterior information. Secondly, we design a Knowledge Distillation Based Training Strategy (KDBTS) to train the decoder with the knowledge selected from the prior distribution, removing the exposure bias of knowledge selection.

Posterior Information Prediction Module
As shown in Figure 2, we design a Posterior Information Prediction Module (PIPM) to predict the necessary posterior information. The main motivation is that we want to enhance the prior selection module with the necessary posterior information, so that it could approximate the posterior distribution better at training and leverage the posterior information for knowledge selection at inference. Following the typical setting in latent variable models (Lian et al., 2019;Kim et al., 2020), we use the response in bag-of-words (BOW) format (Zhao et al., 2017) as the posterior information. Here we take the dialogue context and the knowledge pool as input to generate the posterior information.
We firstly summarize the context as the query of this module, q_t^PI = [s_{t-1}^hist; h_t^x], and use it to obtain the attention distribution a_t^PI(K_t) over the knowledge pool K_t, in the same way as the prior selection module. Then, we summarize the knowledge representations in the knowledge pool with the weights in a_t^PI(K_t):

    h_t^PI = Σ_{l=1}^{L} a_t^PI(k_t^l) h_t^{k^l}.    (7)

Secondly, we take the summarization of the dialogue context and the knowledge pool as input and use a position-wise feed-forward network (FFN) (Vaswani et al., 2017) to generate the posterior information Î_t in BOW format:

    Î_t = softmax( FFN([q_t^PI; h_t^PI]) ).    (8)

Finally, we use the generated posterior information Î_t to obtain the updated prior query q̃_t^prior:

    q̃_t^prior = [q_t^prior; E^T Î_t],    (9)

where E ∈ R^{|V|×d} is the embedding matrix and |V| is the vocabulary size. Compared with q_t^prior, the updated query carries the predicted posterior information. We supervise this module by an additional loss L_PIPM with the ground-truth posterior information I, i.e., the bag of tokens in the golden response:

    L_PIPM = −(1/|I|) Σ_{w∈I} log Î_t(w).    (10)

Note that we remove the context words from I because this information is already contained in the prior query q_t^prior.
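The PIPM forward pass can be sketched numerically as below. The weight shapes, the ReLU activation, and the concatenation layout are illustrative assumptions rather than the exact published configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def pipm_forward(query, knowledge_reprs, W1, b1, W2, b2, embedding):
    """Attend over the knowledge pool, predict a BOW distribution over the
    vocabulary, and append the predicted information to the prior query."""
    K = np.asarray(knowledge_reprs, dtype=float)       # L x d knowledge sentences
    attn = softmax(K @ query)                          # a_t^PI(K_t)
    pool_summary = attn @ K                            # weighted knowledge summary
    hidden = np.maximum(0.0, W1 @ np.concatenate([query, pool_summary]) + b1)
    I_hat = softmax(W2 @ hidden + b2)                  # predicted BOW distribution
    updated_query = np.concatenate([query, embedding.T @ I_hat])
    return I_hat, updated_query

def pipm_loss(I_hat, target_ids):
    """L_PIPM: average negative log-probability of the golden response tokens."""
    return -float(np.mean(np.log(I_hat[target_ids] + 1e-12)))
```

The updated query concatenates the original prior query with an expected embedding under the predicted BOW distribution, so the prior selection module can condition on the predicted posterior information at inference.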

Knowledge Distillation Based Training Strategy
Current latent variable models (Lian et al., 2019; Kim et al., 2020) suffer from the exposure bias of knowledge selection, as shown in Figure 1. Therefore, the decoder may have to generate a response with inappropriate knowledge selected from an unfamiliar prior distribution. Inspired by knowledge distillation (Hinton et al., 2015), we design the Knowledge Distillation Based Training Strategy (KDBTS) to deal with this exposure bias. KDBTS is a two-stage training strategy: we first train the posterior selection module as the teacher, and then leverage the well-trained posterior module to teach the prior selection module as the student.
First Training Stage. We train a teacher at this stage, which is used to guide the student at the next stage. We can obtain a teacher as a by-product, i.e., the posterior selection module, from the training process of latent variable models in Figure 1. However, we find the posterior selection module is affected by the prior distribution when minimizing the KL term in Equation 2, and experiments in Section 5.2 show that it is usually not good enough. As a result, we introduce a "fix" operation to make sure that the posterior selection module cannot be affected by the prior distribution, and replace the KL term L_KL with the fixed KL term:

    L_KL^fix = KL( fix[q_φ(K_t)] || π_θ(K_t) ),    (11)

where fix[·] stops the gradient from flowing into the posterior distribution. The total loss for training the teacher is:

    L_teacher = L_NLL^post + L_KL^fix − λ log q_φ(k_t^a),    (12)

where L_NLL^post = −E_{k∼q_φ(K_t)}[log p_θ(y_t | ctx_t, k)] is the negative of the expectation term in Equation 2, k_t^a ∈ K_t is the golden selected knowledge for the knowledge loss −log q_φ(k_t^a) proposed by Kim et al. (2020), and λ is a hyperparameter.
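The effect of the "fix" operation can be seen from the gradient of the KL term. For KL(q || softmax(z)) with q treated as a constant, the gradient with respect to the prior logits z is softmax(z) − q, so minimizing the fixed KL moves only the prior toward the (unchanged) posterior. A small numerical check with toy distributions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def kl(q, p):
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q) - np.log(p))))

def fixed_kl_grad(prior_logits, q_fixed):
    """Gradient of KL(q_fixed || softmax(prior_logits)) w.r.t. the prior
    logits when q_fixed is treated as a constant (the 'fix' operation)."""
    return softmax(np.asarray(prior_logits, dtype=float)) - np.asarray(q_fixed, dtype=float)
```

Without the fix, the same KL term also produces gradients into q_φ, which is exactly how the posterior module gets dragged toward the (weaker) prior.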
Second Training Stage. Once we obtain the teacher, we leverage the posterior distribution from the well-trained teacher to deal with the diversity of knowledge selection. At this training stage, we feed the knowledge selected by the prior module into the decoder as shown in Figure 2, which is the same as the inference process. Here, we only update the prior selection module and the decoder (the green blocks in Figure 2) because the encoder is shared by the student and the teacher. The total loss for training the student at this stage is:

    L_student = L_NLL^prior + L_KL^fix − λ log π_θ(k_t^a),    (13)

where L_NLL^prior = −E_{k∼π_θ(K_t)}[log p_θ(y_t | ctx_t, k)] and −log π_θ(k_t^a) is the knowledge loss. Compared with L_NLL^post defined in Equation 2, L_NLL^prior optimizes the decoder with knowledge selected from the prior distribution. Figure 2 clearly shows that KDBTS removes the exposure bias of knowledge selection by feeding the knowledge selected from the prior distribution into the decoder.
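The second-stage objective can be assembled as in this toy sketch; placing λ on the knowledge-loss term is an assumption here, and all inputs are toy values rather than model outputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def student_loss(prior_logits, teacher_post_probs, decoder_nll, gold_idx, lam=0.5):
    """Second-stage (student) objective: decoder NLL with knowledge drawn
    from the prior, the KL toward the fixed teacher posterior, and a
    knowledge loss on the golden sentence (lambda weighting is assumed)."""
    pi = softmax(np.asarray(prior_logits, dtype=float))
    q = np.asarray(teacher_post_probs, dtype=float)
    kl = float(np.sum(q * (np.log(q) - np.log(pi))))   # KL toward fixed teacher
    knowledge_loss = -float(np.log(pi[gold_idx]))      # -log pi_theta(k_t^a)
    return decoder_nll + kl + lam * knowledge_loss
```

A prior that concentrates on the golden (and teacher-preferred) knowledge yields a strictly lower loss, which is the pressure that closes the prior-posterior gap.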

Dataset
We adopt two multi-turn knowledge-grounded dialogue datasets for experiments. Wizard of Wikipedia (Dinan et al., 2019) contains 18,430 training dialogues, 1,948 validation dialogues and 1,933 test dialogues on 1,365 topics. The test set is split into two subsets according to topics: Test Seen with 965 dialogues, and Test Unseen with 968 dialogues whose topics are never seen in the training and validation sets. There are about 61 sentences on average in the knowledge pool per turn, which are retrieved from Wikipedia based on the context. Holl-E (Moghe et al., 2018) contains 7,228 training dialogues, 930 validation dialogues and 913 test dialogues. There are two versions of the test set: one with a single golden reference, the other with multiple golden references. Each dialogue is assigned a document of about 60 sentences on average as the knowledge pool. Here, we use the modified version (Kim et al., 2020), which is adapted for knowledge selection.

Baseline Models
We compare our approach with a set of competitive baselines. TMN, short for E2E Transformer MemNet (Dinan et al., 2019), uses a Transformer memory network for knowledge selection and a Transformer decoder with copy mechanism for utterance prediction. Knowledge selection is trained on the knowledge label without a posterior distribution. TMN_BERT, short for TMN+BERT, implemented by Kim et al. (2020), replaces the Transformer memory network with a pre-trained BERT. PostKS (Lian et al., 2019) uses only the posterior knowledge distribution as a pseudo-label for knowledge selection. PostKS uses a GRU-based encoder and decoder without copy mechanism and does not use the knowledge label at the training stage. TMN_BERT+PostKS+CP, short for TMN+BERT+PostKS+Copy, implemented by Kim et al. (2020), additionally uses the posterior distribution in PostKS compared with TMN_BERT. SKT (Kim et al., 2020) is the current state-of-the-art latent variable model. Compared with TMN_BERT+PostKS+CP, SKT leverages the posterior distribution by sequential latent modeling.

Implementation Details
We validate the effectiveness of our approach based on the current state-of-the-art model SKT (Kim et al., 2020), using the same datasets and data-processing code. A shared encoder initialized with BERT_BASE (Devlin et al., 2019) is used to encode dialogue and knowledge sentences. Then, a 5-layer Transformer decoder with copy mechanism is used to generate the response. We use an FFN with 512 hidden dims to generate the posterior information in BOW format. The hidden size d is 768 and the vocabulary size |V| is 30,522. Each batch consists of dialogues rather than individual turns, and the batch size is 1. The hyperparameter λ in Equation 13 is set to 0.5 for all experiments without searching. The "fix" operation in Equation 11 is implemented by gradient stopping.
Models are trained end-to-end using the Adam optimizer (Kingma and Ba, 2014) with gradient clipping at 0.4 and a learning rate of 0.00002. We apply label smoothing (Pereyra et al., 2017), set to 0.1 for knowledge selection and 0.05 for response generation. We approximate the expectations in Equations 1 and 2 by drawing one sample with the Gumbel-Softmax function (Jang et al., 2016) with temperature τ = 0.1. We train the teacher for 5 epochs, then select the teacher according to the prior knowledge selection accuracy rather than the posterior selection accuracy, because the shared encoder and decoder may be over-optimized for the posterior selection module, which is not a good initialization for the student. We train other models for 20 epochs and select them according to the R-1 score, as the final goal is to generate high-quality responses.
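The single-sample approximation via Gumbel-Softmax can be sketched as below; this is a standard implementation of the trick, not extracted from the SKT codebase.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=0.1, rng=None):
    """Draw a near-one-hot relaxed sample from a categorical distribution,
    used to approximate the expectations in Equations 1 and 2 with one sample."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=len(logits))
    g = -np.log(-np.log(u))                       # Gumbel(0, 1) noise
    y = (np.asarray(logits, dtype=float) + g) / tau
    e = np.exp(y - np.max(y))
    return e / e.sum()
```

With a small temperature such as τ = 0.1, the sample is close to one-hot while remaining differentiable, and its argmax follows the underlying categorical distribution softmax(logits).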

Evaluation
Automatic Evaluation. Following Dinan et al. (2019) and Kim et al. (2020), we use accuracy (Acc) to evaluate knowledge selection, and use perplexity (PPL), unigram F1 (R-1) and bigram F1 (R-2) to evaluate the quality of response generation automatically. Following Kim et al. (2020), we remove all punctuation and the articles (a, an, the) before computing the R-1 and R-2 scores. Note that lower perplexity and higher R-1 and R-2 indicate better generation quality.
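The R-1 metric with the stated preprocessing can be computed as follows; this is a straightforward re-implementation under those rules, not the official scoring script.

```python
import re
from collections import Counter

def unigram_f1(hypothesis, reference):
    """Unigram F1 (R-1) between a generated response and the reference,
    removing punctuation and the articles a/an/the before matching."""
    def tokens(text):
        text = re.sub(r"[^\w\s]", " ", text.lower())
        return [w for w in text.split() if w not in {"a", "an", "the"}]
    hyp, ref = Counter(tokens(hypothesis)), Counter(tokens(reference))
    overlap = sum((hyp & ref).values())   # clipped unigram match counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

R-2 follows the same scheme over bigrams of the filtered token sequence.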
Human Evaluation. We first select 100 samples from each test set of the Wizard of Wikipedia dataset for human evaluation. Then, we ask three annotators to judge whether the response makes sense (Sensibleness) and whether it is specific (Specificity) given the dialogue context. Finally, we obtain Sensibleness and Specificity Average (SSA) scores, which penalize boring and vague responses and suit the goal of knowledge-grounded dialogue (Adiwardana et al., 2020). Moreover, compared with the Engagingness and Knowledgeability used in Dinan et al. (2019) and Kim et al. (2020), SSA, evaluated in 0/1 format, is more objective and easier to conduct (Adiwardana et al., 2020).
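SSA is simply the mean of the two binary judgement rates; a small helper (our own, with hypothetical inputs) makes the aggregation explicit.

```python
def ssa_score(judgements):
    """Sensibleness and Specificity Average over 0/1 annotator judgements.
    `judgements` is a list of (sensible, specific) pairs, one per response."""
    sens = sum(s for s, _ in judgements) / len(judgements)
    spec = sum(p for _, p in judgements) / len(judgements)
    return (sens + spec) / 2
```

In practice each response's 0/1 labels would first be aggregated across the three annotators (e.g., by majority vote) before this average is taken.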

Main Results
Quantitative Results. We report automatic results on the Wizard of Wikipedia dataset in Table 2 and make the following observations: (1) From rows 4 and 5 we can see that PIPM indeed provides necessary posterior information that is helpful for knowledge selection. (2) Comparing rows 4 and 6, we see that KDBTS brings a significant improvement in generation quality by removing the exposure bias of knowledge selection.
(3) The combination of PIPM and KDBTS achieves further improvement on most metrics, except the knowledge selection accuracy. We think the reason is that there may exist several similar knowledge sentences leading to the same response. As a result, SKT+PIPM+KDBTS may select a reasonable knowledge sentence but not the golden one to generate informative and fluent responses, because models are selected according to the generation quality rather than the selection accuracy (Kim et al., 2020).
Results in Table 3 lead to consistent observations on the Holl-E dataset: both PIPM and KDBTS bring performance improvement over the state-of-the-art latent variable model SKT, and their combination achieves further improvement, resulting in the new state-of-the-art performance.
Qualitative Results. We report human evaluation results of the generated responses in Table 4. We see that our approach brings consistent improvement over the state-of-the-art model SKT. Our approach can leverage the selected knowledge appropriately to generate more sensible and specific responses, which are fluent and informative.

Table 5: Several variants of posterior information.
  I = y_BOW: bag-of-words information in the response, i.e., its words.
  I = y_BOW^x = y_BOW − x_BOW: response words with the context words removed (default).
  I = yk_BOW^x: y_BOW^x combined with the words of the golden selected knowledge.
  I = y_Seq: the response as a word sequence.

Analysis
PIPM. We explore several variants of posterior information in Table 5 to better study this module. Besides the default y_BOW^x, we also explore two variants in BOW format: (1) y_BOW does not remove the context information; (2) yk_BOW^x additionally considers another source of posterior information, i.e., the golden selected knowledge k_t^a. We also consider the sequence information in y_Seq, as the BOW format discards the word order. Note that we use the FFN in Equation 8 to obtain the posterior information in BOW format, while we take H_t^x + h_t^PI as input and use a 3-layer Transformer decoder to generate y_Seq. Moreover, we perform an ablation study to investigate the effectiveness of the generated posterior information for prior knowledge selection: we remove the predicted posterior information Î in Equation 9, but still use L_PIPM in Equation 10 for comparison. These results are reported in the upper part (rows 1∼5) of Table 6 and the observations are as follows: (1) Compared with SKT in row 0, the variants of posterior information in rows 1∼4 bring improvement on selection accuracy, though some generation metrics are slightly lower because of the exposure bias of knowledge selection. (2) From rows 1 and 2, we see that the sequence information in y_Seq contributes to knowledge selection. However, it is inefficient to generate y_Seq word by word, and y_Seq is not better than y_BOW^x in row 4. There is no significant difference between y_BOW^x and yk_BOW^x, which combines another source of posterior information.
(3) We report the ablation result in row 5 to investigate the effectiveness of generated posterior information for knowledge selection. We see that this information improves the selection accuracy (compared with row 4).

KDBTS.
We investigate the training strategy in the lower part (rows 6∼9) of Table 6 and the observations are as follows: (1) Comparing rows 6 and 7, we see that SKT (KLfix)† is a good teacher with a much higher selection accuracy, because L_KL^fix in Equation 11 guarantees that the posterior selection module is not affected by the prior distribution. When using L_KL in Equation 2, SKT† achieves a lower accuracy (32.8 vs 52.0) and a lower KL divergence (0.31 vs 1.41), which indicates that the posterior module is affected by the prior distribution.
(2) Despite doing well in knowledge selection, SKT (KLfix) in row 8 performs worse in generation than SKT. Because SKT (KLfix) has a larger KL divergence (1.41 vs 0.31) than SKT, it has to generate the response with the knowledge selected from a much more unfamiliar prior distribution.
(3) As our KDBTS in row 9 does not suffer from the exposure bias of knowledge selection, the generation quality is improved significantly. We find, to our surprise, that the selection accuracy is also improved; we argue that KDBTS drives the prior distribution to approximate the posterior distribution as closely as possible by minimizing the KL term with the posterior distribution fixed.

Case Study
Table 7 shows two example responses on the Wizard of Wikipedia dataset. We see that the posterior information Î, e.g., "rock band" and "formed", provided by the PIPM is helpful for knowledge selection, although there is some noise, e.g., "american". Quantitative results in Section 5.1 show that this module is effective, although not perfect. The responses generated by SKT have repeated words in case 1 and do not reply directly in case 2. Our method can leverage the selected knowledge sufficiently to generate fluent and informative responses, because KDBTS removes the exposure bias of knowledge selection by training the decoder with knowledge selected in the same way as at inference.
Case 1
Context: I used to listen to the rolling stones a lot when I was a child.
Golden Knowledge: The rolling stones are an English rock band formed in London, England in 1962.
Human: Me too. I can't believe they formed in London as far back as 1962! What's your favorite song?
SKT: The rolling stones are an English rock band formed in London in London.
Ours:
  Posterior Information in BOW format (Top 20): [PAD] in the rock band , they formed is are ? of a you i ' ! from and american
  Response: I love the rolling stones! They are an English rock band formed in 1962.

Case 2
Context: How do teams score points?
Golden Knowledge: Points are primarily scored by advancing the ball into the opposing team's end zone for a touchdown or kicking the ball through the opponent's goalposts.
Human: Points are scored by advancing the ball into the opposing team's end zone for a touchdown or kicking it through the opponent's goalposts for a field goal. Good question!
SKT: The object is to score by getting the ball into the opposing goal.
Ours:
  Posterior Information in BOW format (Top 20): the players [PAD] of , is goal a ball and to team on it each in score points most field
  Response: Points are scored by advancing the ball into the opposing team's end zone.

Table 7: Two example responses on the Wizard of Wikipedia dataset.

Related Work
We mainly focus on knowledge selection in open-domain knowledge-grounded dialogue, and several works study this issue (Dinan et al., 2019; Lian et al., 2019; Kim et al., 2020). Here, we point out the gap between prior and posterior knowledge selection and try to deal with it. The PIPM is related to several lines of work. The Deliberation Decoder (Xia et al., 2017; Li et al., 2019) leverages two decoders for two-pass generation. We also use two decoders in PIPM (y_Seq), but the first decoder is used to generate posterior information for knowledge selection rather than for second-pass generation. The BOW loss, proposed by Zhao et al. (2017), has been adopted to supervise the posterior module (Lian et al., 2019). Here, we differ in three aspects: (1) We use the BOW loss for the prior module rather than the posterior module; (2) We use posterior information in BOW format to enhance the prior selection module; (3) We explore several BOW variants.
Our KDBTS is inspired by knowledge distillation (Hinton et al., 2015). Instead of using a more complex structure as the teacher for model compression, we treat the posterior selection module, which has access to additional input information (e.g., the response), as the teacher, and deal with the exposure bias of knowledge selection. Lite ELBO (Zhao et al., 2019a) was proposed to remove the exposure bias in the latent space for a different task. However, Lite ELBO naturally does not leverage the posterior distribution, as it sets the posterior module to be the same as the prior module. Our KDBTS is a two-stage training strategy that uses the posterior distribution as the teacher to guide the prior selection module and uses the knowledge selected from the prior distribution to train the decoder.

Conclusion
In this paper, we firstly analyze the gap between prior and posterior knowledge selection for open-domain knowledge-grounded dialogue. Then, we deal with it in two ways: (1) We enhance the prior selection module with the posterior information obtained by the PIPM, and we explore several variants of posterior information.
(2) We design the KDBTS to train the decoder with knowledge selected from the prior distribution, removing the exposure bias of knowledge selection. Experiments show that both PIPM and KDBTS improve over the state-of-the-art latent variable model and their combination achieves further improvement. In the future, we would like to explore three directions: (1) more efficient posterior information representations and corresponding prediction modules, (2) the interpretability of knowledge selection, and (3) knowledge selection without knowledge labels.