KLearn: Background Knowledge Inference from Summarization Data

The goal of text summarization is to compress documents to their relevant information while excluding background information already known to the receiver. So far, summarization researchers have given considerably more attention to relevance than to background knowledge. In contrast, this work puts background knowledge in the foreground. Building on the realization that the choices made by human summarizers and annotators contain implicit information about their background knowledge, we develop and compare techniques for inferring background knowledge from summarization data. Based on this framework, we define summary scoring functions that explicitly model background knowledge, and show that these scoring functions fit human judgments significantly better than baselines. We illustrate some of the many potential applications of our framework. First, we provide insights into human information importance priors. Second, we demonstrate that averaging the background knowledge of multiple, potentially biased annotators or corpora greatly improves summary scoring performance. Finally, we discuss potential applications of our framework beyond summarization.


Introduction
Summarization is the process of identifying the most important information pieces in a document. For humans, this process is heavily guided by background knowledge, which encompasses preconceptions about the task and priors about what kind of information is important (Mani, 1999).
Despite its fundamental role, background knowledge has received little attention from the summarization community. Existing approaches largely focus on the relevance aspect, which enforces similarity between the generated summaries and the source documents (Peyrard, 2019).

Figure 1: In the model of Peyrard (2019), S is similar to D (relevance, measured by a small KL(S||D)) but also brings new information compared to the background knowledge (informativeness, measured by a large KL(S||K)). We can infer the unobserved K from the choices unexplained by the relevance criterion.
In previous work, background knowledge has usually been modeled by simple aggregation of large background corpora. For instance, using TF·IDF (Sparck Jones, 1972), one may operationalize background knowledge as the set of words with a large document frequency in background corpora.
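This frequency-based operationalization can be sketched in a few lines (a toy illustration of the baseline idea, not code from the paper; the corpus and `top_k` are made-up inputs):

```python
from collections import Counter

def df_background(corpus, top_k=3):
    """Baseline background knowledge: the words with the largest
    document frequency in a background corpus (the TF-IDF-style
    operationalization discussed above)."""
    df = Counter()
    for doc in corpus:
        # Count each word at most once per document (document frequency).
        df.update(set(doc.lower().split()))
    return [word for word, _ in df.most_common(top_k)]

corpus = ["The cat sat", "the dog ran", "A cat ran"]
known_words = df_background(corpus, top_k=3)
```

The next paragraph explains why this proxy is imperfect: high document frequency is not the same as "already known".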
However, the assumption that frequently discussed topics reflect what is, on average, known does not necessarily hold. For example, commonsense information is often not even discussed (Liu and Singh, 2004). Also, information present in background texts has already gone through the importance filter of humans, e.g., writers and publishers. In general, a particular difficulty preventing the development of proper background knowledge models is its latent nature. We can only hope to infer it from proxy signals. Besides, there is, at present, no principled way to compare and evaluate background knowledge models.
In this work, we put the background knowledge in the foreground and propose to infer it from summarization data. Indeed, choices made by human summarizers and human annotators provide implicit information about their background knowledge. We build upon a recent theoretical model of information selection (Peyrard, 2019), which postulates that the information selected in a summary results from 3 desiderata: low redundancy (the summary contains diverse information), high relevance (the summary is representative of the document), and high informativeness (the summary adds new information on top of the background knowledge). The tension between these 3 elements is encoded in a summary scoring function θ_K that explicitly depends on the background knowledge K. As illustrated by Fig. 1, the latent K can then be inferred from the residual differences in information selection that are not explained by relevance and redundancy. For example, the black information unit in Fig. 1 is not selected in the summary despite being very prominent in the source document. Intuitively, this is explained if this unit is already known by the receiver. To leverage this implicit signal, we view K as a latent parameter learned to best fit the observed summarization data.
Contributions. We develop algorithms for inferring K in two settings: (i) when only pairs of documents and reference summaries are observed (Sec. 4.1), and (ii) when document-summary pairs are enriched with human judgments (Sec. 4.2).
In Sec. 5, we evaluate our inferred K's with respect to how well the induced scoring function θ_K correlates with human judgments. Our proposed algorithms outperform previous baselines by large margins.
In Sec. 6, we give a geometrical perspective on the framework and show that a clear geometrical structure emerges from real summarization data.
The ability to infer interpretable importance priors in a data-driven way has many applications, some of which we explore in Sec. 7. Sec. 7.1 qualitatively reveals which topics emerge as known and unknown in the fitted priors. Moreover, we can infer K based on different subsets of the data. By training on the data of one annotator, we get a prior specific to this annotator. Similarly, one can find domain-specific K's by training on different datasets. This is explored in Sec. 7.2, where we analyze 16 annotators and 15 different summarization datasets, yielding interesting insights, e.g., averaging several, potentially biased, annotator-specific or domain-specific K's results in systematic generalization gains.
Finally, we discuss future work and potential applications beyond summarization in Sec. 8. Our code is available at https://github.com/epfl-dlab/KLearn.

Related work
The modeling of background knowledge has received little attention from the summarization community, although the problem of identifying content words was already encountered in some of the earliest work on summarization (Luhn, 1958). A simple and effective solution came from the field of information retrieval, using techniques such as TF·IDF on background corpora (Sparck Jones, 1972). Similarly, Dunning (1993) proposed the log-likelihood ratio test to identify highly descriptive words. These techniques are known to be useful for news summarization (Harabagiu and Lacatusu, 2005). Later approaches include heuristics to identify summary-worthy bigrams (Riedhammer et al., 2010). Also, Hong and Nenkova (2014) proposed a supervised model for predicting whether a word will appear in a summary or not (using a large set of features, including global indicators from the New York Times corpus), which can then serve as a prior of word importance. Conroy et al. (2006) proposed to model background knowledge by aggregating a large random set of news articles. Delort and Alfonseca (2012) used Bayesian topic models to ensure the extraction of informative summaries. Finally, Louis (2014) investigated background knowledge for update summarization with Bayesian surprise.
These ideas have been generalized in an abstract model of importance (Peyrard, 2019) discussed in the next section.

Background
This work builds upon the abstract model introduced by Peyrard (2019), whose relevant aspects we briefly present here.
Let T be a text, mapped by a representation function to a semantic representation of the following form: the semantic representation is a probability distribution P over so-called semantic units {ω_j}_{j≤n}.
Many different text representation techniques can be chosen, e.g., topic models with topics as semantic units, or a properly renormalized semantic vector space with the dimensions as semantic units.
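For concreteness, a minimal sketch of such a mapping with words as semantic units (a toy illustration under our own simplifying choices, not the paper's implementation; the vocabulary and fallback behavior are assumptions):

```python
from collections import Counter

def text_to_distribution(text, vocab):
    """Map a text to a probability distribution over semantic units.

    Here the semantic units are simply the words of a fixed vocabulary;
    any other representation (e.g., topic proportions) could be plugged
    in instead, as discussed above."""
    counts = Counter(w for w in text.lower().split() if w in vocab)
    total = sum(counts.values())
    if total == 0:
        # Uniform fallback for texts with no in-vocabulary word (our choice).
        return {w: 1.0 / len(vocab) for w in vocab}
    return {w: counts[w] / total for w in vocab}

vocab = ["storm", "flood", "rescue", "said"]
P_D = text_to_distribution("The storm caused a flood , rescue teams said", vocab)
```

The resulting dictionary plays the role of P_D (or P_S, P_K) in the formulas below.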
In the summarization setting, the source document D and the summary S are represented by probability distributions over the semantic units, P_D and P_S. Similarly, K, the background knowledge, is represented as a distribution P_K over semantic units. Intuitively, P_K(ω_j) is high whenever ω_j is known. A summary scoring function θ_K(S, D) (or simply θ_K(S), since the document D is never ambiguous) can be derived from simple requirements:

θ_K(S, D) = -RED(S) + α · REL(S, D) + β · INF(S, K),   (2)

with RED(S) = -H(S), REL(S, D) = -KL(S||D), and INF(S, K) = KL(S||K). Here, RED captures the redundancy in the summary via the entropy H. REL reflects the relevance of the summary via the Kullback-Leibler (KL) divergence between the summary and the document. A good summary is expected to be similar to the original document, i.e., the KL divergence KL(S||D) should be low. Finally, INF models the informativeness of the summary via the KL divergence between the summary and the latent background knowledge K.
The summary should bring new information, i.e., the KL divergence KL(S||K) should be high.
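The scoring function above can be sketched as follows (with α = β = 1, the setting used later in the paper; the smoothing constant `eps` is our own addition to keep the divergences finite, not part of the original formulation):

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = p + eps
    return -np.sum(p * np.log(p))

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q))

def theta(S, D, K, alpha=1.0, beta=1.0):
    """theta_K(S, D) = -RED(S) + alpha*REL(S, D) + beta*INF(S, K),
    with RED(S) = -H(S), REL(S, D) = -KL(S||D), INF(S, K) = KL(S||K)."""
    red = -entropy(S)   # redundancy: negative entropy
    rel = -kl(S, D)     # relevance: negative divergence from the document
    inf = kl(S, K)      # informativeness: divergence from background knowledge
    return -red + alpha * rel + beta * inf
```

A summary matching the document while diverging from K scores high; a summary that merely restates the background knowledge scores low.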

The KLearn framework
As laid out, in our framework, texts are viewed as distributions over a choice of semantic units {ω j } j≤n . We aim to infer a general K as the distribution over these units that best explains summarization data. We consider two types of data: with and without human judgments.

Inferring K without human judgments
Assume we have access to a dataset {x_i} of pairs of documents D_i and their associated summaries S_i. Under the assumption that the S_i are good summaries (e.g., generated by humans), we infer the background knowledge K that best explains the observation of these summaries. Indeed, if these summaries are good, we assume that their information has been selected to minimize redundancy, maximize relevance, and maximize informativeness.
Direct score maximization. A straightforward approach is to determine the K that maximizes the θ_K score of the observed summaries. Formally, this corresponds to maximizing the function

F(K) = Σ_i θ_K(S_i) - γ · KL(P||K),   (3)

where KL(P||K) acts as a regularization term forcing K to remain similar to a predefined distribution P. Here, P can serve as a prior about what K should be. The factor γ > 0 controls the emphasis put on the regularization. A first natural choice for the prior P is the uniform distribution U over semantic units. In this case, we show in Appendix B that maximizing Eq. 3 yields the following simple solution for K (for m document-summary pairs and β = 1):

P_K(ω_j) ∝ γ - (1/m) Σ_i P_{S_i}(ω_j).

With the choice γ ≥ 1, note that P_K(ω_j) is always non-negative, as expected. This solution is fairly intuitive, as it simply counts the prominence of each semantic unit in human-written summaries and considers the ones often selected as interesting, i.e., as having low values in the background knowledge. We denote this technique as MS|U, to indicate maximum score with uniform prior. Surprisingly, it does not involve documents, whereas, intuitively, K should be a function of both the summaries and the documents. However, if such a simplistic model works well, it could be applied to broader scenarios where the documents may not even be fully observed.

Alternatively, we can choose the prior P to be the source documents {D_i}. Then, as shown in Appendix B, the solution becomes

P_K(ω_j) ∝ (1/m) Σ_i [γ · P_{D_i}(ω_j) - P_{S_i}(ω_j)].

Here, a conservative choice for γ ensuring the positivity of P_K is γ ≥ max_{i,j} P_{S_i}(ω_j) / P_{D_i}(ω_j). This model is also intuitive, as the resulting value of P_K(ω_j) is higher if ω_j is prominent in the documents but not selected in the summaries. This is, for example, the case for the black semantic unit in Fig. 1. Furthermore, choosing D as the prior implies viewing the documents as the only knowledge available and makes a minimal prior commitment as to what K should be. We denote this approach as MS|D.
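Both closed-form solutions can be sketched directly (a hedged illustration with β = 1; the array shapes and the default choice of γ for MS|D are our assumptions):

```python
import numpy as np

def ms_u(summaries, gamma=1.0):
    """MS|U: K proportional to (gamma - average summary distribution).

    summaries: (m, n) array of summary distributions over n units.
    gamma >= 1 keeps every entry non-negative."""
    k = gamma - summaries.mean(axis=0)
    return k / k.sum()

def ms_d(summaries, documents, gamma=None):
    """MS|D: K proportional to the average of (gamma*P_D - P_S) over pairs."""
    if gamma is None:
        # Conservative choice ensuring non-negativity (assumes no zero
        # document entries in this toy setting).
        gamma = np.max(summaries / documents)
    k = (gamma * documents - summaries).mean(axis=0)
    return k / k.sum()
```

Units that are prominent in summaries receive low P_K values ("unknown"), while units prominent in documents but avoided in summaries receive high values ("known").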
Probabilistic model. When directly maximizing the score of observed summaries, there is no guarantee that the scores of other, unobserved summaries remain low. A principled way to address this issue is to formulate a probabilistic model over the observations x_i = (D_i, S_i):

p(S_i | D_i; K) = exp(θ_K(S_i)) / Σ_{S ∈ Summ(D_i)} exp(θ_K(S)),

where the partition function is computed over the set Summ(D_i) of all possible summaries of document D_i. In practice, we draw random summaries as negative samples to estimate the partition function (4 negative samples for each positive). Then, K is learned to maximize the log-likelihood of the data via gradient descent. To enforce the constraint that K is a probability distribution, we parametrize K as the softmax of a vector k = [k_1, . . . , k_n] of scalars. The vector k is trained with mini-batch gradient descent to minimize the negative log-likelihood of the observed data. This approach is denoted as PM.
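A minimal sketch of the PM approach (our own simplified version: candidates are scored with the K-dependent informativeness term only, since the K-independent relevance and redundancy terms drop out of the gradient; the analytic gradient follows from the softmax parametrization):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def inf_score(S, K, eps=1e-12):
    # Informativeness KL(S||K): the only part of theta that depends on K.
    return np.sum(S * np.log((S + eps) / (K + eps)))

def fit_pm(positives, negatives, steps=150, lr=0.5, beta=1.0):
    """Infer K by maximum likelihood (PM approach, sketch).

    positives: (m, n) observed summary distributions.
    negatives: (m, num_neg, n) random summaries as negative samples.
    K = softmax(k) stays a valid distribution by construction."""
    n = positives.shape[1]
    k = np.zeros(n)
    for _ in range(steps):
        K = softmax(k)
        grad = np.zeros(n)
        for S_pos, negs in zip(positives, negatives):
            cands = np.vstack([S_pos[None, :], negs])
            scores = np.array([beta * inf_score(S, K) for S in cands])
            p = softmax(scores)          # model probability of each candidate
            expected = p @ cands         # model-expected candidate distribution
            # Analytic gradient of the negative log-likelihood w.r.t. k.
            grad += beta * (S_pos - expected)
        k -= lr * grad / len(positives)
    return softmax(k)
```

The gradient pushes P_K down on units that observed summaries use more than the model expects, which matches the intuition that selected content is "unknown".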

Inferring K with human judgments
Next, we assume a dataset annotated with human judgments. Observations now come in the form x_i = (D_i, S_i, h_i), where h_i is a human assessment of how good S_i is as a summary of D_i. We can use this extra information to enforce that high-scoring (low-scoring) summaries also have a high (low) θ_K score.
Regression. As a first solution, we propose regression, with the goal of minimizing the difference between the predicted θ_K scores and the corresponding human scores on the training set. More formally, the task is to minimize the following loss:

L(K) = Σ_i (a · θ_K(S_i) - h_i)²,

where a > 0 is a scaling parameter to put θ_K and h_i on a comparable range. To train K with gradient descent, we again parametrize K as the softmax of a vector of scalars (cf. Sec. 4.1). We denote this approach as HREG.
Preference learning. In practice, regression suffers from annotation inconsistencies. In particular, the human scores for some documents might be on average higher than for other documents, which easily confuses the regression. Preference learning (PL) is robust to these issues, as it learns only the relative ordering induced by the human scores (Gao et al., 2018). PL can be formulated as a binary classification task (Maystre, 2018), where the input is a pair of data points {(S_i, D_i, h_i), (S_j, D_j, h_j)} and the output is a binary flag indicating whether S_i is better than S_j, i.e., whether h_i > h_j. The loss for one pair is

l(σ(θ_K(S_i) - θ_K(S_j)), 1[h_i > h_j]),

where σ is the logistic sigmoid function and l can be, for example, the binary cross-entropy. Again, we perform mini-batch gradient descent to train k. We denote this approach as HPL.
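A sketch of HPL under the softmax parametrization (our own derivation of the analytic gradient: since only the informativeness term depends on K, the pairwise cross-entropy gradient reduces to (p - y) · β · (P_{S_j} - P_{S_i}); the toy hyperparameters are assumptions):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def kl(p, q, eps=1e-12):
    return np.sum((p + eps) * np.log((p + eps) / (q + eps)))

def entropy(p, eps=1e-12):
    return -np.sum((p + eps) * np.log(p + eps))

def theta(S, D, K, alpha=1.0, beta=1.0):
    # theta_K(S) = H(S) - alpha*KL(S||D) + beta*KL(S||K)
    return entropy(S) - alpha * kl(S, D) + beta * kl(S, K)

def fit_hpl(pairs, steps=300, lr=1.0, beta=1.0):
    """HPL sketch: fit k so that sigma(theta(S_i) - theta(S_j))
    predicts which of two summaries of a document humans preferred.

    pairs: list of (S_i, S_j, D, y) with y = 1 if h_i > h_j."""
    n = pairs[0][0].shape[0]
    k = np.zeros(n)
    for _ in range(steps):
        K = softmax(k)
        grad = np.zeros(n)
        for S_i, S_j, D, y in pairs:
            delta = theta(S_i, D, K, beta=beta) - theta(S_j, D, K, beta=beta)
            p = 1.0 / (1.0 + np.exp(-delta))  # logistic sigmoid
            grad += (p - y) * beta * (S_j - S_i)
        k -= lr * grad / len(pairs)
    return softmax(k)
```

Intuitively, K drifts away from the content of preferred summaries, so that their informativeness (and hence θ_K) rises.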

Comparison of approaches
To compare the usefulness of various K's, we need a way to evaluate them. Fortunately, there is a natural evaluation setup: (i) plug K into θ_K, the summary scoring function described by Eq. 2, (ii) use the induced θ_K to score summaries S_i, and (iii) compute the agreement with human scores h_i. To distinguish between the algorithms introduced in Sec. 4, we adopt the following naming convention for scoring functions: if the background knowledge K was computed using algorithm A, we denote the corresponding scoring function by θ_A; e.g., θ_HPL is the scoring function where K was inferred by HPL.
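The two evaluation measures used below can be sketched as follows (a simplified Kendall's τ that ignores ties, for illustration; in practice a library implementation such as scipy.stats.kendalltau would handle ties properly):

```python
import numpy as np
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau between two score lists (ties ignored for brevity)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def mean_rank(reference_scores, system_scores):
    """Mean rank of the reference summaries among all scored summaries
    when ranked by the scoring function (lower is better)."""
    all_scores = np.concatenate([reference_scores, system_scores])
    ranks = [1 + np.sum(all_scores > r) for r in reference_scores]
    return float(np.mean(ranks))
```

A good K yields θ_K scores with high τ against human judgments and a low mean rank for human-written references.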
Data. We use two datasets from the Text Analysis Conference (TAC) shared tasks: TAC-2008 and TAC-2009. They contain 48 and 44 topics, respectively. Each topic was summarized by about 50 systems and 4 humans. All system summaries and human-written summaries were manually evaluated by NIST assessors for readability, content selection with Pyramid (Nenkova and Passonneau, 2004), and overall responsiveness (Dang and Owczarzak, 2008a, 2009a). In this evaluation, we focus on the Pyramid score, as the framework is built to model the content selection aspect.
Semantic units. As in previous work (Peyrard, 2019), we use words as semantic units. In Sec. 7, we also experiment with topic models. However, different choices of text representation can easily be plugged into the proposed methods. Words have the advantage of being simple and directly comparable to existing baselines.

Baselines. For reference, we report the summary scoring functions of several baselines: LexRank (LR) (Erkan and Radev, 2004) is a graph-based approach whose summary scoring function is the average centrality of the sentences in the summary. ICSI (Gillick and Favre, 2009) scores summaries based on their coverage of frequent bigrams from the source documents. KL(S||D) and JS(S||D) (Haghighi and Vanderwende, 2009) measure divergences between the distribution of words in the summary and in the sources; JS divergence is a symmetrized and smoothed version of KL divergence. Additionally, we report the performance of choosing the uniform distribution for K (denoted θ_U) and an IDF baseline where K is built from document frequencies computed on the English Wikipedia (denoted θ_IDF). Finally, we report the performance of training and evaluating θ_HPL on all data (denoted Optimal); this measures the ability of HPL to fit the training data.
Results. Table 1 reports results of 4-fold cross-validation, averaged over all topics in both TAC-08 and TAC-09. The first column reports the Kendall's τ correlation between humans and the various summary scoring functions. The second column reports the mean rank (MR) of reference summaries among all summaries produced in the shared tasks, when ranked according to the summary scoring functions; thus, a lower MR is better. First, note that even techniques that do not rely on human judgments can significantly outperform previous baselines. The results of θ_MS|D are particularly strong, with large improvements despite the simplicity of the algorithm. Indeed, θ_MS|U and θ_MS|D run in time linear in the number of topics and much faster than any other algorithm (≈ 2 seconds on a single CPU to infer K from a TAC dataset). Despite being more principled, θ_PM does not outperform θ_MS|D.
Improvements over the baselines are also obtained by HPL, which leverages the fine-grained information of human judgments. However, even without benefiting from supervision, MS|D performs similarly to HPL, without significant difference. Also, as expected, the preference learning setup θ_HPL is stronger and more robust than the regression setup θ_HREG, which does not significantly outperform the uniform baseline θ_U. Therefore, we use HPL when human judgments are available and MS|D when only document-summary pairs are available.

A geometric view
Previously (see Fig. 1), we mentioned that a good K corresponds to a distribution such that the summary S is different from K (KL(S||K) is large) but still similar to the document D (KL(S||D) is small). Furthermore, the regularization term in Eq. 3, with P = D enforcing a small KL(D||K), makes minimal commitment as to what K should look like, i.e., no a priori information except the documents is assumed.
Viewing these distributions as points in Euclidean space, the optimal arrangement for S, D, and K is on a line with D in between S and K. Since human-written summaries S and documents D are given, inferring K intuitively consists in discovering the point in high-dimensional space matching this property for all document-summary pairs.
Interestingly, we can easily test whether this geometrical structure appears in real data with our inferred K. To do so, we perform a simultaneous multi-dimensional scaling (MDS) embedding of documents D_i, human-written summaries S_i, and K. In this space, two distributions are close to each other if their KL divergence is low. We plot such an embedding in Fig. 2 for 6 randomly chosen topics from TAC-09 and K inferred by HPL. We indeed observe documents, summaries, and K nicely aligned, such that the summaries are close to their documents but far away from K. This finding also holds for K inferred by MS|D. These observations are important for two reasons. (1) They show that the general framework introduced in Fig. 1 is an appropriate model of the summarization data: for any given topic, the reference summaries are arranged on one side of the document. They deviate from the document in a systematic way that is explained by the repulsive action of the background knowledge. Human-written summaries contain information from the document but not from the background knowledge, which puts them on the border of the space. (2) Our models can be seen to infer an appropriate background knowledge that is common to a wide spectrum of topics, as shown by the fact that K occupies the central point in the embedding of Fig. 2.
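The embedding step can be approximated with classical MDS on a symmetrized-KL dissimilarity matrix (a numpy-only stand-in for the MDS used for Fig. 2; the symmetrization and the eigenvalue clamping are our own simplifications):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence, used as a rough dissimilarity."""
    def kl(a, b):
        return np.sum((a + eps) * np.log((a + eps) / (b + eps)))
    return 0.5 * (kl(p, q) + kl(q, p))

def classical_mds(points, dim=2):
    """Classical MDS: embed items so Euclidean distances approximate
    the given dissimilarities."""
    n = len(points)
    D2 = np.array([[sym_kl(p, q) ** 2 for q in points] for p in points])
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J                       # double centering
    vals, vecs = np.linalg.eigh(B)              # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:dim]          # keep the top `dim`
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))
```

Feeding all P_{D_i}, P_{S_i}, and P_K into `classical_mds` produces a 2D layout in which nearby distributions have low divergence, as in Fig. 2.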

Applications
We now investigate some applications arising from our framework. As K is easily interpretable, we explore which units receive high or low scores. One can also use different subsets (or aggregations) of the training data. Here, we look into annotator-specific K's and domain-specific K's.

Table 2: Examples of words "known" and "unknown" according to the best K inferred by HPL. A word ω_j is "known" ("unknown") according to K when P_K(ω_j) is high (low).

Qualitative analysis
To understand what is considered as "known" (P K (ω j ) is high) or "unknown" (P K (ω j ) is low), we fit our best model, HPL, using all TAC data for two choices of semantic units: (i) words and (ii) LDA topics trained on the English Wikipedia (40 topics).
In Table 2 we report the top "known" and "unknown" words. Frequent but uninformative words like 'said' or 'also' are considered known and thus undesired in the summary. On the contrary, unknown words are low-frequency, specific words that summarization systems systematically failed to extract although they were important according to humans. We emphasize that the inferred background knowledge encodes different information than a standard IDF. We provide a detailed comparison between K and IDF in Appendix E.
When using a text representation given by a topic model trained on Wikipedia, we obtain the top 3 most "known" topics. Given that these topics tend to be the most frequent in news datasets, K trained with human annotations learns to penalize systems overfitting on the frequency signal within source documents. On the contrary, the series, games, and university topics receive low scores and should be extracted more often by systems to improve their agreement with humans.

Inferring annotator-and domain-specific background knowledge
Within the TAC datasets, the annotations are also tagged with an annotator ID. It is thus possible to infer background knowledge specific to each annotator by applying our algorithms on the subset of annotations performed by the respective annotator. In TAC-08 and TAC-09 combined, 16 annotators are identified, resulting in 16 different K's.

Instead of analyzing only news datasets with human annotations (like TAC), we can infer background knowledge from any summarization dataset in any domain, as long as document-summary pairs are observed. To illustrate this, we consider a large collection of datasets covering domains such as news, legal documents, product reviews, Wikipedia articles, etc. These do not contain human annotations, so we employ our MS|D algorithm to infer a K specific to each dataset. A detailed description of these datasets is given in Appendix C.

Structure of differences.
To visualize the differences between annotators, we embed them in 2D using MDS, with two annotators being close if their K's are similar. In Fig. 3(a), each annotator is a dot whose size is proportional to how well its K generalizes to the rest of the TAC datasets, as evaluated by the correlation (Kendall's τ) between the induced θ_K and human judgments. The same procedure is applied to domains and is depicted in Fig. 3(b). News datasets appear at the center of all domains, meaning that the news domain can be seen as an "average" of the peripheral non-news domains. Furthermore, the K's trained on different news datasets are close to each other, indicating a good level of intra-domain transfer; and, unsurprisingly, news datasets also exhibit the best transfer performance on TAC.
Improvements due to averaging. Based on the previous observations, we hypothesize that averaging different annotator-specific K's can lead to better correlation with human judgments on the unseen part of the TAC dataset. Similarly, since news datasets generalize better than other domains, we hypothesize that averaging domain-specific K's may also result in improved correlations with humans in the news domain.
In Fig. 4(a), we report the improvements in correlation with human judgments on TAC (news domain) resulting from averaging an increasing number of annotators or domains. The error bars represent 95% confidence intervals arising from selecting a different subset to compute the average. As we see, increasing the number of annotators averaged results in clear and significant improvements. Since the error bars are small, which annotators are included in the averaging has little impact on the results. Similarly, averaging different domains also results in significant improvements. In particular, averaging several non-news domains gives better generalization to the news domain.
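The averaging step itself is straightforward (a sketch; the re-normalization is a safeguard, since an average of valid distributions already sums to 1):

```python
import numpy as np

def average_knowledge(ks):
    """Average several (possibly biased) annotator- or domain-specific
    background-knowledge distributions into a single K.

    ks: (num_sources, n) array, one distribution per annotator/domain."""
    k = np.mean(ks, axis=0)
    return k / k.sum()
```

The averaged K can then be plugged back into θ_K and evaluated exactly as in Sec. 5.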
Furthermore, Fig. 5 shows, in the GloVe (Pennington et al., 2014) embedding space, the K's resulting from averaging (a) all annotators (K's inferred by HPL), (b) all news datasets (K's inferred by MS|D), and (c) all non-news datasets (K's inferred by MS|D) in comparison to (d) the optimal K learned with HPL trained on all data from TAC datasets. To produce these visualizations, we perform a density estimation of the K's in the 2D projection of word embeddings.
All averaged K's tend to be similar to the optimal K. This indicates that a single prior produces strong results on the news datasets and that it can be obtained by averaging many biased but different K's. This is further confirmed by Fig. 4(b), where the distance to the optimal K (measured in terms of KL divergence) significantly decreases as more annotators are averaged.

Conclusion
We focused on the often-ignored background knowledge in summarization and inferred it from implicit signals left by human summarizers and annotators. We introduced and evaluated different approaches, observing strong abilities to fit the data.
The newly-gained ability to infer interpretable priors on importance in a data-driven way has many potential applications. For example, we can describe which topics should be extracted more frequently by systems to improve their agreement with humans. Using pretrained priors also helps systems to reduce overfitting on the frequency signal within source documents as illustrated by initial results in Appendix D.
An important application made possible by this framework is to infer K on any meaningful subset of the data. In particular, we learned annotator-specific K's, which yielded interesting insights: some annotators exhibit large differences from the others, and averaging several, potentially biased K's results in generalization improvements. We also inferred K's from different summarization datasets and again found increased performance on the news domain when averaging K's from diverse domains.
For future work, different choices of semantic units can be explored, e.g., learning K directly in the embedding space. Also, we fixed α = β = 1 to get comparable results across methods, but including them as learnable parameters could provide further performance boosts. Investigating how to infuse the fitted priors into summarization systems is another promising direction.
More generally, inferring K from a commonsense task like summarization can provide insights about general human importance priors. Inferring such priors has applications beyond summarization, as the framework can model any information selection task.

Figure 6: Visualization of each annotator's K based on a 2D projection of GloVe word embeddings.

A 2D Visualization of K
For each annotator and each domain, we produce visualizations in the 2D embedding space with the same procedure as in Fig. 5. Fig. 6 depicts the annotators and Fig. 7 the domains. It is interesting to observe much more diversity among the domains: the domain-specific K's are more spread out in the semantic space, reflecting the greater diversity of topics discussed across domains. In contrast, each annotator's K is inferred from the TAC datasets, which all belong to the same domain (news).

B Derivation of Approaches
The direct score maximization model consists in maximizing

F(K) = Σ_i θ_K(S_i) - γ · KL(P||K).

We use Lagrange multipliers with the constraint that K is a valid distribution:

L = Σ_i θ_K(S_i) - γ · KL(P||K) - λ (Σ_j P_K(ω_j) - 1).

(MS|U) First, consider P = U, the uniform distribution, and γ > 0. Only the terms KL(S_i||K) and KL(U||K) depend on K, with derivatives

∂/∂P_K(ω_j) [β · KL(S_i||K)] = -β · P_{S_i}(ω_j) / P_K(ω_j),
∂/∂P_K(ω_j) [-γ · KL(U||K)] = γ · U(ω_j) / P_K(ω_j).

Setting the Lagrange derivative to 0 yields

P_K(ω_j) = (1/λ) (γ · U(ω_j) - β Σ_i P_{S_i}(ω_j)),

where λ is the normalizing constant. Absorbing the constant factors (the number of units n and the number of pairs m) into γ, this can be rewritten as

P_K(ω_j) ∝ γ - (β/m) Σ_i P_{S_i}(ω_j).

In particular, when β = 1, choosing γ ≥ 1 ensures that for all ω_j we have P_K(ω_j) ≥ 0.
(MS|D) Second, we consider the case where P is the document distribution, with γ > 0. The prior now changes with every document-summary pair, and L becomes

L = Σ_i [θ_K(S_i) - γ · KL(D_i||K)] - λ (Σ_j P_K(ω_j) - 1).

Then, only the derivative concerning the KL(D_i||K) terms is modified and becomes

∂/∂P_K(ω_j) [-γ · KL(D_i||K)] = γ · P_{D_i}(ω_j) / P_K(ω_j),

which gives the following solution after setting the Lagrange derivative to 0:

P_K(ω_j) ∝ Σ_i [γ · P_{D_i}(ω_j) - β · P_{S_i}(ω_j)].

Here, it is not immediately clear that P_K(ω_j) is positive for every unit. To avoid this issue, notice that (with β = 1) we can choose γ ≥ max_{i,j} P_{S_i}(ω_j) / P_{D_i}(ω_j).

C Datasets
The summarization track at the Text Analysis Conference (TAC) was a direct continuation of the DUC series. In particular, the main tasks of TAC-2008 (Dang and Owczarzak, 2008b) and TAC-2009 (Dang and Owczarzak, 2009b) were multi-document news summarization tasks. The CNN/Daily Mail dataset (Hermann et al., 2015) has been decisive in the recent development of neural abstractive summarization (See et al., 2017; Paulus et al., 2017; Cheng and Lapata, 2016). It contains CNN and Daily Mail articles together with bullet-point summaries. Zopf et al. (2016) also viewed the high-quality Wikipedia featured articles as summaries, for which potential sources were automatically searched on the web.
P.V.S. et al. (2018) recently crawled the liveblog archives from the BBC and The Guardian, together with bullet-point summaries reporting the main developments of the events covered.
To evaluate their opinion-oriented summarization system, Ganesan et al. (2010) constructed the Opinosis dataset. It contains 51 articles discussing the features of commercial products (e.g., iPod's Battery Life).
Furthermore, we consider the large PubMed dataset (Cohan et al., 2018), a collection of scientific publications.
The Reddit dataset (Kim et al., 2019) has been collected on popular sub-reddits.
The AMI corpus (Carletta et al., 2005) is a standard meeting summarization dataset. Koupaee and Wang (2018) automatically crawled the WikiHow website, using the self-reported bullet points as summaries.
The XSUM dataset (Narayan et al., 2018) is a large collection of news articles with a focus on abstractive summaries.
To measure the effect of information distortion in summarization cascades of scientific results, Horta Ribeiro et al. (2019) collected manual summaries of various lengths.
We also included the LegalReport dataset (Galgani et al., 2012) where the task is to summarize legal documents.

D Extracting Summaries: Example
Once K is specified, the summary scoring function θ_K can be used to extract summaries. For extractive summarization, this is an optimal subset selection problem (McDonald, 2007). Unfortunately, θ_K is not linear and cannot be optimized with Integer Linear Programming. It is also not submodular and cannot be optimized with the greedy algorithm for submodular functions. We thus have to rely on generic optimization techniques, which make no assumption about the objective function and can approximately optimize any arbitrary function. We use the genetic algorithm proposed by Peyrard and Eckle-Kohler (2016), which creates and iteratively improves summaries over time. We denote as (θ_K, Gen) the summarization system approximately solving the subset selection problem. We compare 3 systems: K inferred by MS|D, K inferred by HPL, and K set to the uniform distribution. For reference, we report the standard summarization baselines described in the previous section. The summaries are evaluated with 2 automatic evaluation metrics: ROUGE-2 recall with stopwords removed (R-2) (Lin, 2004) and a recent BERT-based evaluation metric (MOVER) (Zhao et al., 2019). The results, reported in Table 4, are encouraging, since the systems based on the learned priors outperform the uniform prior. They also perform well in comparison to the baselines. The inferred prior can benefit systems by preventing them from overfitting on the frequency signal.

E Comparison between K and IDF

To verify that our inferred K contains different information from IDF, we compare IDF and our optimal K (see Sec. 7). To be comparable, the IDF weights need to be renormalized, as the IDF weights of known (unknown) words would be low (high), whereas P_K would be high (low). Thus, we compute (1/C)(1 - IDF(ω_j)) for each word ω_j, where C = Σ_j (1 - IDF(ω_j)).
In Fig. 8a, we represent the full distributions over all words in the support of K and show the absolute difference with the renormalized IDF weights. Furthermore, Fig. 8b is a scatter plot where each dot represents a word and the coordinates are its IDF and K weights. The low correlation between the two indicates that K learns a different signal than IDF.
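The renormalization step can be sketched as follows (assuming, for this toy illustration, that the IDF weights are already rescaled to [0, 1] so that flipping them keeps all values non-negative):

```python
import numpy as np

def renormalize_idf(idf):
    """Put IDF weights on the same scale as P_K: known words should get
    high mass, so we compute (1 - IDF) / C with C = sum_j (1 - IDF(w_j)).
    Assumes IDF weights rescaled to [0, 1]."""
    flipped = 1.0 - np.asarray(idf, dtype=float)
    return flipped / flipped.sum()

def pearson_corr(a, b):
    """Pearson correlation between two weight vectors, as a simple
    numeric proxy for the scatter-plot comparison in Fig. 8b."""
    return float(np.corrcoef(a, b)[0, 1])
```

Comparing `renormalize_idf(idf)` against the learned P_K with `pearson_corr` gives a single number summarizing how much of K is explained by document frequency alone.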