Reconstructing Implicit Knowledge with Language Models

In this work we propose an approach for generating statements that explicate implicit knowledge connecting sentences in text. We make use of pre-trained language models which we refine by fine-tuning them on specifically prepared corpora that we enriched with implicit information, and by constraining them with relevant concepts and connecting commonsense knowledge paths. Manual and automatic evaluation of the generations shows that by refining language models as proposed, we can generate coherent and grammatically sound sentences that explicate implicit knowledge which connects sentence pairs in texts – on both in-domain and out-of-domain test data.


Introduction
In everyday communication and in texts, people usually omit information that seems clear and evident, so that only part of the message needs to be expressed in words. Consider the following example: (1-i) Students should be allowed to use computers during the lectures, (1-ii) even though that bears the risk that they are writing emails instead of listening to the teacher.
To understand the connection between (i) and (ii), we must know that Computers are used for sending emails, or that Lectures are given by teachers. Such implicit knowledge can easily be inferred by humans, since it is part of their background knowledge. By contrast, implicitness in texts poses a challenge for computational systems.
In this work we propose an approach for generating implicit knowledge sentences in between contiguous sentences, which explicate their logical connection, utilizing pre-trained language models (LMs) that we refine as follows: (i) we inject 'explanatory' knowledge by fine-tuning LMs on specifically prepared corpora, and (ii) we condition text generation through constraints in the form of relevant concepts and connecting commonsense knowledge paths. Our work is inspired by the recent success of pre-trained LMs (Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019a) in various downstream NLP tasks, including text generation and NL inference (Wang et al., 2018). However, for the task of reconstructing implicit knowledge, such LMs need to be carefully guided, not only to yield coherent statements, but also to ensure that they convey the missing, implicit information that connects given sentences in a text. To this end we create corpora with sentence pairs enriched with implicit information, based on Generics-KB (Bhakthavatsalam et al., 2020) and e-SNLI (Camburu et al., 2018), which we use for LM fine-tuning. For improved performance we explore methods of constrained language generation, guiding the model by way of relevant concepts and connecting commonsense knowledge paths.
We aim to build a system that is not limited to specific text genres or knowledge domains, and thus evaluate our models in-domain, on testsets from our fine-tuning corpora, and out-of-domain, using IKAT (Becker et al., 2020), an argumentative corpus which offers sentence pairs annotated with implicit knowledge that connects them.
A central contribution of this work is an in-depth evaluation of the quality of generations delivered by different model variants, and of their ability to express implicitly conveyed knowledge. We propose a manual evaluation setup covering four dimensions – grammaticality, coherence, content, and comparison to gold references – and compare these to various automatic evaluation metrics. Our experiments show that with our proposed approach we can generate coherent sentences that explicate implicit knowledge connecting given sentence pairs, and that current text generation metrics are not sufficient to evaluate this challenging task.
Our contributions are: (i) We empirically compare different types of LMs, exploring which model is best suited for the task of generating sentences that express implicit information between sentences. (ii) We create datasets that include implicit information holding between sentence pairs, which we use for fine-tuning our LMs, and which can be used for general commonsense reasoning tasks. (iii) We propose a method for constrained generation by injecting concepts or commonsense knowledge paths as language modeling constraints, and show that key concepts, and even more so knowledge paths, improve the quality of generations. (iv) We carefully evaluate the quality of the generated implicit knowledge sentences, both manually and automatically, and discuss strengths and limitations of automatic similarity metrics. 1

Related Work
Recent progress in pretraining LMs on large text corpora has led to improvements on various downstream NLP tasks. It has also been shown that knowledge acquired during pre-training can be leveraged by fine-tuning these models for advanced semantic inference or NL generation tasks (Wang et al., 2018). Recently, pre-trained LMs have been augmented with external knowledge from commonsense knowledge bases such as ConceptNet, which provides more explicit knowledge grounding and improves their performance on downstream tasks that require reasoning abilities. Wang et al. (2020b), for example, retrieve multi-hop knowledge paths from ConceptNet for fine-tuning LMs for multiple choice question answering. Chang et al. (2020) and Bosselut et al. (2021) incorporate knowledge paths from ConceptNet into pre-trained LMs for solving the SocialIQA task. However, all these approaches evaluate the effectiveness of integrating commonsense knowledge indirectly on downstream tasks, and do not explicitly evaluate the impact and relevance of knowledge for a specific system prediction. We address this shortcoming by generating and carefully evaluating statements that connect pairs of sentences as explanations of their underlying, implicit knowledge link. Closest to this aim is the task of explanation generation, which has received attention very recently. Wang et al. (2020a) propose SemEval-2020 Task 4 (Subtask C), which is to generate a natural language explanation for why a statement does not make sense. A comparison of the participating systems (cf. Perumal et al., 2020; Jon et al., 2020) shows that pre-trained LMs play a central role in the success of the top-performing systems, demonstrating that they contain commonsense information to a good extent. The success of models enriched with knowledge from external sources such as ConceptNet furthermore shows that additional knowledge supports the generation of commonsense explanations.
However, there is still a large gap between systems and human performance.
Pre-trained LMs enhanced with commonsense knowledge have also been the models of choice for other text generation tasks, e.g. dialogue generation (Zhou et al., 2018), story ending generation (Guan et al., 2020), or abductive NLI (Ji et al., 2020b). While these models aim at generating explanations for a single statement, or completing a given sequence of sentences, we investigate how to make use of LMs to generate a sentence that fills in implicit knowledge between two sentences.
Constraining LMs. Recent work addresses how to control content in LM text generation while maintaining fluency, coherence and plausibility of the generated text. Lin et al. (2020) explore how to generate a coherent and plausible situation description given an unordered set of concepts as input, and find that even pre-trained LMs (BART, T5) fine-tuned on this task cannot solve it: the generated sentences are grammatical, but highly implausible, lacking commonsense. This suggests that either the underlying LMs or the input constraints for generation need to incorporate commonsense knowledge. Orbach and Goldberg (2020) attempt to control the content when generating longer stories by specifying facts the story needs to include. They propose a plan-and-cloze model that first creates a cloze template, placing input facts at fixed positions in the output. In the cloze step, the system expands the fact tokens into complex sentences that complete the story. While uni-directional LMs such as GPT-2 or BART generate fluent text but do not adhere well to the desired content, the fine-tuned multi-directional XLNet outputs coherent text and adheres to the facts.
While none of the above works incorporate external knowledge to guide generation, Ji et al. (2020a) perform explanation generation for single statements, using ConceptNet background knowledge. The model selects concepts from the statement, retrieves connecting paths from ConceptNet, and selects bridge concepts from a subgraph. A pre-trained decoder generates the explanation, using as input the statement and top-ranked concepts from the subgraph. In our work we also select concepts from texts, but dynamically generate commonsense knowledge paths as constraints. Importantly, we aim to generate coherent explanations in between sentences – a challenge for uni-directional LMs.
Knowledge-Constrained Text Generation

Task Definition and Approach
The task we tackle in this work is: given two contiguous sentences (source sentences S 1 , S 2 ), generate an explanatory sentence (target sentence T ) that explains the underlying, implicit information that connects them. We explore different types of LMs and their aptness for solving this task. We fine-tune them on existing or adapted datasets to inject relevant knowledge, and add key concepts or connecting knowledge-paths as constraints to achieve coherent and informative explanations.
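The instance format just described can be made concrete with a small sketch. The field names and the example values below are ours (illustrating example (1) from the introduction), not taken from the released data:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    s1: str      # first source sentence S1
    s2: str      # second source sentence S2
    target: str  # explanatory target sentence T (gold reference)
    c1: str      # key concept extracted from S1
    c2: str      # key concept extracted from S2

# Hypothetical instance mirroring example (1) from the introduction.
ex = Instance(
    s1="Students should be allowed to use computers during the lectures,",
    s2="even though that bears the risk that they are writing emails "
       "instead of listening to the teacher.",
    target="Computers are used for sending emails.",
    c1="computers",
    c2="emails",
)
```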

Types of Language Models
We compare three types of LMs: GPT-2 (Radford et al., 2019), an autoregressive model which generates the output sequence from left to right; XLNet (Yang et al., 2019b), a bidirectional generalized autoregressive LM; and BART (Lewis et al., 2019), a seq2seq model with a bidirectional masked encoder and a left-to-right decoder. While GPT-2 and BART generate the next tokens seeing only the left (previous) context, XLNet predicts the next tokens based on the left and right context, in a random order. GPT-2 is pre-trained on web pages from Com-monCrawl, XLNet on CommonCrawl+ClueWeb (Callan et al., 2009), and BART on the CNN/DM summarization dataset (Hermann et al., 2015).

Fine-tuning LMs
Task-adapted Datasets for LM Fine-tuning. All chosen LMs are pre-trained on information that is explicit in text. To condition them to generate implicit information that connects sentences, we fine-tune them on datasets that include knowledge statements connecting contiguous sentence pairs. We create two such corpora: one based on Generics-KB (Bhakthavatsalam et al., 2020), which offers statements expressing generic knowledge; the other on e-SNLI (Camburu et al., 2018), which comprises explanations of inferential commonsense knowledge. Each data instance contains two source sentences S 1 , S 2 , a target sentence T , and two key concepts c 1 , c 2 which we extract from the original data as described below. For examples see Table 1.
Generics-KB contains naturally occurring generic sentences crawled from the web using linguistic rules and BERT-based scoring. It is rich in high-quality statements that express generic knowledge. Each generic sentence occurs in its surrounding context (1–5 sentences before/after), hence each instance forms a triple consisting of the context before (C b ), the generic sentence (GS) and the context after (C a ). We collect all instances where a phrase p 1 (NP, VP, ADJP or ADVP) from GS also occurs in C b , and another phrase p 2 from GS occurs in C a . For each instance we extract the sentence containing p 1 and the one containing p 2 as our source sentences S 1 , S 2 ; GS as our target sentence T ; and p 1 and p 2 as key concepts c 1 , c 2 .

e-SNLI is an extension of the SNLI dataset (Bowman et al., 2015), additionally annotated with explanations: Given a premise-hypothesis pair and the relation between them (entailment, contradiction, or neutral), annotators added natural language sentences that explain why the pair is in the relation. Annotators had to mark essential key phrases for the relation in premise and hypothesis, and had to formulate explanations that employ these key phrases. For fine-tuning and testing our models, we consider all instances labelled with entailment and contradiction relations (but do not include the labels in fine-tuning). We interpret premise and hypothesis as our source sentences S 1 and S 2 , the explanation as our target sentence T , and the marked key phrases as our key concepts c 1 and c 2 .
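The Generics-KB extraction heuristic described above can be sketched as follows. This is a simplified version: it assumes the candidate phrases of GS have already been chunked (the paper matches NP/VP/ADJP/ADVP chunks) and it uses plain case-insensitive substring matching; all names and data are ours.

```python
def extract_instance(context_before, gs, context_after, phrases):
    """Return a training instance (S1, S2, T, c1, c2) if one phrase of
    the generic sentence GS occurs in the context before it and another
    phrase occurs in the context after it; otherwise return None."""
    p1 = next((p for p in phrases
               if any(p.lower() in s.lower() for s in context_before)), None)
    p2 = next((p for p in phrases
               if p != p1 and any(p.lower() in s.lower() for s in context_after)), None)
    if p1 is None or p2 is None:
        return None
    # The sentences containing the matched phrases become S1 and S2.
    s1 = next(s for s in context_before if p1.lower() in s.lower())
    s2 = next(s for s in context_after if p2.lower() in s.lower())
    return {"S1": s1, "S2": s2, "T": gs, "c1": p1, "c2": p2}

# Toy usage with invented context sentences:
inst = extract_instance(
    context_before=["Students bring computers to class."],
    gs="Computers are used for sending emails.",
    context_after=["Some students spend the hour sending emails."],
    phrases=["computers", "sending emails"],
)
```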
In- and Out-Of-Domain Test Sets. We test the resulting models in-domain, on testsets from our fine-tuning corpora, and out-of-domain, on the IKAT dataset (Becker et al., 2020), which is based on the argumentative Microtexts Corpus (Peldszus and Stede, 2015). For all sentence pairs S 1 and S 2 that are adjacent or argumentatively related, annotators added the implicit knowledge that connects them, using simple sentences, which we use as targets T . They also marked two key phrases in each implicit knowledge sentence, where in most cases one key phrase appears in the first source sentence and the other in the second – which we interpret as key concepts c 1 and c 2 in our approach.

Constraining Explanation Generation
Our hypothesis is that unconditioned generation may not be sufficient to produce statements carrying relevant knowledge which explains the connection between two sentences. Hence we experiment with direct injection of constraints or triggers to guide the generation to emit meaningful and coherent implicit knowledge statements: We include (i) key concepts as offered by each dataset, since we expect them to direct the model towards concepts that are relevant for explaining how the two sentences are related. We also include (ii) relational knowledge between the key concepts as constraints, by establishing multi-hop knowledge paths between them. To this end we combine relation classification and target prediction models specifically adapted to ConceptNet. The two respective models are based on LMs fine-tuned on ConceptNet (Speer et al., 2017), a large network that represents commonsense facts. 2 We generate single- and multi-hop paths between key concepts from a sentence pair, and use these paths as constraints when generating target sentences. We expect the generated paths to provide useful relational information for the model. Example paths appear in Table 1.

Table 1: Source sentence pairs and target sentences (reference) from our three datasets, with marked key concepts and automatically predicted knowledge paths between them. (BL = baseline without constraints, +c = constrained with key concepts, +p = constrained with a knowledge path.)

Gen-KB
BL: Patients often report back to the clinic with a worsening pain condition within one to two hours of first assessment.
+c: Patients often have few if any symptoms at first, but pain becomes less intense and less frequent in coming hours.
+p: Patients are admitted to the hospital with moderate to high intensity pain.

e-SNLI
BL: A busy city that looks like new york city has a lot of people in it, so the city has to have a lot to people in the city.
+c: The city has a lot of people in it because it is a busy city.
+p: A busy city implies that there are a lot of people in the city.

IKAT
BL: The state and society must be found if a university lacks the funds to provide education and training.
+c: The state and the society must pay for education and training if the university lacks the funds.
+p: If a university lacks the funds, it can not be providing education and training to its students.
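The idea of connecting key concepts via multi-hop knowledge paths can be illustrated with a toy sketch. Note that the paper generates paths dynamically with relation classification and target prediction models fine-tuned on ConceptNet; here, purely for illustration, we search a small static ConceptNet-style edge list (our own invented triples) with breadth-first search:

```python
from collections import deque

# Toy ConceptNet-style graph: head -> [(relation, tail), ...]
EDGES = {
    "computer": [("UsedFor", "sending email")],
    "sending email": [("HasSubevent", "writing")],
    "lecture": [("HasA", "teacher")],
}

def find_path(start, goal, max_hops=2):
    """Breadth-first search for a relation path from start to goal,
    returned as a list of (head, relation, tail) triples."""
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue
        for rel, nxt in EDGES.get(node, []):
            queue.append((nxt, path + [(node, rel, nxt)]))
    return None
```

A found path such as `computer UsedFor sending email` can then be linearized and appended to the model input as the +p constraint.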

Data and Experimental Setup
Datasets. We use the data from GenericsKB and e-SNLI for fine-tuning and testing models (in-domain), and IKAT for testing out-of-domain. 3 For statistics see Table 3. All instances contain two source sentences S 1,2 , a target sentence T , and two key concepts c 1,2 , where c 1 ∈ S 1 , c 2 ∈ S 2 , and c 1,2 ∈ T . We experiment with c 1,2 , and with paths p generated between c 1 and c 2 , as constraints, which we establish as explained above.

Input Sequences. We build the input sequences by concatenating the source sentences S 1 and S 2 , separated by a SEP token. When including key concepts c 1,2 or knowledge paths p as constraints, we append them to the input sequence right after S 1 and S 2 , again separated by SEP tokens. Thus, the concepts and paths we use as constraints are encoded by the tokenizer of each language model together with the rest of the input sequence.

Fine-tuning LMs. For LM fine-tuning, we append the target sentence to the input sequence, separated from the rest of the input by an EOT tag. GPT-2 and XLNet are trained to reconstruct the target sentence T . During inference, the models only see the source sentences, and the constraints if given, and complete the input sequence by generating T . In contrast, BART encodes S 1 and S 2 , and its decoder is trained to predict T based on the encoded source sentences. We use the pre-trained models from HuggingFace Transformers (Wolf et al., 2019) and adapt them for fine-tuning on our customized training data. In order to generate compact sentences capturing the relevant implicit knowledge (instead of long explanations), we set a length limit of 20 tokens for each generation. More details about our models are listed in the Appendix.

2 Details about the models appear in the Appendix.
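A minimal sketch of the input-sequence layout described above. The `<SEP>`/`<EOT>` strings here are placeholders of our own; the actual special tokens depend on each model's tokenizer:

```python
SEP, EOT = "<SEP>", "<EOT>"  # placeholder special tokens

def build_input(s1, s2, concepts=None, path=None, target=None):
    """Concatenate source sentences and optional constraints.
    `concepts` is the (c1, c2) pair for the +c variant; `path` is a
    linearized knowledge path for the +p variant; `target` is only
    appended during fine-tuning (after the EOT tag)."""
    parts = [s1, SEP, s2]
    if concepts:
        parts += [SEP, concepts[0], SEP, concepts[1]]
    if path:
        parts += [SEP, path]
    if target is not None:
        parts += [EOT, target]
    return " ".join(parts)
```

At inference time `target` is omitted, and the model completes the sequence by generating T.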

Evaluation and Results
This section presents an in-depth evaluation of the quality of generations from different model variants, and of their ability to express implicitly conveyed knowledge. We design a manual evaluation setup covering various dimensions, and compare the results to several automatic evaluation metrics. We conduct evaluation in-domain, on our customized test data, and out-of-domain, on IKAT.

Manual Evaluation
Questions to Annotators. 4 To filter out source sentence pairs between which no implicit information is missing, we first ask the annotators for each source sentence pair whether the sentences are implicitly connected by some (unexpressed) piece of knowledge (yes/no). The annotators are then guided through follow-up questions covering four dimensions: (1) Grammaticality – we ask if the generated sentence is grammatically correct, given the choices correct, almost correct (minor grammatical errors), and incorrect (major grammatical errors); (2) Coherence – we ask if the generated sentence is logically and semantically consistent with respect to the two source sentences, given the choices fully coherent, partly coherent, or incoherent; (3) Content – we ask if the generated sentence gives an explanation of the connection between the two source sentences, given the choices yes, neutral (if the generated sentence is related to the source sentences, but not in a clear logical relation), and no (if the sentence is misleading or contradictory in the context of the source sentences); 5 (4) Comparison to the annotated reference sentence 6 – we ask if the generated sentence is similar in meaning to the reference, given the choices similar, partly similar, or not similar. In addition, we ask whether the reference sentence or the generated sentence is a more meaningful explanation of the implicit knowledge that connects the source sentences, or whether both are equally meaningful explanations.
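For illustration, the answer options above can be aggregated into per-dimension score distributions roughly as follows. The dictionary keys and function names are ours, not part of the released annotation tooling:

```python
from collections import Counter

# The four evaluation dimensions and their answer options,
# taken from the description above.
DIMENSIONS = {
    "grammaticality": ["correct", "almost correct", "incorrect"],
    "coherence": ["fully coherent", "partly coherent", "incoherent"],
    "content": ["yes", "neutral", "no"],
    "similarity": ["similar", "partly similar", "not similar"],
}

def score_distribution(annotations, dimension):
    """Fraction of each answer option for one dimension, over a list
    of per-item annotation dicts such as {"content": "yes"}."""
    counts = Counter(a[dimension] for a in annotations)
    total = sum(counts.values())
    return {opt: counts.get(opt, 0) / total
            for opt in DIMENSIONS[dimension]}
```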
Annotation Setup. Our goal is to investigate which model variant is best suited for generating grammatically sound, coherent and meaningful explanations. We approach this question with two annotation rounds: In a first round we aim to determine which model is best suited for generating implicitly conveyed knowledge, and which dataset is best suited for fine-tuning the model for generating statements on out-of-domain test sets. In a second annotation round we aim to determine which types of constraints yield best results, now restricted to the best performing model and training setup, as determined in round one.
Annotator Agreement. Annotation was performed by two annotators with a background in computational linguistics. We measure IAA using Cohen's Kappa, combined over rounds one and two, and achieve an agreement of 95% on dimension 1, 80% on dimension 2, 77% on dimension 3, and, on dimension 4, 82% for the first and 78% for the second question. Remaining conflicts were resolved by an expert annotator.

Best Model Type and Fine-Tuning Data
For the first annotation round we sample 10 source sentence pairs from each testset, hence 30 pairs overall, and the sentences generated by GPT-2, XLNet and BART for each instance, using concepts as constraints. For IKAT, we consider the sentences generated by each model fine-tuned on e-SNLI vs. GenericsKB. This sums up to 120 annotation samples (generated sentences). 7 Example generations appear in Fig. 1.

5 The difference between dimension 2 and 3 is that with dimension 2 (coherence), we want to explore if the generated sentence semantically fits the two given source sentences. We understand coherence, following Hobbs (1979), as the existence of specific knowledge relations that hold between concepts in a text (or discourse), such as Cause-Effect, Condition, or Temporal Sequence, cf. Wolf and Gibson (2004). These relations make texts interpretable and informative and are ultimately motivated by the speaker's or writer's need to be understood (Hobbs, 1979). In contrast, when evaluating the content of the generated sentence in dimension 3, we want to discover if the sentence really explains the connection between the two source sentences. 6 The reference sentence is only provided for Question 4.

Results. For all 30 sentence pairs the annotators agreed that there is some implicit information connecting them. Table 4 displays the results of the first annotation round for the four dimensions described above. All three models are able to generate grammatically correct sentences (col. 1), with BART's generations scored as correct most often. BART also generates the most coherent sentences (col. 2), in-domain (e-SNLI and GenericsKB) and out-of-domain (IKAT), followed by XLNet. For dimension 3, which evaluates whether the generations are meaningful explanations of implicit knowledge connecting the source sentences (col. 3), only BART fine-tuned on e-SNLI gives satisfactory results (in-domain, when fine-tuned and tested on e-SNLI; and out-of-domain, when fine-tuned on e-SNLI and tested on IKAT). Many of the generations from GPT-2 are judged as neutral (orange in Table 4) or misleading (red). The last two columns reflect the comparison of the generated vs. annotated reference sentence (dimension 4). BART's generations are overall rated as most similar to the reference sentence, especially when fine-tuned on e-SNLI (in- and out-of-domain), and are judged as better or equally good explanations compared to the reference sentences in 70% (e-SNLI, in-domain) and 50% (IKAT-e-SNLI, out-of-domain) of cases.
To summarize, according to our first round of evaluation, the BART model generates the most grammatical and coherent statements that are found to explain the connection between the source sentences best. They are also judged to be most similar to the reference sentence. When applied on out-of-domain testsets, BART performs best when fine-tuned on e-SNLI.

Best Constraints
While the first round of annotations used a relatively small set of 120 generated target sentences that helped us to determine BART as the best-suited model type, we now aim to deeper investigate the generations of BART to study the effect of different types of constraints on the quality of expla-  when fine-tuned and tested in-domain on (i) e-SNLI and (ii) GenericsKB; or out-of-domain testing on IKAT, when fine-tuned on (iii) e-SNLI or (iv) GenericsKB; with marked best/worst scores for in-and out-of domain testing.
nations. We provide our annotators with 70 new source sentence pairs (20 from e-SNLI, 20 from GenericsKB, 30 from IKAT), and three different targets per pair, generated by three model variants of BART: (i) a baseline fine-tuned without any knowledge constraints; (ii) BART fine-tuned using the key concepts as constraints; and (iii) BART finetuned using an automatically generated commonsense knowledge path between the key concepts as constraint. Since fine-tuning on e-SNLI has been determined as best suited for out-of-domain testing, we consider only generations from BART fine-tuned on e-SNLI for testing on IKAT. In our evaluation we consider the 70 sentence pairs and the respective sentence generations from Round 2, and the generations for the 30 source sentence pairs from the best performing model BART from Round 1, resulting in 100 sentence pairs, with three generations per pair.
Results. Similar to Round 1, for 98% of the source sentence pairs the annotators agreed that there is some implicit information connecting them. Fig. 2 shows the results of the second round of evaluations; example generations appear in Table 2. We find that using knowledge constraints improves the quality of generations compared to the baseline without constraints, on all four dimensions: on each of our three test sets, generations are rated as more grammatical when constrained with concepts and paths (with GenericsKB as the only exception); they are annotated as more coherent, and rated as better explanations of implicit knowledge. Knowledge constraints also lead to a higher similarity to the reference sentence on all three datasets, and sentences generated with knowledge constraints are more often rated as better explanations than the reference sentences. Overall we find that knowledge paths improve scores over the baseline more than concepts do (a plus of 2–15 pp). The improvements are most pronounced for IKAT, where adding concepts boosts evaluation scores by between 18 pp (Grammaticality) and 53 pp (Coherence), and adding paths by between 20 pp (Grammaticality) and 55 pp (Coherence). The generations of BART fine-tuned on e-SNLI, as shown in the first example in Fig. 1, demonstrate how the integration of paths as constraints can improve text generation even further than injecting key concepts alone. The path used as constraint is Germany's aging society CAUSES increasing costs. When constraining BART with key concepts, it generates The social security and pension costs are being paid for by the people of Germany, while the generation with the knowledge path as constraint is Social security and pension costs are rising because more pension is needed for elderly people in Germany. This shows that the relation CAUSES gives our model an important hint about the causal relation that is needed to explain the connection between the two given sentences.
To summarize, the results from our second evaluation round clearly show that constraints in the form of relevant concepts and knowledge paths can help LMs generate grammatically sound, coherent and meaningful explanations of the missing knowledge between sentences, especially when applied to out-of-domain test sets.

Automatic Evaluation
In our automatic evaluation setup, we apply a range of different evaluation metrics commonly applied in text generation tasks, which either measure the similarity to a reference sentence (in our case, the generic sentences in GenericsKB, inference explanations in e-SNLI, or implicit knowledge statements in IKAT), or the linguistic quality and diversity of the generated sentence.

Figure 2: Results of the 2nd manual evaluation: comparing models constrained with concepts (+c) or paths (+p) against a baseline without constraints. We display improvements in percentage points (pp) for the best option (blue bar) per dimension.
(i) BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) measure token overlap using n-grams. We apply BLEU-1 to measure precision and ROUGE-1 to measure recall based on unigrams; (ii) BERT-Score (Zhang et al., 2020) and Sentence-BERT (Reimers and Gurevych, 2019) compute semantic similarity scores for text sequences based on word or sentence representations. BERT-Score uses BERT's contextualized word embeddings to calculate a cross similarity score for each token in the generation with each token in the reference, while Sentence-BERT is fine-tuned on NLI and STS to predict the similarity of two sequences. For BERT-Score we report F1 scores; for Sentence-BERT we average the similarity scores obtained for the generated vs. reference sentences.
(iii) S2Match (Opitz et al., 2020) is an AMR graph matching metric, which measures the overlap of the AMR semantic graphs that we construct from the reference and generated sentence using Cai and Lam (2020)'s parser, and reports accuracy; (iv) Distinct-N (Li et al., 2015) and GRUEN (Zhu and Bhat, 2020) are reference-free metrics that only consider properties of the generated sentence. Distinct-N measures the diversity of a sentence by focusing on the number of distinct unigrams (Distinct-1) and bigrams (Distinct-2); GRUEN evaluates the linguistic quality of a sentence in terms of grammaticality, non-redundancy, and structure.
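The reference-based unigram metrics under (i) and the reference-free Distinct-N under (iv) reduce to a few lines. The sketch below is simplified: it omits BLEU's brevity penalty and multi-reference support, and uses whitespace tokenization:

```python
from collections import Counter

def unigram_scores(generated, reference):
    """Clipped unigram precision (BLEU-1-style) and unigram recall
    (ROUGE-1-style) against a single reference sentence."""
    gen, ref = generated.lower().split(), reference.lower().split()
    # Counter intersection clips repeated tokens to the reference count.
    overlap = sum((Counter(gen) & Counter(ref)).values())
    return overlap / len(gen), overlap / len(ref)

def distinct_n(text, n):
    """Distinct-n: ratio of unique n-grams to all n-grams in `text`."""
    toks = text.lower().split()
    ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```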
In a preliminary experiment based on the complete test sets of Generics-KB, e-SNLI and IKAT (cf. Table 3) we first investigate which model generates sentences that are most similar to the reference sentence (using reference-based metrics), or which show the highest linguistic quality and diversity (using reference-free metrics); and which dataset is best suited for fine-tuning the models for generating statements on out-of-domain test sets (here, IKAT). Results and a detailed analysis of this experiment appear in the Appendix. We find that which model performs best depends strongly on the chosen similarity metric; overall we do not see the clear superiority of the BART model (nor the inferiority of GPT-2) that we determined through manual evaluation. While in Dimension 4 of the manual evaluation setup (where annotators judged whether generated and reference sentence express the same or similar meaning) BART was clearly rated as the best performing model, this is not reflected in the automatic evaluation scores. Among all metrics, only Sentence-BERT, giving the highest scores to BART, followed by XLNet, aligns with our observations from manual evaluation. However, our other observation from manual evaluation – that e-SNLI is the most appropriate dataset for fine-tuning LMs for out-of-domain testing – aligns with the scores obtained by automatic evaluation metrics (for details, cf. Appendix).

We next analyse which types of constraints improve generation, focusing on the BART model, which has been shown to be best for generating implicit knowledge statements in our manual evaluation setup. Our automatic evaluation is based on the same subset of source sentence pairs used for the second round of manual annotations (cf. Table 3), and we again compare generations without constraints to conditioning on key concepts or knowledge paths. 8 Results are displayed in Table 5. We observe that for all metrics, scores increase when constraining LMs with concepts or knowledge paths, with BLEU and S2Match scores for GenericsKB as the only exceptions. As in manual evaluation (Fig. 1), we find that improvements are most significant for IKAT. The observed improvements may in part be traced back to increased word overlap due to key concepts being used as constraints.

Table 5: Automatic similarity scores for generations of the best performing model BART, w/o constraints or with concepts/paths as constraints. Adding concepts and paths improves scores in-domain (e-SNLI and Generics-KB) and out-of-domain (IKAT, fine-tuned on e-SNLI).
Yet we also observe that automatically generated knowledge paths between these concepts further improve scores – according to reference-based metrics (showing that generations become more similar to references) and reference-free metrics (showing improvement in the linguistic quality and diversity of generations). This indicates that constraining LMs with automatically generated relational knowledge is a promising step towards generating grammatically correct and meaningful implicit knowledge statements.

Discussion
Limitations of Automatic Evaluation Metrics for Text Generation. Concluding, we pinpoint two important limitations of automatic text generation metrics – especially reference-based ones. Besides well-known issues regarding the reliability, interpretability and biases of such metrics (Callison-Burch et al., 2006), scores are mostly obtained by comparing generations against a single reference, which is – here, as in other generation tasks – often only one among several valid options. For the task of reconstructing implicit information, Becker et al. (2017) show that annotators often propose different valid sentences for filling knowledge gaps in argumentative texts. For our setting this means that a generated sentence may be a relevant explicitation of implicit information even if it is not similar to the reference. Such cases are poorly captured, or not captured at all, by automatic similarity metrics. An exception we found is Sentence-BERT, which is based on sentence representations and aligned reasonably well with insights from our manual evaluation. Still, automatic evaluation of text generations needs to be treated with caution, and should always be accompanied by manual evaluation.
Our Implicitness Assumption. Our experiments rest on the assumption that some information between pairs of sentences usually stays implicit, which has been confirmed empirically for our datasets: our annotators stated for 100% (first round) and 98% (second round) of all sentence pairs that they are implicitly connected by some unexpressed piece of knowledge. However, we did not specifically address the (rare) cases of sentence pairs between which no implicit information is missing, nor did we investigate how our models would perform when provided with sentence pairs that are not related (arbitrary pairs). For a real-world application, both aspects would need to be considered.

Conclusion
In this work we propose an approach for generating statements that explicate implicit knowledge connecting sentences in text, using pre-trained LMs. We show that despite their great success in many downstream NLP tasks, LMs need to be well equipped and carefully guided for the challenging task of reconstructing implicit knowledge, to ensure that they convey the missing, implicit information that connects sentences in a text. We refine different pre-trained LMs by fine-tuning them on specifically prepared corpora that we enrich with implicit information filled in between sentences, and explore methods of constrained language generation, guiding the models by way of relevant concepts and connecting commonsense knowledge paths.
While most current automatic NLG metrics are not sufficient to evaluate this challenging task, our in-depth evaluation of the quality of generations from different model variants shows that the BART model, which attends over its full input when generating text, yields the most informative and relevant explanations. We also establish that e-SNLI, being focused on the NLI task, is best suited for conditioning LMs for our task, especially in out-of-domain settings. Finally, by providing the LMs with relevant key concepts as constraints, and further with connecting commonsense knowledge paths, we achieve generation of coherent and grammatically sound sentences that, according to manual evaluation, can explicate the implicit knowledge connecting sentence pairs in texts, for both in-domain and out-of-domain test data.
GenericsKB datasets, respectively, while BART requires 8 hours, and XLNet around 20 hours (due to its permutation procedure) for the same data.
Limiting Length of Generations. In order to generate compact sentences capturing the relevant implicit knowledge (instead of long explanations), we set a length limit of 20 tokens for each generation. In the left-to-right decoding procedure of GPT-2 and BART, generation can stop before 20 tokens, when the model predicts an EOT token; thus both models can predict complete sentences of up to 20 tokens thanks to their autoregressive decoders. In contrast, XLNet uses a permutation language modeling mechanism and predicts the next tokens based on the previous and next tokens, and its generations usually do not contain a meaningful EOT token. We therefore truncate the predicted target sequence of tokens in a post-processing step, by cutting it after a generated comma (,).
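The described post-processing can be sketched as follows. The EOT symbol used here, and the decision to drop the trailing comma when truncating XLNet-style outputs, are our own assumptions about the implementation details:

```python
def postprocess(tokens, max_len=20, eot="<eot>"):
    """Truncate a generated token sequence to a compact sentence.

    GPT-2/BART case: cut at the first EOT token (or at max_len).
    XLNet-style fallback: no reliable EOT, so cut at the first comma
    (whether the comma itself is kept is an assumption; we drop it).
    """
    tokens = tokens[:max_len]          # hard length limit of 20 tokens
    if eot in tokens:
        return tokens[:tokens.index(eot)]
    if "," in tokens:
        return tokens[:tokens.index(",")]
    return tokens
```

The same function covers both decoder types, since a sequence that contains an EOT token never reaches the comma fallback.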
Maximum Sequence Lengths. Our customized train sets have different maximum sequence lengths: e-SNLI has a maximum sequence length of 80 tokens including the target sentence, while GenericsKB has up to 140 tokens per sequence.

B Establishing Knowledge Paths for Constraining Text Generation
For dynamically establishing connections between the key concepts from two source sentences, we combine two model types: COREC-LM (Becker et al., 2019), an open-world multi-label relation classifier enhanced with a pre-trained language model, which predicts relation types between two given concepts (for establishing direct connections between concepts); and COMET (Bosselut et al., 2019), a pre-trained transformer model that learns to generate target concepts given a source concept and a relation (for generating multi-hop paths). By combining the generations of these models, we construct single- and multi-hop paths between key concepts c1, c2 from a sentence pair, and use these paths as constraints when generating target sentences. We are able to retrieve paths for 86.2% of all key concept pairs from GenericsKB, for 30.2% from e-SNLI, and for 44.2% from IKAT. The differences can be explained by the fact that while the key concepts in GenericsKB are extracted phrases (NPs, VPs, ADJPs and ADVPs), the key concepts in e-SNLI and IKAT are manually labelled and thus often very specific, containing nested phrases (e.g. leans over a pickup truck (e-SNLI)); it is therefore more difficult to predict a relation or path between them. When we experiment with paths as constraints, for all instances where no path could be established between the key concepts, we use only the key concepts as constraints.
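The path search combining the two models can be sketched as follows. `predict_relation` and `generate_targets` are hypothetical stand-ins for COREC-LM and COMET respectively, and the relation inventory is illustrative, not the actual label set:

```python
RELATIONS = ["UsedFor", "AtLocation", "HasProperty"]  # illustrative subset

def find_path(c1, c2, predict_relation, generate_targets):
    """Search for a single-hop, then a two-hop path between key concepts.

    predict_relation(a, b) -> relation label or None  (COREC-LM stand-in)
    generate_targets(a, rel) -> list of target concepts (COMET stand-in)
    """
    rel = predict_relation(c1, c2)
    if rel is not None:                       # direct (single-hop) connection
        return [(c1, rel, c2)]
    for rel1 in RELATIONS:                    # two-hop: c1 -rel1-> mid -rel2-> c2
        for mid in generate_targets(c1, rel1):
            rel2 = predict_relation(mid, c2)
            if rel2 is not None:
                return [(c1, rel1, mid), (mid, rel2, c2)]
    return None                               # fall back to key concepts only

# Toy stand-ins for demonstration (not real model outputs):
direct = {("computer", "email"): "UsedFor",
          ("email", "communication"): "UsedFor"}
pr = lambda a, b: direct.get((a, b))
gt = lambda a, r: ["email"] if (a, r) == ("computer", "UsedFor") else []
```

A `None` result corresponds to the fallback described above, where only the key concepts themselves are used as constraints.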

C Automatic Evaluation of the Complete Test Sets
As mentioned in Section 5.2 of our main paper, in a preliminary study based on the complete test sets of GenericsKB, e-SNLI and IKAT, we investigate which model generates sentences that are most similar to the reference sentence or show the highest linguistic quality and diversity, and which dataset is best suited for fine-tuning the models for generating statements on out-of-domain test sets (here, IKAT). Results for this first analysis appear in Table 7. For metrics that measure token overlap (BLEU and ROUGE), the highest scores are obtained when fine-tuning and testing on e-SNLI, which can be traced back to frequently used linguistic patterns (e.g., x implies y, or x is the same as y) that occur in both train and test sets of e-SNLI. In contrast, the reference-free metrics Distinct and GRUEN, which measure diversity and non-redundancy, yield higher scores when models are fine-tuned on the more diverse GenericsKB data, for both in- and out-of-domain testing. The AMR metric S2Match gives higher scores on e-SNLI than on GenericsKB in in-domain testing, and fine-tuning on e-SNLI also yields higher S2Match scores for out-of-domain testing on IKAT; this aligns with the sentence representation based metric SentenceBERT. BERTScore, finally, is not discriminative at all: it yields uniformly high scores for each model and configuration, ranging only between .88 and .90.

We also find that the scores differ considerably for in-domain vs. out-of-domain testing: results on IKAT are lower than for testing on e-SNLI or GenericsKB according to all reference-based metrics, while we observe the opposite for the reference-free metrics.
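Among these metrics, Distinct-n is simple enough to sketch in full; as commonly defined, it measures diversity as the ratio of unique n-grams to all n-grams across the generated sentences:

```python
def distinct_n(sentences, n=2):
    """Distinct-n: number of unique n-grams divided by the total number
    of n-grams across all generated sentences (higher = more diverse)."""
    ngrams = []
    for sent in sentences:
        toks = sent.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

A model that reuses the same templated patterns across generations (as we observe for e-SNLI fine-tuning) repeats many n-grams and thus scores lower than one producing more varied output.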
We next analyse, on the complete test sets, which types of constraints improve generation, focusing on the BART model, which was shown to perform best for generating implicit knowledge statements in our manual evaluation setup. The automatic evaluation scores for the complete test sets are displayed in Table 8 and confirm our findings from the subset of the second annotation round, as presented in Section 5.2 of our main paper.

D Example Generations
In addition to the examples shown in our main paper, we present further example generations in Fig. 1.

Table 7: Automatic similarity scores computed for the generations of all models, on the complete test sets. We compare the impact of (i) model types and (ii) data used for fine-tuning (train), in-domain (GenericsKB and e-SNLI) and out-of-domain (IKAT).