Adaptable and Interpretable Neural Memory Over Symbolic Knowledge

Past research has demonstrated that large neural language models (LMs) encode surprising amounts of factual information; however, augmenting or modifying this information requires modifying a corpus and retraining, which is computationally expensive. To address this problem, we develop a neural LM that includes an interpretable neuro-symbolic KB in the form of a "fact memory". Each element of the fact memory is formed from a triple of vectors, where each vector corresponds to a KB entity or relation. Our LM improves performance on knowledge-intensive question-answering tasks, sometimes dramatically, including a 27 point increase in one setting of WebQuestionsSP over a state-of-the-art open-book model, despite using only 5% of the parameters. Most interestingly, we demonstrate that the model can be modified, without any re-training, by updating the fact memory.


Introduction
Neural language models (LMs) (Peters et al., 2018; Devlin et al., 2019; Raffel et al., 2019) that have been pre-trained by self-supervision on large corpora contain rich knowledge about the syntax and semantics of natural language (Tenney et al., 2019), and are the basis of much recent work in NLP. Pretrained LMs also contain large amounts of factual knowledge about the world (Petroni et al., 2019; Roberts et al., 2020; Brown et al., 2020). However, while large LMs can be coerced to answer factual queries, they still lack many of the properties that knowledge bases (KBs) typically have. In particular, it is difficult to distinguish answers produced by memorizing factual statements in the pre-training corpus from lower-precision answers produced by linguistic generalization (Poerner et al., 2019). It is also difficult to add or remove factual information without retraining the LM, an expensive process. The difficulty of updating knowledge in neural LMs contrasts with symbolic KBs, where it is very easy to add or modify triples, and is a major disadvantage of using an LM "as a KB": in many domains (news, product reviews, scientific publications, etc.) the set of known facts changes frequently. Symbolic KBs thus remain practically important (Google, 2012; Dong, 2017), especially for NLP applications where text is hard to automatically process (e.g., scientific, technical, or legal text) or for tasks rich in information that exists only in structured form (e.g., technical specifications of a new product, where no product page or review text discussing it yet exists).
Motivated by this, past work has sought to combine the benefits of neural LMs with the large, broad-coverage KBs that now exist (Bollacker et al., 2008; Auer et al., 2007; Vrandečić and Krötzsch, 2014). This paper continues this research program with a new knowledge-augmented LM called Fact Injected Language Model (FILM). FILM is a masked LM, where masks can be filled either from the token vocabulary or an entity vocabulary. The vector representation of each entity in a KB is jointly learned alongside other parameters of a Transformer LM, and stored in a separate entity memory. FILM also includes a fact memory where each element is derived from a triple of vectors, representing a KB entity or relation. Since these triples are defined compositionally from (representations of) entities and relations, they have an interpretable symbolic meaning: e.g., if e_mtv is the vector representation of KB entity "Mountain View, CA" and e_google and r_hq similarly correspond to "Google Inc" and the relation "headquartered in", these vectors can be used to construct a memory element f(e_google, r_hq, e_mtv) for the KB assertion "Google, Inc is headquartered in Mountain View, CA". This means that the fact memory can be easily extended with new facts.
In analysis on four benchmark question answering datasets we show that FILM improves significantly, and sometimes dramatically, over several strong baselines (e.g., BART and T5 (Raffel et al., 2019)), and this improvement is even larger when removing train-test overlap. In one setting of WebQuestionsSP, we outperform the next best performing model (RAG (Lewis et al., 2020a)) by 27 points despite using only 5% of the number of parameters.
Most interestingly, we demonstrate that FILM models can be updated without any re-training, by modifying the fact memory. Specifically, in §4.1, we show we can inject new facts into the memory at inference time, enabling FILM to correctly answer questions about pairs of entities that were never observed in training (either during pre-training or fine-tuning). In §4.2 we also evaluate updating the model by inserting facts that contradict facts mentioned in the pretraining data, and we show that FILM can correctly answer novel questions in this scenario as well. To summarize, this paper's contributions are: (1) we propose a neural LM for knowledge-intensive question-answering tasks that incorporates a symbolic fact memory, and (2) we show FILM can easily adapt to newly injected and modified facts without retraining.

Fact Injected Language Model
The Fact Injected Language Model (FILM) (see Figure 1) extends the Transformer (Vaswani et al., 2017) architecture of BERT (Devlin et al., 2019) with additional entity and fact memories. These memories store semantic information which can later be retrieved and incorporated into the representations of the Transformer. Similar to the approach in Févry et al. (2020), entity embeddings will (ideally) store information about the textual contexts in which each entity appears, and by inference, the entity's semantic properties. The fact memory encodes triples from a symbolic KB, constructed compositionally from the learned embeddings of the entities that comprise them, and is implemented as a key-value memory used to retrieve entities given their KB properties. This combination results in a neural LM which learns to access information from a symbolic KB.

Definitions
We represent a Knowledge Base K as a set of triples (s, r, o) where s, o ∈ E are the subject and object entities and r ∈ R is the relation, where E and R are pre-defined vocabularies of entities and relations. A text corpus C is a collection of paragraphs {p_1, ..., p_|C|}. Let M be the set of entity mentions in the corpus C. A mention m_i is encoded as (e_m, s_m^p, t_m^p), indicating that entity e_m is mentioned in paragraph p starting at token position s_m^p and ending at t_m^p. We will usually drop the superscript p and write s_m and t_m for brevity.
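To make the notation concrete, the definitions above can be sketched as simple Python containers; the class and field names below are our own illustrative choices, not part of the paper.

```python
from typing import NamedTuple

# Hypothetical containers mirroring the definitions above; names are ours.
class Triple(NamedTuple):
    s: str  # subject entity id, s in E
    r: str  # relation id, r in R
    o: str  # object entity id, o in E

class Mention(NamedTuple):
    entity: str  # e_m, the entity named by the mention
    start: int   # s_m, start token position within the paragraph
    end: int     # t_m, end token position (inclusive)

# A toy KB K as a set of triples
K = {Triple("Q95", "P159", "Q486860")}  # (Google, headquartered in, Mountain View)
```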

Input
The input to our model is a piece of text: either a question during fine-tuning (see §A.2.2) or a paragraph in pre-training (see §A.2.1). Pretraining is formulated as a cloze-type Question Answering (QA) task: given a paragraph p = {w_1, ..., w_|p|} with mentions {m_1, ..., m_n}, we sample a single mention m_i to act as the cloze answer and replace all tokens of m_i with [MASK] tokens. The entity in E named by the masked mention is the answer to the cloze question q ('United Kingdom' in the example input of Figure 1). Mentions in the paragraph other than m_i are referred to below as context mentions. In the following sections we describe how our model learns to jointly link context entities (§2.3) and predict answer entities (§2.5).
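The cloze construction above can be sketched as follows; the data format and function name are illustrative assumptions, not the paper's code.

```python
def make_cloze_example(tokens, mentions, answer_idx, mask_token="[MASK]"):
    """Build a cloze-style QA example: replace every token of the sampled
    answer mention with [MASK]; all other mentions become context mentions.

    `mentions` holds (entity_id, start, end) tuples with inclusive spans.
    """
    entity, start, end = mentions[answer_idx]
    masked = list(tokens)
    for i in range(start, end + 1):
        masked[i] = mask_token
    context = [m for j, m in enumerate(mentions) if j != answer_idx]
    return masked, entity, context

tokens = ["Charles", "Darwin", "was", "born", "in", "the", "United", "Kingdom"]
mentions = [("Q1035", 0, 1), ("Q145", 6, 7)]
q, answer, ctx = make_cloze_example(tokens, mentions, answer_idx=1)
# q now ends with two [MASK] tokens and the answer entity is Q145
```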

Entity Memory
Our entity memory E ∈ ℝ^{|E|×d_e} is a matrix containing a vector for each entity in E, trained as an entity-masked LM. The model input is a text span containing unlinked entity mentions with known boundaries. Mentions are masked with some probability. Our entity memory follows Entities as Experts (EaE) (Févry et al., 2020), which interleaves standard Transformer (Vaswani et al., 2017) layers with layers that access the entity memory.
Given a piece of text q = {w_1, ..., w_|q|}, the l-th Transformer layer produces a contextual embedding h_j^(l) for each token w_j. These contextual embeddings are used to compute query vectors that interface with the entity memory. For each context mention m_i = (e_{m_i}, s_{m_i}, t_{m_i}) in q, we form a query vector to access the entity memory by concatenating the context embeddings for the mention m_i's start and end tokens, h^(l)_{s_{m_i}} and h^(l)_{t_{m_i}}, and projecting them into the entity embedding space:

u_{m_i} = W_e^T [h^(l)_{s_{m_i}} ; h^(l)_{t_{m_i}}]    (1)

We use this query to compute attention weights over the full entity vocabulary and produce an attention-weighted sum of entity embeddings. The result is then projected back to the dimension of the contextual token embeddings and added to what would have been the input to the next layer of the Transformer. After the final Transformer layer T, h^(T)_{m_i} is used to predict the context entities ê_{m_i} and produce a loss against I_{e_{m_i}}, the one-hot label of entity e_{m_i}. Following Févry et al. (2020), we supervise the entity access for the intermediate query vector in Eq. 1.
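The entity-memory access step can be sketched with NumPy as below. W_e follows the naming in the text; the back-projection matrix (here W_back) and all shapes are our assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entity_memory_access(h_start, h_end, E, W_e, W_back):
    """Sketch of one entity-memory access (cf. Eq. 1).

    h_start, h_end: contextual embeddings of the mention's boundary tokens.
    E: |E| x d_e entity embedding matrix.
    W_e: projects the concatenated span embedding into entity space.
    W_back: projects the retrieved entity vector back to token dimension.
    """
    u = np.concatenate([h_start, h_end]) @ W_e  # query u_{m_i}
    attn = softmax(E @ u)                       # attention over the full entity vocabulary
    retrieved = attn @ E                        # attention-weighted sum of entity embeddings
    return retrieved @ W_back                   # added to the next layer's input

rng = np.random.default_rng(0)
d, d_e, n_ent = 4, 3, 5
out = entity_memory_access(rng.normal(size=d), rng.normal(size=d),
                           rng.normal(size=(n_ent, d_e)),
                           rng.normal(size=(2 * d, d_e)),
                           rng.normal(size=(d_e, d)))
```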

Fact Memory
FILM contains a second fact memory, populated by triples from the knowledge base K, as shown on the right side of Figure 1. The fact memory shares its entity representations with the entity memory E, but each element of the fact memory corresponds to a symbolic substructure, namely a key-value pair ((s, r), {o_1, ..., o_n}). The key (s, r) is a (subject entity, relation) pair, and the corresponding value {o_1, ..., o_n} is the list of object entities associated with s and r, i.e., (s, r, o_i) ∈ K for i = {1, ..., n}. Conceptually, KB triples with the same subject entity and relation are grouped into a single element. We call the subject and relation pair a_j = (s, r) ∈ A a head pair and the list of objects b_j = {o_1, ..., o_n} ∈ B a tail set.
In more detail, we encode a head pair a_j = (s, r) ∈ A by concatenating embeddings for the subject entity and relation, and then projecting them linearly into a new head-pair embedding space. More precisely, let E ∈ ℝ^{|E|×d_e} be the entity embeddings trained in §2.3, and R ∈ ℝ^{|R|×d_r} be embeddings of the relations R in the knowledge base K. We encode a head pair a as:

a = W_a^T [s ; r]    (3)

where s ∈ E and r ∈ R are the embeddings of subject s and relation r, and W_a is a learned linear transformation matrix. We let A ∈ ℝ^{|A|×d_a} denote the embedding matrix of all head pairs. Let the answer for q be denoted e_ans, and its masked mention m_ans = (e_ans, s_ans, t_ans). For a masked mention m_ans, we define a query vector to access the fact memory as:

v_{m_ans} = W_f^T [h^(l)_{s_ans} ; h^(l)_{t_ans}]    (4)

where h^(l)_{s_ans} and h^(l)_{t_ans} are the contextual embeddings for the start and end tokens of the mention m_ans, and W_f is the linear transformation matrix into the embedding space of head pairs A.
Head pairs in A are scored by the query vector v_{m_ans}, and the top k head pairs with the largest inner product are retrieved. This retrieval process on the fact memory is distantly supervised. We define a head pair a_ds = (s, r) to be a distantly supervised positive example for a passage if its subject entity s is named by a context mention m_i and the masked entity e_ans is an element of the corresponding tail set, i.e., e_ans ∈ b_ds. When no distantly supervised positive example exists for a passage, the model is trained to retrieve a special "null" fact comprised of the s_null head entity and r_null relation, i.e., a_ds = (s_null, r_null), whose tail set is empty. This distant supervision is encoded by a loss function, the cross-entropy between the retrieval distribution over head pairs and the label a_ds. The result of this query is that the tail sets associated with the top k scored head pairs, i.e., {b_j | j ∈ TOP_k(v_{m_ans}, A)}, are retrieved from the fact memory.
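Head-pair encoding and top-k retrieval can be sketched as below, under our assumed shapes; W_a and W_f are the matrices named in the text, everything else is illustrative.

```python
import numpy as np

def encode_head_pairs(S, R, W_a):
    """a = W_a^T [s ; r]: concatenate subject and relation embeddings and
    project them into the head-pair space (S: n x d_e, R: n x d_r)."""
    return np.concatenate([S, R], axis=-1) @ W_a

def topk_head_pairs(h_start, h_end, A, W_f, k=2):
    """Score every head pair against the query v_{m_ans} and return the
    indices of the k largest inner products, plus all scores."""
    v = np.concatenate([h_start, h_end]) @ W_f
    scores = A @ v
    return np.argsort(-scores)[:k], scores

rng = np.random.default_rng(1)
n, d_e, d_r, d_a, d = 6, 3, 2, 4, 5
A = encode_head_pairs(rng.normal(size=(n, d_e)), rng.normal(size=(n, d_r)),
                      rng.normal(size=(d_e + d_r, d_a)))
top, scores = topk_head_pairs(rng.normal(size=d), rng.normal(size=d), A,
                              rng.normal(size=(2 * d, d_a)), k=2)
```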

Integrating Knowledge and Context
Next, the tail sets retrieved from the fact memory are aggregated. Recall that a tail set b_j returned from the fact memory is the set of entities {o_1, ..., o_n} such that (s, r, o_i) ∈ K for i ∈ {1, ..., n}, with associated head pair a_j = (s, r). Let o_i ∈ E be the embedding of entity o_i. We encode the returned tail set b_j as a weighted centroid of the embeddings of its entities:

b_j = Σ_i α_i o_i    (5)

where α_i is a context-dependent weight of the object entity o_i. To compute the weights α_i, we use a process similar to Eq. 4: we compute a second query vector

z_{m_ans} = W_b^T [h^(l)_{s_ans} ; h^(l)_{t_ans}]    (6)

where W_b is a transformation matrix distinct from W_e in Eq. 1 and W_f in Eq. 4, and the weights α_i are the softmax of the inner products between z_{m_ans} and the embeddings of entities in the tail set b_j. The top k tail sets b_j are further aggregated using weights β_j, which are the softmax of the retrieval (inner product) scores of the top k head pairs a_j:

f_{m_ans} = Σ_j β_j b_j    (7)

This yields a single vector f_{m_ans} that we call the knowledge embedding for the masked mention m_ans.
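The two-level aggregation (α within each tail set, β across tail sets) can be sketched as follows; array shapes and names are our assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def knowledge_embedding(tail_sets, head_scores, z):
    """Aggregate retrieved tail sets into the knowledge embedding f_{m_ans}.

    tail_sets: list of (n_j x d_e) arrays of object-entity embeddings.
    head_scores: retrieval scores of the top-k head pairs (softmaxed to beta).
    z: the second query vector z_{m_ans} scoring entities within each tail set.
    """
    beta = softmax(np.asarray(head_scores, dtype=float))
    centroids = []
    for B in tail_sets:
        alpha = softmax(B @ z)       # context-dependent weights alpha_i
        centroids.append(alpha @ B)  # weighted centroid of the tail set
    return beta @ np.stack(centroids)

rng = np.random.default_rng(2)
d_e = 3
f = knowledge_embedding([rng.normal(size=(4, d_e)), rng.normal(size=(2, d_e))],
                        [1.5, 0.2], rng.normal(size=d_e))
```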
Intuitively, f_{m_ans} is the result of retrieving a set of entities from the fact memory. The last step is to integrate this retrieved set into the Transformer's contextual embeddings. Of course, KBs are often incomplete, and especially during pre-training it might be necessary for the model to ignore the result of retrieval if no suitable triple appears in the KB. To model this, the final step in the integration process is to construct an integrated query q_{m_ans} with a learnable mixing weight λ. Algorithmically, λ is computed as the probability of retrieving a special "null" head a_null from the fact memory, i.e., the probability that no oracle head pair exists in the knowledge base. q_{m_ans} is then used to predict the masked entity.
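One simple way to realize the λ-mixing described above is a convex combination. The exact mixture form is our assumption: we take λ as the null-head probability so that a confident "null" retrieval down-weights the fact memory in favor of the contextual representation.

```python
def integrate(f_mans, h_mans, p_null):
    """Mix the knowledge embedding f_{m_ans} with the contextual
    representation using lambda = p_null (an assumed convex combination:
    when the null head is likely, retrieval is effectively ignored)."""
    lam = p_null
    return lam * h_mans + (1.0 - lam) * f_mans

q = integrate(1.0, 3.0, 0.5)  # scalar stand-ins for the embedding vectors
```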

Open vs Closed Book models
Generally, open book models refer to 'retrieve and read' pipelines (Chen et al., 2017) which, given a query, 1) retrieve relevant passages from a corpus, 2) separately re-encode the passages conditioned on the question and then 3) produce an answer. Conversely, closed book models answer questions directly from their parameters without additional processing of source materials. We consider FILM and EaE closed-book models as they do not retrieve and re-encode any source text, and instead attend to parameterized query-independent memories.

Results in Conventional Settings
LAMA TREx. In Table 1, we can see that FILM outperforms several recently proposed models on the LAMA TREx task. FILM outperforms the next best performing model, BERT-KNN, by 5.5 points. Question-Answering. In Table 2

Results on KB-Answerable Questions
WebQuestionsSP (and similarly FreebaseQA, discussed in §4) was constructed such that all questions are answerable using the Freebase KB, which was last updated in 2016. Because our pretraining corpus is derived from larger and more recent versions of Wikipedia, we elected to use a KB constructed from Wikidata. Many entities in Freebase are unmappable to the more recent Wikidata KB, which means that some questions are no longer answerable using the KB. Because of this, we created reduced versions of these datasets which are Wikidata answerable, i.e., containing only questions answerable by triples from our Wikidata-based KB. On these, the model should learn to rely on the KB to answer the questions. We do the same for TriviaQA (TriviaQA does not have linked entities in its questions, so for those results we relax this restriction to include all examples where the answer resolves to a Wikidata entity). As seen in the "Wikidata answerable, Total" column of Table 2, FILM does much better on Wikidata-answerable questions on WebQuestionsSP. EmQL, the state-of-the-art dataset-specific model, gets 75.5% accuracy on the full dataset. Not surprisingly, this is because EmQL operates over the Freebase knowledge base, giving it full upper-bound recall. However, when we restrict to Wikidata-answerable questions, thus giving both EmQL and FILM potential for full recall, FILM outperforms EmQL by 3.5 points and the next best model (RAG) by over 15 points.

Train-Test Overlap
We are interested in the ability of models to use external knowledge to answer questions, rather than learning to recognize paraphrases of semantically identical questions. Unfortunately, analysis showed that many of the test answers also appear as answers to some training-set question: this is the case for 57.5% of the answers in WebQuestionsSP and 75.0% for FreebaseQA. This raises the possibility that some of the performance can be attributed to simply memorizing specific question/answer pairs, perhaps in addition to recognizing paraphrases of the question from its pretraining data.
Overlap in fine-tuning train/test splits was concurrently observed by Lewis et al. (2020b), who created human-verified filtered splits for TriviaQA and WebQuestions. We evaluate our models on those splits and report results in Table 2 in the "No Overlap" columns. We see that the gap between FILM and the next best performing model, RAG, increases from 4.6 to 5.7 points on WebQuestionsSP. On TriviaQA, FILM is still able to answer many questions correctly after overlap is removed. In contrast, the majority of closed book models such as BART get less than 1% of answers correct.

Filtering to Avoid Pretrain, Finetune, and Test Overlap
The filtering procedure from Lewis et al. (2020b) addresses finetuning train/test overlap but does not account for overlap with the pretraining data. To investigate this further, we looked at FreebaseQA and WebQuestionsSP, which both contain entity-linked questions and answers. We first perform a similar procedure to Lewis et al. (2020b) and discard questions in the fine-tuning training data that contain answers which overlap with answers to questions in the dev and test data. We end up with 9144/2308/3996 examples (train/dev/test) in FreebaseQA and 1348/151/1639 examples in WebQuestionsSP. This setting is referred to as the Fine-tune column in Table 4, which shows the effects of different filterings of the data.
Next, we want to ensure that the model will be unable to simply memorize paraphrases of question-answer pairs that it observed in the text, by removing all overlap between the pretraining data and finetuning test data. For every question-answer entity pair in our finetuning dataset (coming from any split), we filter every example from our Wikipedia pretraining corpus in which that pair of entities co-occurs. Additionally, we filter every fact from our fact memory containing any of these entity pairs. Results for this setting are in the column labeled Pretrain. The All column combines both pretrain and finetune filtering. We see that the models perform substantially worse when these filterings are applied and they are forced to reason across multiple examples, and in the case of FILM, the fact memory. Finally, the column denoted None has no filtering and is the same as the Full Dataset.
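The pretraining-overlap filter described above can be sketched as follows; the data formats (entity-id lists per example, pair-based bans) are our illustrative assumptions.

```python
import itertools

def filter_pretraining(corpus, kb, qa_pairs):
    """Drop pretraining examples and KB facts that leak finetuning
    question/answer entity pairs (a sketch of the filtering above).

    corpus: list of (text, entity_ids) examples.
    kb: set of (s, r, o) triples.
    qa_pairs: iterable of (question_entity, answer_entity) pairs.
    """
    banned = {frozenset(p) for p in qa_pairs}
    kept_corpus = [ex for ex in corpus
                   if not any(frozenset(pair) in banned
                              for pair in itertools.combinations(set(ex[1]), 2))]
    kept_kb = {t for t in kb if frozenset({t[0], t[2]}) not in banned}
    return kept_corpus, kept_kb

corpus = [("google hq passage", ["Q95", "Q486860"]), ("unrelated", ["Q1", "Q2"])]
kb = {("Q95", "P159", "Q486860"), ("Q1", "P0", "Q3")}
kept_corpus, kept_kb = filter_pretraining(corpus, kb, [("Q95", "Q486860")])
```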

Modifying the Knowledge Base
Because our model defines facts symbolically, it can in principle reason over new facts injected into its memory, without retraining any parameters of the model. Since existing datasets do not directly test this capability, we elected to construct variants of FreebaseQA and WebQuestionsSP where we could simulate asking questions that are answerable only from newly injected KB facts.
The approach we used was to (1) identify pairs of entities that occur in both a question and answer of some test example; (2) filter out such pairs from the KB as well as all pre-training and fine-tuning data; and (3) test the system trained on this filtered data, and then manually updated by injecting facts about those entity pairs. This filtering procedure is reminiscent of that used by Lewis et al. (2020b), but also addresses pretraining / test-set overlap.

Injecting New Facts to Update Memory
We evaluate EaE and FILM given full knowledge (the original setting); given filtered knowledge; and given filtered knowledge followed by injecting test-question-related facts into the KB. The gap between the filtered-knowledge setting and the injected-knowledge setting indicates how well the model incorporates newly introduced facts.
In more detail, we first perform a similar procedure to Lewis et al. (2020b) and discard questions in the fine-tuning training data that contain answers which overlap with answers to questions in the dev and test data. We end up with 9144/2308/3996 examples (train/dev/test) in FreebaseQA and 1348/151/1639 examples in WebQuestionsSP. Next, to ensure that the model will be unable to memorize paraphrases of question-answer pairs that it observed in the pretraining text, we remove all overlap between the pretraining data and fine-tuning test data: specifically, for every question-answer entity pair in our fine-tuning dataset (from any split), we filter every example from our Wikipedia pretraining corpus in which that pair of entities co-occurs. Additionally, we filter every fact from our fact memory containing any of these entity pairs.
In these experiments we compare against EaE for two reasons: 1) we are specifically looking at closed-book, open-domain, entity-based QA, and EaE is at or near state-of-the-art for that task (Févry et al., 2020); 2) most importantly, we want to be able to precisely control for memorization in the training corpus and therefore did not consider existing unconstrained pre-trained models like T5 (Raffel et al., 2019). For reference, the previous state-of-the-art on FreebaseQA, FOFE, had a score of 37.0% using the original train-test split, while FILM is at 63.3%.
The results are shown in Table 5. In the "Full" column, we pretrain and finetune the FILM model with the full knowledge base and corpus. In the "Filter" setting, facts about the finetuning data are hidden from the model at both pretraining and finetuning time. In this case, the model must fall back on the language model to predict the answer, and as shown in Table 5, the accuracies of FILM and EaE are similar. In the "Inject Facts" setting, facts are hidden at pretraining time but are injected at test time. The results show that FILM can effectively use the newly injected facts to make predictions, obtaining an absolute improvement of 9.3% compared to the "Filter" setting. EaE does not have a natural mechanism for integrating this new information.

Table 5: Injecting New Facts. In the Filter setting, the models have access to no direct knowledge about question-answer entity pairs from either the pretraining corpus or the KB. In the Inject setting, the pretraining corpus and training KB are still filtered, but at inference time new facts are injected into the model's memory, allowing it to recover most of the drop from the Full setting.
In the Full setting the model is exposed to full knowledge. In all cases, we remove the overlap between the finetune train and eval sets.

Updating Stale Memories
One of the main motivations for our model is to provide knowledge representations that can be incrementally updated as the world changes, avoiding stale data. In order to accomplish this, the model must learn to utilize the fact memory even in the case where those facts have changed such that they may no longer be consistent with the data the model was initially trained on. Further, it needs to accomplish that without any additional training.
To probe this ability, we simulate an extreme version of stale facts where all answers to QA pairs in the FreebaseQA test set are 'updated' with plausible alternatives. For each QA pair, we replace the original answer entity e original with another entity, e new , from our vocabulary that has: 1) been used as an object in at least one of the same relation types in which e original was used as an object, and 2) shares at least three Wikipedia categories with e original .
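The two selection criteria for e_new can be sketched as a small filter over the KB and a category map; the data formats below are our illustrative assumptions.

```python
def plausible_alternatives(e_original, kb, categories, min_shared=3):
    """Candidate replacements e_new for e_original: entities that appear as
    the object of at least one relation type in which e_original appears as
    an object, and share >= min_shared Wikipedia categories with it
    (a sketch; data formats are ours)."""
    rels = {r for (s, r, o) in kb if o == e_original}
    orig_cats = categories.get(e_original, set())
    return {o for (s, r, o) in kb
            if o != e_original and r in rels
            and len(categories.get(o, set()) & orig_cats) >= min_shared}

kb = {("q1", "born_in", "UK"), ("q2", "born_in", "France"), ("q3", "ate", "cake")}
cats = {"UK": {"a", "b", "c", "d"}, "France": {"a", "b", "c"}, "cake": {"a", "b", "c"}}
```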
We use the same pretrained models from our earlier experiments and fine-tune on the filtered FreebaseQA train set for 10,000 steps. We then modify the memory of this model without applying any additional training on the new memory. In addition to adding new memories which correspond to our newly created facts, we also must remove the original stale facts that we are updating. We look at two methods for filtering those 'stale facts' from the fact memory.
Basic Filter deletes every modified fact (e_question, r, e_original) and replaces it with a new fact (e_question, r, e_new). This is a low-recall filter, as it does not account for all possible related facts. The Strict Filter is a high-recall filter that more aggressively removes information that may conflict with the newly added fact, additionally removing all facts that contain e_question or e_original. This is important for cases such as when a question contains multiple entities, or when the linking relation is one-to-many, leading to multiple plausible answers. Together these two settings define rough bounds on the model's ability to perform this task. In Table 6, we see that FILM is able to utilize the modified KB to make the correct prediction for 54.5% of questions in the Basic Filter setting and 70.3% in the Strict Filter setting.
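The two filters can be sketched as set operations over the triple store; the triple format is our illustrative assumption.

```python
def basic_filter(kb, e_q, r, e_orig, e_new):
    """Low-recall update: delete only the stale triple and insert the new one."""
    kb = set(kb)
    kb.discard((e_q, r, e_orig))
    kb.add((e_q, r, e_new))
    return kb

def strict_filter(kb, e_q, r, e_orig, e_new):
    """High-recall update: drop every fact mentioning e_q or e_orig before
    inserting the new triple, removing anything that might conflict."""
    kb = {(s, rel, o) for (s, rel, o) in kb
          if e_q not in (s, o) and e_orig not in (s, o)}
    kb.add((e_q, r, e_new))
    return kb

kb = {("q", "r", "old"), ("q", "r2", "x"), ("y", "r", "old"), ("a", "r", "b")}
bk = basic_filter(kb, "q", "r", "old", "new")
sk = strict_filter(kb, "q", "r", "old", "new")
```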


Related Work
Symbolic KBs have been a core component of AI since the beginning of the field (Newell and Simon, 1956; Newell et al., 1959), and widely available public KBs have been invaluable in research and industry (Bollacker et al., 2008; Auer et al., 2007; Google, 2012; Dong, 2017; Vrandečić and Krötzsch, 2014). In machine learning, a well-studied problem is learning KB embeddings (Bordes et al., 2013; Lin et al., 2015; Trouillon et al., 2017; Dettmers et al., 2018), which enable generalization from known KB triples to novel triples that are plausibly true. KB embeddings can often be improved by incorporating raw text and symbolic KGs into a shared embedding space (Riedel et al., 2013; Verga et al., 2016, 2017), to be jointly reasoned over (Sun et al., 2018). Many prior neural-symbolic methods have attempted to unify symbolic KBs and neural methods (Pinkas, 1991; de Penning et al., 2011; Laird et al., 2017; Besold et al., 2017). Recently, researchers have explored query languages for embedded KBs that are similar to symbolic KB query languages (Cohen et al., 2017; Hamilton et al., 2018; Ren et al., 2020). Our fact memory builds on this prior work, and is most closely related to the memory used in EmQL, a KB embedding model that supports a compositional query language. EmQL implements "projection" using neural retrieval over vectorized KB triples. Unlike this work, however, EmQL did not embed its fact memory into an LM that could be finetuned for many NLP tasks; instead it required implementing a "neural module" within some task-specific architecture. At a more abstract level, the fact memory is a key-value memory (Weston et al., 2014; Miller et al., 2016), a construct used in many neural models in the past.
It has been shown that sufficiently large LMs trained through self-supervision (Peters et al., 2018; Devlin et al., 2019; Raffel et al., 2019; Brown et al., 2020) also encode factual information, motivating work on the extent to which an LM can serve as a KB (Roberts et al., 2020; Petroni et al., 2019; Poerner et al., 2019). Other work has explored techniques to improve the performance of large LMs in answering factual probes, by adding additional supervision in pre-training (Xiong et al., 2019; Wang et al., 2020b) or by adding entity embeddings into an extended LM (Zhang et al., 2019; Févry et al., 2020).
Our entity memory extends the Entities-as-Experts (EaE) model (Févry et al., 2020). It is both the current state-of-the-art for a number of tasks and simpler to use than most prior models, because it does not require external components for entity linking or entity encoding (like Zhang et al., 2019) and is not restricted to lexical KBs like WordNet and ConceptNet (like Weissenborn et al., 2017; Chen et al., 2018; Mihaylov and Frank, 2018).
Our model's use of memory also scales to KBs with millions of entities, whereas prior systems that make use of KB triples have operated with only a few hundred triples in the model at any point, necessitating a separate heuristic process to retrieve candidate KB triples (Ahn et al., 2016; Henaff et al., 2016; Weissenborn et al., 2017; Chen et al., 2018; Mihaylov and Frank, 2018). There have been a few exploratory experiments on modifying the predictions of retrieval-augmented language models by changing the underlying text corpus (Guu et al., 2020; Lewis et al., 2020a). However, text passages are not easily interpretable, making them less inspectable and modifiable than a symbolic fact-based memory.

Conclusion
We presented FILM, a neural LM with an interpretable, symbolically grounded fact memory. We demonstrated the effectiveness of this method by outperforming many state-of-the-art methods on four benchmark knowledge-intensive datasets. We used the model's symbolic interface to change the output of the LM by modifying only the non-parametric memories, without any additional training. We showed FILM could incorporate newly injected facts unseen during training. Additionally, we can modify facts such that they contradict the initial pretraining text, and our model is still largely able to answer these questions correctly.

Ethics and Broader Impacts
All language models learn to exploit correlations in the data they were trained on. As such, they inherit all of the underlying biases within that data (Zhao et al., 2019;Bender et al., 2021). These models require vast amounts of data to train on and therefore tend to rely on internet corpora which have skewed representations of particular groups, cultures, and languages, as well as variable levels of factuality. Our hope is that research into endowing these models with interpretable and modifiable memories will allow us to more readily identify and remedy some of these failures.

A.1.1 Evaluation Data Statistics
For WebQuestionsSP, we mapped question entities and answer entities to their Wikidata ids. 87.9% of the questions are answerable by at least one answer entity that is mappable to Wikidata. For all questions in FreebaseQA there exists at least one relational path in Freebase between the question entity e_i and the answer e_ans. The path must be either a one-hop path, or a two-hop path passing through a mediator (CVT) node, and is verified by human raters. 72% of the question entities and 80% of the answer entities are mappable to Wikidata, and 91.7% of the questions are answerable by at least one answer entity that is mappable to Wikidata.

Each mention serves as the masked target for its corresponding example, and other entity mentions in the example are treated as context entities (context entities are masked randomly with probability 0.15). This conversion results in 85.58 million pre-training examples. The knowledge base K is a subset of Wikidata that contains all facts with subject and object entity pairs that co-occur at least 10 times on Wikipedia pages. This results in a KB containing 1.54 million KB triples from Wikidata (or 3.08 million if reverse triples are included). Below, this is called the full setting of pretraining; we will also train on subsets of this example set, as described below. We pretrain the model for 500,000 steps with batch size 2048, and we set k = 1 in the TOP_k operation for fact memory access.
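The KB construction described above (keeping facts whose entity pairs co-occur often enough, optionally with reverse triples) can be sketched as follows; the "_inv" relation naming and data formats are our own assumptions.

```python
def build_kb(wikidata_triples, cooccur_counts, min_count=10, add_reverse=True):
    """Keep only facts whose (subject, object) pair co-occurs at least
    min_count times on Wikipedia pages; optionally add reverse triples."""
    kb = {(s, r, o) for (s, r, o) in wikidata_triples
          if cooccur_counts.get((s, o), 0) >= min_count}
    if add_reverse:
        kb |= {(o, r + "_inv", s) for (s, r, o) in kb}
    return kb

triples = {("s1", "r", "o1"), ("s2", "r", "o2")}
counts = {("s1", "o1"): 12, ("s2", "o2"): 3}  # s2/o2 co-occur too rarely
kb = build_kb(triples, counts)
```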

A.2.1 Pretraining
FILM is jointly trained to predict context entities and the masked entity. Context entities are predicted using the contextual embeddings described in §2.3; intermediate supervision with oracle entity linking labels is provided in the entity memory access step for context entities; the masked entity is predicted using the knowledge-enhanced contextual embeddings (§2.5); and distantly supervised fact labels are also provided at training time. The final training loss is the unweighted sum of the four losses: loss_pretrain = loss_ent + loss_ctx + loss_fact + loss_ans

A.2.2 Finetuning on Question Answering
In the open-domain Question Answering task, questions are posed in natural language, e.g., "Where was Charles Darwin born?", and answered by a sequence of tokens, e.g., "United Kingdom". In this paper, we focus on the subset of open-domain questions that are answerable using entities from a knowledge base. In the example above, the answer "United Kingdom" is an entity in Wikidata whose identifier is Q145.
We convert an open-domain question to a FILM input by appending the special [MASK] token to the end of the question, e.g., {'Where', 'was', 'Charles', 'Darwin', 'born', '?', [MASK]}. The task is to predict the entity named by the mask. Here, "Charles Darwin" is a context entity, which is also referred to as the question entity in the finetuning QA task.
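This conversion is simple enough to sketch directly; the function name is our own.

```python
def question_to_film_input(tokens, mask_token="[MASK]"):
    """Append a [MASK] token to the question; the model's task is to
    predict the entity named by that mask (format as described above)."""
    return list(tokens) + [mask_token]

q = question_to_film_input(["Where", "was", "Charles", "Darwin", "born", "?"])
```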
At finetuning time, entity embeddings E and relation embeddings R are fixed, and we finetune all Transformer layers and the four transformation matrices W_a, W_b, W_e, and W_f. Parameters are tuned to optimize the unweighted sum of the fact memory retrieval loss loss_fact and the final answer prediction loss loss_ans. If multiple answers are available, the training label I_{e_ans} becomes a k-hot vector uniformly normalized across the answers: loss_finetune = loss_fact + loss_ans

A.3 Model Parameters
The number of Base parameters includes the encoder and (where applicable) decoder Transformer parameters derived from the original papers. We exclude token embeddings in this count following prior work. The Memory parameter count for DPR, RAG, and FID includes the number of parameters required to cache and index the full 26 million passage Wikipedia corpus with dimension 768 used by those models. For EaE, the Memory count is for the entity embedding matrix. For FILM, it is both the entity embedding matrix and the cached fact embedding matrix comprised of the 1.7 million precomputed triple embeddings.