Are Pretrained Language Models Symbolic Reasoners over Knowledge?

How can pretrained language models (PLMs) learn factual knowledge from the training set? We investigate the two most important mechanisms: reasoning and memorization. Prior work has attempted to quantify the number of facts PLMs learn, but we present, using synthetic data, the first study that investigates the causal relation between facts present in training and facts learned by the PLM. For reasoning, we show that PLMs seem to learn to apply some symbolic reasoning rules correctly but struggle with others, including two-hop reasoning. Further analysis suggests that even the application of learned reasoning rules is flawed. For memorization, we identify schema conformity (facts systematically supported by other facts) and frequency as key factors for its success.

Recent work on knowledge captured by PLMs has focused on probing, a methodology that identifies the set of facts a PLM has command of. But little is understood about how this knowledge is acquired during pretraining and why. We analyze the ability of PLMs to acquire factual knowledge, focusing on two mechanisms: reasoning and memorization. We pose the following two questions: a) Symbolic reasoning: Are PLMs able to infer knowledge not seen explicitly during pretraining? b) Memorization: Which factors result in successful memorization of a fact by PLMs? We conduct our study by pretraining BERT from scratch on synthetic corpora. The corpora are composed of short knowledge-graph-like facts: subject-relation-object triples. To test whether BERT has learned a fact, we mask the object, thereby generating a cloze-style query, and then evaluate predictions.
Symbolic reasoning. We create synthetic corpora to investigate six symbolic rules (equivalence, symmetry, inversion, composition, implication, negation); see Table 1. For each rule, we create a corpus that contains facts from which the rule can be learned. We test BERT's ability to use the rule to infer unseen facts by holding out some facts in a test set. For example, for composition, BERT should infer, after having seen that leopards are faster than sheep and sheep are faster than snails, that leopards are faster than snails.
Our setup is similar to link prediction in the knowledge base domain and therefore can be seen as a natural extension of the question "Language models as knowledge bases?" (Petroni et al., 2019). In the knowledge base domain, prior work (Sun et al., 2019; Zhang et al., 2020) has shown that models that are able to learn symbolic rules are superior to ones that are not. Talmor et al. (2019) also investigate symbolic reasoning in BERT using cloze-style queries. However, in their setup, there are two possible reasons for BERT having answered a cloze-style query correctly: (i) the underlying fact was correctly inferred or (ii) it was seen during training. In contrast, since we pretrain BERT from scratch, we have full control over the training setup and can distinguish cases (i) and (ii).
A further difference to prior work is that premises are not explicitly presented at inference time; note that human inference similarly does not require that all relevant facts be explicitly repeated at inference time.
We find that i) BERT is capable of learning some one-hop rules (equivalence and implication). ii) For others, even though high test precision suggests successful learning, the rules were not in fact learned correctly (symmetry, inversion and negation). iii) BERT struggles with two-hop rules (composition). However, by providing richer semantic context, even two-hop rules can be learned.
Given that BERT can in principle learn some reasoning rules, the question arises whether it does so for standard training corpora. We find that BERT-large has only partially learned the types of rules we investigate here. For example, BERT has some notion of "X shares borders with Y" being symmetric, but it fails to apply symmetry consistently in other cases.
Memorization. During the course of pretraining, BERT sees more data than any human could read in a lifetime, an amount of knowledge that surpasses its storage capacity. We simulate this with a scaled-down version of BERT and a training set that ensures that BERT cannot memorize all facts in training. We identify two important factors that lead to successful memorization. (i) Frequency: Other things being equal, low-frequency facts are not learned whereas frequent facts are. (ii) Schema conformity: Facts that conform with the overall schema of their entities (e.g., "sparrows can fly" in a corpus with many similar facts about birds) are easier to memorize than exceptions (e.g., "penguins can dive").

Data
To test PLMs' reasoning capabilities, natural corpora like Wikipedia are limited since it is difficult to control what the model sees during training. Synthetic corpora provide an effective way of investigating reasoning by giving full control over what knowledge is seen and which rules are employed in generating the data.
In our investigation of PLMs as knowledge bases, it is natural to use (subject, relation, object) triples as basic units of knowledge; we refer to them as facts. The underlying vocabulary consists of a set of entities e, f, g, ... ∈ E, relations r, s, t, ... ∈ R and attributes a, b, c, ... ∈ A, all represented by artificial strings such as e14, r3 or a35. Two types of facts are generated. (i) Attribute facts: relations linking entities to attributes, e.g., (e, r, a) = (leopard, is, fast). (ii) Entity facts: relations linking entities, e.g., (e, r, f) = (Paris, is the capital of, France).
In the test set, we mask the objects and generate cloze-style queries of the form "e r [MASK]". The model's task is then to predict the correct object. Table 1 gives definitions and examples for the six  rules (EQUI, SYM, INV, COMP, IMP, NEG) we investigate. The definitions are the basis for our corpus generation algorithms, shown in Figure 1. SYM, INV, COMP generate entity facts and EQUI, IMP, NEG attribute facts. We create a separate corpus for each symbolic rule. Facts are generated by sampling from the underlying vocabulary. For §2.1, this vocabulary consists of 5000 entities, 500 relations and 1000 attributes. Half of the relations follow the rule, the other half is used to generate random facts of entity or attribute type.

Symbolic Reasoning
We can most easily think of the corpus generation as template filling. For example, looking at SYM in Table 1, the template is (e, r, f) ⇐⇒ (f, r, e). We first sample a relation r from R and then two entities e and f from E. We then add (e, r, f) and (f, r, e) to the corpus - this is one instance of applying the SYM rule from which symmetry can be learned. Similarly, the other rules also generate instances: for each of the other rules, the template filling is modified to conform with its definition in Table 1. INV corresponds directly to SYM. COMP is a two-hop rule whereas the other five are one-hop rules. EQUI generates instances from which one can learn that the relations r and s are equivalent. IMP generates implication instances, e.g., (e, r, b) (= (dog, is, mammal)) implies (e, s, a1) (= (dog, has, hair)), (e, s, a2) (= (dog, has, neocortex)), etc. Per premise we create four implied facts.

[Figure 1: Pseudocode for symbolic reasoning corpus generation. "a ∼ A" stands for: a is randomly sampled from A. ("α ∼ A × ... × A": a tuple of 4 attributes is sampled.) The vocabulary consists of entities e, f, g ∈ E, relations r, s, t ∈ R and attributes a, b, c ∈ A. Train/test corpora are formed from C and D. n = 20, m = 800, l = 2. See §2.1 for details.]
For NEG, we generate pairs of facts (e, r, a) (= (jupiter, is, big)) and (e, not r, b) (= (jupiter, is not, small)). We define the antonym function in Figure 1 (NEG) as returning for each attribute its antonym, i.e., attributes are paired, each pair consisting of a positive and a negative attribute. Each of the six generation algorithms has the outer loop "for i ∈ 1...n" (where n = 20) that samples one, two or three relations (and potentially attributes) and generates a subcorpus for these relations; and the inner loop "for j ∈ 1...m" (where m = 800) that generates the subcorpus of instances for the sampled relations.
Train/test split. The data generation algorithms generate two subsets of facts C and D, see Figure 1. For each rule, we merge all of C with 90% of D (randomly sampled) to create the training set. The rest of D (i.e., the other 10%) serves as the test set.
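To make the template filling and the split concrete, the following is a compressed Python sketch of SYM corpus generation; function names, parameter values and sampling details are our own illustration and do not reproduce the exact generation code.

```python
import random

def generate_sym_corpus(entities, relations, n_rule_relations=20, m=800, seed=0):
    """SYM template filling: for each rule relation r, sample pairs (e, f) and
    emit the premise (e, r, f) into C and the conclusion (f, r, e) into D.
    The remaining relations are used for random control facts (added to C)."""
    rng = random.Random(seed)
    C, D = [], []
    rule_rels = rng.sample(relations, n_rule_relations)
    control_rels = [r for r in relations if r not in rule_rels]
    for r in rule_rels:                          # outer loop over rule relations
        for _ in range(m):                       # inner loop: m instances per relation
            e, f = rng.sample(entities, 2)
            C.append((e, r, f))
            D.append((f, r, e))
    for r in control_rels[:n_rule_relations]:    # random facts that follow no rule
        for _ in range(m):
            e, f = rng.sample(entities, 2)
            C.append((e, r, f))
    return C, D

def split(C, D, test_fraction=0.1, seed=0):
    """Training set = C plus 90% of D; test set = the remaining 10% of D."""
    rng = random.Random(seed)
    held = set(rng.sample(range(len(D)), int(test_fraction * len(D))))
    test = [d for i, d in enumerate(D) if i in held]
    train = C + [d for i, d in enumerate(D) if i not in held]
    return train, test

entities = [f"e{i}" for i in range(5000)]
relations = [f"r{i}" for i in range(500)]
train, test = split(*generate_sym_corpus(entities, relations))
```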
For some of the cloze queries "e r [MASK]", there are multiple correct objects that can be substituted for MASK. Thus, we rank predictions and compute precision at m, i.e., precision in the top m where m is the number of correct objects. We average precision at m for all cloze queries.
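The metric itself is a small helper; the sketch below (our own code, with illustrative inputs) averages per-query precision in the top m, where m is the number of correct objects for that query.

```python
def precision_at_m(ranked_predictions, gold_objects):
    """Fraction of the top-m predictions that are correct, with m = |gold_objects|."""
    m = len(gold_objects)
    top_m = ranked_predictions[:m]
    return sum(1 for p in top_m if p in gold_objects) / m

def average_precision_at_m(queries):
    """queries: list of (ranked_predictions, gold_objects) pairs, one per cloze query."""
    scores = [precision_at_m(preds, gold) for preds, gold in queries]
    return sum(scores) / len(scores)

# Example: one query with two correct objects, both ranked in the top 2.
print(average_precision_at_m([(["e7", "e3", "e9"], {"e3", "e7"})]))  # -> 1.0
```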
This experimental setup allows us to test to what extent BERT learns the six rules, i.e., to what extent the facts in the test set are correctly inferred from their premises in the training set.

Memorization
For memorization, the vocabulary consists of 125,000 entities, 20 relations and 2250 attributes.
Effect of frequency on memorization. Our first experiment tests how the frequency of a fact influences its successful memorization by the model. Figure 2 (left, FREQ) gives the corpus generation algorithm. The outer loop generates 800,000 random facts. These are divided into groups of 8000. A fact in the first group of 8000 is added once to the corpus, a fact from the second group is added twice, and so on; a fact from the last group is added 100 times to the corpus. The resulting corpus C is both the training set and the test set.
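A minimal sketch of this generator, following the description above (100 groups of 8,000 facts with frequencies 1 to 100); variable names and sampling details are our own illustration.

```python
import random

def generate_freq_corpus(entities, relations, attributes,
                         n_facts=800_000, group_size=8_000, seed=0):
    """Each random fact in group k (1-based) is added k times to the corpus."""
    rng = random.Random(seed)
    corpus = []
    for i in range(n_facts):
        fact = (rng.choice(entities), rng.choice(relations), rng.choice(attributes))
        frequency = i // group_size + 1          # 1 for the first group, ..., 100 for the last
        corpus.extend([fact] * frequency)
    rng.shuffle(corpus)
    # Training set and test set are both this corpus (we test memorization, not inference).
    return corpus

entities = [f"e{i}" for i in range(125_000)]
relations = [f"r{i}" for i in range(20)]
attributes = [f"a{i}" for i in range(2250)]
corpus = generate_freq_corpus(entities, relations, attributes)
```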
Effect of schema conformity. In this experiment, we investigate the hypothesis that a fact can be memorized more easily if it is schema conformant. Figure 2 (right, SCHEMA) gives the corpus generation algorithm. We first sample an entity group: δ ∼ E × ... × E. For each group, relations are either related to the schema ("if schema") or are not (else clause). For example, for the schema "primate", the relations "eat" (eats fruit) and "climb" (climbs trees) are related to the schema; the relation "build" is not, since some primates build nests and treehouses, but others do not.
For non-schema relations, facts with random attributes are added to the corpus. In Figure 4, we refer to these facts as (facts with) unique attributes. For relations related to the schema, we sample the attributes that are part of the schema: α ∼ A × ... × A (e.g., ("paranut", ..., "banana") for "eat"). Facts are then generated involving these attributes and added to the corpus. In Figure 4, we refer to these facts as (facts with) group attributes. We also generate exceptions (e.g., "eats tubers") since schemas generally have exceptions.
Similarly, the two lines "add = Bernoulli(0.5)" are intended to make the data more realistic: for a group of entities, its relations and its attributes, the complete cross product of all facts is not available to the human learner. For example, a corpus may contain sentences stating that chimpanzees and baboons eat fruit, but none that states that gorillas eat fruit.
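A minimal sketch of the SCHEMA generator as we read the description above; group size, the number of group attributes and the exception rate are illustrative assumptions, not the exact values used.

```python
import random

def generate_schema_corpus(entities, relations, attributes,
                           group_size=10, n_group_attrs=4,
                           p_schema=0.5, p_exception=0.05, seed=0):
    """Schema relations share a small set of group attributes within an entity group;
    non-schema relations get unique random attributes; a few exceptions break the
    schema; Bernoulli(0.5) drops facts so the full cross product is never observed."""
    rng = random.Random(seed)
    corpus = []
    groups = [entities[i:i + group_size] for i in range(0, len(entities), group_size)]
    for group in groups:
        for r in relations:
            if rng.random() < p_schema:                      # schema-related relation
                group_attrs = rng.sample(attributes, n_group_attrs)
                for e in group:
                    for a in group_attrs:
                        if rng.random() < 0.5:               # add = Bernoulli(0.5)
                            corpus.append((e, r, a))         # fact with group attribute
                    if rng.random() < p_exception:
                        corpus.append((e, r, rng.choice(attributes)))  # exception
            else:                                            # non-schema relation
                for e in group:
                    if rng.random() < 0.5:                   # add = Bernoulli(0.5)
                        corpus.append((e, r, rng.choice(attributes)))  # unique attribute
    return corpus
```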
For this second memorization experiment, training set and test set are again identical (i.e., both equal C).
In a final experiment, we modify SCHEMA as follows: exceptions are added 10 times to the corpus (instead of once). This tests the interaction between schema conformity and frequency.

BERT Model
BERT uses a deep bidirectional Transformer (Vaswani et al., 2017) encoder to perform masked language modeling. During pretraining, BERT randomly masks positions and learns to predict the fillers. We use the source code provided by Wolf et al. (2019). Following Liu et al. (2019), we perform dynamic masking and use no next sentence prediction objective. For the symbolic rules, we start with BERT-base and tune hyperparameters. We vary the number of layers to rule out that rule learning fails due to over-parametrization; see the appendix for details. We report precision for the optimal configuration.
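Such a setup can be instantiated with the Hugging Face transformers library (Wolf et al., 2019) roughly as sketched below; the word-level tokenizer, the vocabulary contents and the use of the masked-LM data collator are our own illustration, not the exact training scripts. The scaled-down configuration shown is the one used for the memorization experiments (see the next paragraph); the symbolic-rule experiments instead start from the BERT-base configuration.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import (BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, PreTrainedTokenizerFast)

# Word-level tokenizer over the synthetic vocabulary: every entity/relation/attribute
# string is its own token (vocabulary contents here are illustrative).
words = ["[PAD]", "[UNK]", "[MASK]"] + [f"e{i}" for i in range(5000)] \
        + [f"r{i}" for i in range(500)] + [f"a{i}" for i in range(1000)]
vocab = {w: i for i, w in enumerate(words)}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tok, unk_token="[UNK]",
                                    pad_token="[PAD]", mask_token="[MASK]")

# Scaled-down configuration used for the memorization experiments; the symbolic-rule
# experiments start from 12 layers, 12 heads, hidden size 768, intermediate size 3072.
config = BertConfig(vocab_size=len(vocab), num_hidden_layers=1, num_attention_heads=3,
                    hidden_size=192, intermediate_size=768)
model = BertForMaskedLM(config)

# Dynamic masking (positions are re-masked on the fly), no next sentence prediction.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                            mlm_probability=0.15)
```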
For the memorization experiments, our goal is to investigate the effect of frequency on memorization. Due to limited compute infrastructure, we scale BERT down to a single hidden layer with 3 attention heads, a hidden size of 192 and an intermediate size of 768.

Table 2 gives results for the symbolic reasoning experiments. BERT has high test set precision for EQUI, SYM, INV and IMP. As we see in Table 1, these rules share that they are "one-hop": the inference can be made straightforwardly from a single premise to a conclusion, e.g., "(barack married michelle)" implies "(michelle married barack)". The crucial difference to prior work is that the premise is not available at inference time: "(michelle married barack)" is correctly inferred by the model based on its memory of having seen the fact "(barack married michelle)" in the training set and based on the successful acquisition of the symmetry rule. Table 2 thus suggests that BERT is able to learn one-hop rules and can successfully apply them in a natural setting in which the premise is not directly available.

Symbolic Reasoning
In the rest of this section, we investigate these results further for SYM, INV, NEG and COMP. Table 2 seems to indicate that BERT can learn that a relation r is symmetric (SYM) and that s and t are inverses (INV) -the evidence is that it generates facts based on the successfully acquired symmetry and inversion properties of the relations r, s and t. We now show that while BERT acquires SYM and INV partially, it also severely overgenerates. Our analysis points to the complexity of evaluating rule learning in PLMs and opens interesting avenues for future work.

Analysis of SYM and INV
Our first observation is that in the SYM experiment, BERT understands all relations to be symmetric. Recall that of the total of 500 relations, 250 are symmetric and 250 are used to generate random facts. If we take a fact with a random relation r, say (e, r, f), and prompt BERT with "(f, r, [MASK])", then e is predicted in close to 100% of cases. So BERT has simply learned that any relation is symmetric as opposed to distinguishing between symmetric and non-symmetric relations.
This analysis brings to light that our setup is unfair to BERT: it never sees evidence for non-symmetry. To address this, we define a new experiment, which we call ANTI because it includes an additional set of "anti" relations that are sampled from R* with R* ∩ R = ∅ and |R| = |R*|. ANTI facts take the following form: (e, r, f), (f, r, g) with e ≠ g. Using this ANTI template we follow the standard data generation procedure. The corpus is now composed of symmetric, anti-symmetric and random facts. ANTI training data indicate to BERT that r ∈ R* is not symmetric since many instances of r facts are seen, with specific entities (f in the example) occurring in both slots, but there is never a symmetric example.
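For concreteness, a small sketch of how one ANTI instance can be generated (names are illustrative): the shared entity f appears in both argument slots of r, but never with a symmetric counterpart, which is the only evidence of non-symmetry the model receives.

```python
import random

def generate_anti_instance(rng, entities, r):
    """One ANTI instance: (e, r, f) and (f, r, g) with e != g, and no symmetric fact."""
    e, f, g = rng.sample(entities, 3)   # sampling without replacement guarantees e != g
    return [(e, r, f), (f, r, g)]

rng = random.Random(0)
print(generate_anti_instance(rng, [f"e{i}" for i in range(5000)], "r7"))
```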
Table 2 (ANTI) shows that BERT memorizes ANTI facts seen during training, but on the test set, BERT only recognizes 14.85% of ANTI facts as non-symmetric. So it still generalizes from the 250 symmetric relations to most other relations (85.15%), even those without any "symmetric" evidence in training. So it is easy for BERT to learn the concept of symmetry, but it is hard to teach it to distinguish between symmetric and non-symmetric relations.
Similar considerations apply to INV. BERT successfully predicts correct facts once it has learned that s and t are inverses -but it overgeneralizes by also predicting many incorrect facts; e.g., for (e, s, f ) in train, it may predict (f, t, e) (correct), but also (e, t, f ) and (f, s, e) (incorrect).
In another INV experiment, we add, for each pair of (f, r, e) and (e, s, f), two facts that give evidence of non-symmetry: (f, r, g) and (e, s, h) with e ≠ g and h ≠ f. We find that test set precision for INV (i.e., inferring (e, s, f) in test from (f, r, e) in train) drops to 17% in this scenario. As for SYM, this indicates how complex the evaluation of rule learning is.
In summary, we have found that SYM and INV are learned in the sense that BERT generates correct facts for symmetric and inverse relations. But it severely overgenerates. Our analysis points to a problem of neural language models that has not received sufficient attention: they can easily learn that the order of arguments is not important (as is the case for SYM relations), but it is hard for them to learn that this is the case only for a subset of relations. Future work will have to delineate the exact scope of this finding - e.g., it may not hold for much larger training sets with millions of occurrences of each relation. Note, however, that human learning is likely to have a bias against symmetry in relations since the vast majority of verbs in English (and presumably relations in the world) are asymmetric. So unless we have explicit evidence for symmetry, we are likely to assume a relation is non-symmetric. Our results suggest that neural language models do not have this bias - which would be problematic when using them for learning from natural language text.

Analysis of NEG
NEG was the only rule for which parameter tuning improved performance. A reduction to four layers obtained optimal results.
In Table 2 we report a test set precision of 20.54%. Why is negation more challenging than implication? Implication allows the model to generalize over several entities all following the same rule (e.g., every animal that is a mammal has a neocortex). This does not hold for negation (e.g., a leopard is fast but a snail is not fast). BERT must learn antonym negation from a large number of possible combinations. By reducing the number of possible combinations (decreasing the number of attributes from 1000 to 500, 250 and 125) BERT's test set precision increases, see Figure 3 (A). With 125 attributes a precision of 91% is reached. A reduction of attributes makes antonym negation very similar to implication.
We investigate BERT's behavior concerning negation further by adding an additional attribute set A*, with A* ∩ A = ∅ and |A| = |A*|, to the vocabulary. A* does not follow an antonym schema. We sample a ∈ A*, e ∈ E, r ∈ R to add additional random facts of the type (e, r, a) or (e, not r, a) to NEG's training set. After training, we test on these additional random facts by inserting or removing the negation marker. We find that BERT is prone to predict both (e, r, b) and (e, not r, b) for b ∈ A* (in 38% of cases). Antonym negation was still learned.
We conclude that antonym negation can be learned via co-occurrences but a general concept of negation is not understood. This is in agreement with prior work (Ettinger, 2020;Kassner and Schütze, 2020) showing that BERT trained on natural language corpora is as likely to generate a true statement like "birds can fly" as a factually false negated statement like "birds cannot fly".

Analysis of COMP
Why does BERT not learn COMP? COMP differs from the other rules in that it involves two-hop reasoning. Recall that a novelty of our experimental setup is that premises are not presented at inference time - two-hop reasoning requires that two different facts be "remembered" to make the inference, which intuitively is harder than a one-hop inference. Figure 3 (B) shows that the problem is not undertraining (orange line).
Similar to the memorization experiment, we investigate whether stronger semantic structure in the form of a schema can make COMP learnable. We refer to this new experiment as COMP enhanced. Data generation is defined as follows: Entities are divided into groups of 10. Relations are now defined between groups in the sense that the members of a group are "equivalent". More formally, we sample entity groups (groups of 10) E1, E2, E3 and relations r, s, t. For all e1 ∈ E1, e2 ∈ E2, e3 ∈ E3, we add (e1, r, e2) and (e2, s, e3) to C and (e1, t, e3) to D. In addition, we introduce a relation "samegroup" and add, for all em, en ∈ Ei, (em, samegroup, en) to C - this makes it easy to learn group membership. As before, the training set is the merger of C and 90% of D and the test set is the rest of D.
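A sketch of the enhanced COMP generation under these assumptions (group size and the samegroup relation follow the text; relation names and the number of sampled triples are our own illustration):

```python
import random

def generate_comp_enhanced(entities, n_triples=20, group_size=10, seed=0):
    """COMP enhanced: sample entity groups E1, E2, E3 and relations r, s, t;
    add (e1, r, e2) and (e2, s, e3) to C and the two-hop conclusion (e1, t, e3) to D;
    samegroup facts make group membership explicit."""
    rng = random.Random(seed)
    groups = [entities[i:i + group_size] for i in range(0, len(entities), group_size)]
    C, D = [], []
    for i in range(n_triples):
        E1, E2, E3 = rng.sample(groups, 3)
        r, s, t = f"r{3 * i}", f"r{3 * i + 1}", f"r{3 * i + 2}"   # illustrative names
        for e1 in E1:
            for e2 in E2:
                C.append((e1, r, e2))
            for e3 in E3:
                D.append((e1, t, e3))                              # held-out conclusions
        for e2 in E2:
            for e3 in E3:
                C.append((e2, s, e3))
        for E in (E1, E2, E3):                                     # samegroup facts
            for em in E:
                for en in E:
                    if em != en:
                        C.append((em, "samegroup", en))
    return C, D
```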
Similar semantic structures occur in real data. The simplest case is a transitive example: (r) planes (group 1) are faster than cars (group 2), (s) cars (group 2) are faster than bikes (group 3), (t) planes (group 1) are faster than bikes (group 3). Figure 3 (B) shows that BERT can learn COMP moderately well from this schema-enhanced corpus (blue curve): precision is clearly above 50% and peaks at 76%.
The takeaway from this experiment is that two-hop rules pose a challenge to BERT, but that they are learnable if entities and relations are embedded in a rich semantic structure. Prior work (Brown et al., 2020) has identified the absence of "domain models" (e.g., a domain model for common sense physics) as one shortcoming of PLMs. To the extent that PLMs lack such domain knowledge (which we simulate here with a schema), they may not be able to learn COMP.

Natural Language Corpora
In this section, we investigate to what extent the PLMs BERT and RoBERTa have learned SYM and INV from natural language corpora. See Table 3. For "smaller/larger" (INV), we follow Talmor et al. (2019) and test which of the two words is selected as the more likely filler in a pattern like "Jupiter is [MASK] than Mercury". For the other three relations ("shares borders with" (SYM), "is the opposite of" (SYM), "is the capital of" / "'s capital is" (INV)), we test whether the correct object is predicted in the pattern "e r [MASK]" (as in the rest of the paper). We give the number of (i) consistent ("cons."), (ii) correct and consistent ("correct") and (iii) inconsistent ("inc.") predictions. (A prediction is consistent and incorrect if it is consistent with the rule, but factually incorrect.) In more detail, we take a set of entities (countries like "Indonesia", cities like "Jakarta") or adjectives like "low" that are appropriate for the relation and test which of the entities / adjectives is predicted. For each of the five relations, we run both BERT-large-cased and RoBERTa-large and report the more consistent result.

Consistency and accuracy are high for "shares borders with" and "capital". However, this is most likely due to the fact that many of these facts occur verbatim in the training corpora of the two models. For example, Google shows 54,800 hits for "jakarta is the capital of indonesia" and 1,290 hits for "indonesia's capital is jakarta" (both as a phrase). It is not possible to determine which factor is decisive here: successful rule-based inference or memorization. The ultimate futility of this analysis is precisely the reason that we chose to work with synthetic data.
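This kind of consistency probe can be approximated with standard masked-LM scoring; the sketch below uses the Hugging Face masked-LM interface for the capital relation in both directions. Patterns and candidate sets are illustrative, candidates are assumed to be single tokens in the model vocabulary, and this is not the exact evaluation script.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")
model = BertForMaskedLM.from_pretrained("bert-large-cased")
model.eval()

def top_prediction(text, candidates):
    """Return the candidate with the highest masked-LM score at the [MASK] position."""
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    cand_ids = tokenizer.convert_tokens_to_ids(candidates)   # single-token candidates only
    return candidates[int(torch.argmax(logits[cand_ids]))]

cities = ["Jakarta", "Paris", "Berlin"]
countries = ["Indonesia", "France", "Germany"]
forward = top_prediction("Jakarta is the capital of [MASK].", countries)
backward = top_prediction("Indonesia's capital is [MASK].", cities)
print(forward, backward)  # consistent if the (Indonesia, Jakarta) pair is predicted both ways
```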
Consistency for "is the opposite of" is much lower than for the first two relations, but still decent. To investigate this relation further, we also tested the relation "is the same as". It turns out that many of the "opposite" objects are also predicted for "is the same as", e.g., "high is the same as [low]" and "low is the same as [high]", where the bracketed word is the model's prediction. This indicates that the models have not really learned that "is the opposite of" is symmetric, but rather know that antonyms are closely associated and often occur together in phrases like "X is the opposite of Y", "X and Y", "X noun, Y noun" (e.g., "good cop, bad cop") etc. Apparently, this is then incorrectly generalized to "is the same as".
Consistency and accuracy are worse for "smaller/larger". "smaller/larger" sentences of the sort considered here are probably rarer in genres like Wikipedia than "shares borders with" and "is the capital of". A Wikipedia article about a country will always say what its capital is and which countries it borders, but it will not enumerate the countries that are smaller or larger.
In summary, although we have shown that pretrained language models have some ability to learn symbolic rules, there remains considerable doubt that they can do so based on natural corpora.

Memorization
Experimental results for the memorization experiments are shown in Figure 4.
(A) shows that frequent facts are memorized well (0.8 for frequency 100) and that rare facts are not (≈ 0.0 for frequencies < 15).
(B) shows that BERT memorizes schema-conformant facts perfectly ("group attributes"). Accuracy for exceptions is clearly lower than that for schema-conformant facts: about 80%. The frequency of each fact in the training corpus in this experiment is 1. Overall, the total number of exceptions is much lower than the total number of schema-conformant facts.
(C) shows that exceptions are perfectly learned if 10 copies of each exception are added to the corpus -instead of 1 in (B). In this case, limited capacity affects memorization of schema-conformant facts: accuracy drops to ≈ 0.9.
In summary, we find that both frequency and schema conformity facilitate memorization. Schema-conformant facts and exceptions compete for memory if memory capacity is limited; depending on frequency, one or the other is preferentially learned by BERT.

Limitations
Our experimental design makes many simplifying assumptions: i) Variation in generated data is more limited than in naturally occurring data. ii) Semantics are deliberately restricted to one rule only per generated corpus. iii) We do not investigate effects of model and corpus size.
i) In natural corpora relations can have more than two arguments, entities can have several tokens, natural data are noisier than synthetic data etc. Also, we study each rule in isolation.
ii) While our simplified corpora make learning easier in some respects, they may make it harder in others. Each corpus is focused on providing training material for one symbolic rule, but it does not contain any other "semantic" signal that may be helpful in learning symbolic reasoning: distributional signals, entity groupings, hierarchies, rich context etc. The experimental results of "COMP enhanced" indicate that such signals are indeed beneficial to symbolic rule learning. The interplay of such additional sources of information with symbolic rule learning is an interesting question for follow-up work.
iii) Results are based on BERT-base and scaled-down versions of BERT-base only, just as our training corpora are orders of magnitude smaller than natural training corpora. We varied model and corpus sizes within the limits of our compute infrastructure, but did not systematically study their effect on our findings.
Our work is an initial exploration of the question whether symbolic rules can be learned in principle, but we view it mainly as a starting point for future work.

Related Work
Radford et al. (2019) and Petroni et al. (2019) show in a zero-shot question answering setting that PLMs have factual knowledge. Our main question is: under what conditions do PLMs learn factual knowledge, and do they do so through memorization or rule-based inference? Sun et al. (2019) and Zhang et al. (2020) show in the knowledge graph domain that models that have the ability to capture symbolic rules like SYM, INV and COMP outperform ones that do not. We investigate this question for PLMs that are trained on language corpora.
Talmor et al. (2019) test PLMs' symbolic reasoning capabilities, probing pretrained and finetuned models with cloze-style queries. Their setup makes it impossible to distinguish whether a fact was inferred or memorized during pretraining. Our synthetic corpora allow us to make this distinction. Clark et al. (2020) test finetuned BERT's reasoning capabilities, but they always make premise and conclusion locally available to the model, during training and inference. This is arguably not the way much of human inference works; e.g., the fact F that X borders Y allows us to infer that Y borders X even if we were exposed to F a long time ago. Other work introduces synthetic corpora testing logic and monotonicity reasoning, showing that BERT performs poorly on these new datasets but can be quickly finetuned to good performance. The difference to our work again is that the premise is made available to the model at inference time.
For complex reasoning QA benchmarks (Yang et al., 2018; Sinha et al., 2019), PLMs are finetuned to the downstream tasks. Their performance is difficult to analyze: it is not clear whether any reasoning capability is learned by the PLM or by the task-specific component.
Another line of work (Gururangan et al., 2018; Kaushik and Lipton, 2018; Dua et al., 2019; McCoy et al., 2019) shows that much of PLMs' performance on reasoning tasks is due to statistical artifacts in datasets and does not exhibit true reasoning and generalization capabilities. With the help of synthetic corpora, we can cleanly investigate PLMs' reasoning capabilities. Hupkes et al. (2020) study the ability of neural models to capture compositionality. They do not investigate our six rules, nor do they consider the effects of fact frequency and schema conformity. Our work confirms their finding that transformers have the ability to capture both rules and exceptions.
A large body of research in psychology and cognitive science has investigated how some of our rules are processed in humans, e.g., Sloman (1996) for implication. There is also a lively debate in cognitive science as to how important rule-based reasoning is for human cognition (Politzer, 2007). Yanaka et al. (2020) and Goodwin et al. (2020) are concurrent studies of systematicity in PLMs. The former shows that monotonicity inference is feasible for syntactic structures close to the ones observed during training. The latter shows that PLMs can exhibit high overall performance on natural language inference despite being non-systematic. Roberts et al. (2020) show that the amount of knowledge captured by PLMs increases with model size. Our memorization experiments investigate the factors that determine successful acquisition of knowledge. Guu et al. (2020) modify the PLM objective to incentivize knowledge acquisition. They do not consider symbolic rule learning, nor do they analyze which factors influence successful memorization.
Based on perceptrons and convolutional neural networks, Arpit et al. (2017) and Zhang et al. (2017) study generalization from real structured data vs. memorization of random noise in the image domain, similar to our study of schema-conformant facts and outliers. They do not study transformer-based models trained on natural language.

Conclusion
We studied BERT's ability to capture knowledge from its training corpus by investigating its reasoning and memorization capabilities. We identified factors influencing what makes successful memorization possible and what is learnable beyond knowledge explicitly seen during training. We saw that, to some extent, BERT is able to infer facts not explicitly seen during training via symbolic rules.
Overall, effective knowledge acquisition must combine both parts of this paper: memorization and symbolic reasoning. A PLM is not able to store an unlimited amount of knowledge. By acquiring reasoning capabilities, it can fill knowledge gaps based on memorized facts. A schema-conformant fact ("pigeons can fly") need not be memorized if a few facts indicate that birds fly; the ability to fly can then be filled in for the other birds. The schema conformity experiments suggest that this is happening. It is easier to capture knowledge that conforms with a schema than to memorize facts one by one.
There are several directions for future work. First, we made many simplifying assumptions that should be relaxed in future work. Second, how can we improve PLMs' ability to learn symbolic rules? We see two avenues here, either additional inductive biases could be imposed on PLMs' architectures or training corpora could be modified to promote learning of symbolic rules.

A.1 Model hyperparameters
For all reported results we trained with a batch size of 1024 and a learning rate of 6e-5.
Our experiments for symbolic rules started with the BERT-base model with 12 layers, 12 attention heads, a hidden size of 768 and an intermediate size of 3072. For rules with low test precision (NEG and COMP) we then conducted a restricted grid search (restricted due to limited compute infrastructure): we tried all possible numbers of layers from 1 to 12 and kept only the best result. For NEG the best performance came from 4 layers, whereas COMP did not show improvements for any number of layers. For NEG with 3 layers (which had very similar performance to 4 layers), we additionally tested whether changing the number of attention heads, the hidden size or the intermediate size improves precision. For this we trained with the following 4 settings:
• attention heads = 6, hidden size = 768, intermediate size = 3072
• attention heads = 12, hidden size = 384, intermediate size = 1536
• attention heads = 12, hidden size = 192, intermediate size = 768
• attention heads = 12, hidden size = 96, intermediate size = 192
However, this did not further improve precision.

A.2 Data hyperparameters
In previous iterations of our experiments, we used different settings for generating our data. For instance, we varied the number of rules in our corpora: 50 or 100 instead of the presented 20. The sampling process itself can also be tweaked to allow for fewer overlaps between rules and between instances of one rule. However, we observed the same trends and similar numbers across these different settings.

B Symbolic rules
In the following sections, we present illustrative corpora for INV, IMP and COMP enhanced. Each line is one datapoint. At the end of each corpus, we also include the control group that does not follow any rule. In the case of COMP enhanced, "{...}" indicates the sampled group, which is not part of the actual dataset. We illustrate our training corpora using real-world entities and relations; note that the actual corpora used for training are composed of an entirely synthetic vocabulary. For simplicity we show grouped composition with enhancement using groups of 4, instead of 10 as in the real data.