Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers

Following the major success of neural language models (LMs) such as BERT or GPT-2 on a variety of language understanding tasks, recent work has focused on injecting (structured) knowledge from external resources into these models. While joint pre-training (i.e., training from scratch, adding objectives based on external knowledge to the primary LM objective) may be prohibitively computationally expensive, post-hoc fine-tuning on external knowledge may lead to catastrophic forgetting of distributional knowledge. In this work, we investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus, respectively, using adapter training. While the overall results on the GLUE benchmark paint an inconclusive picture, a deeper analysis reveals that our adapter-based models substantially outperform BERT (by up to 15-20 performance points) on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS. We open-source all our experiments and relevant code at: https://github.com/wluper/retrograph.

Starting from this observation, most recent efforts focused on injecting factual (Zhang et al., 2019; Liu et al., 2019a; Peters et al., 2019) and linguistic knowledge (Lauscher et al., 2019; Peters et al., 2019) into pretrained LMs and demonstrated the usefulness of such knowledge in language understanding tasks (Wang et al., 2018, 2019). Joint pretraining models, on the one hand, augment distributional LM objectives with additional objectives based on external resources (Yu and Dredze, 2014; Nguyen et al., 2016; Lauscher et al., 2019) and train the extended model from scratch. For models like BERT, this implies computationally expensive retraining of the encoding transformer network from scratch. Post-hoc fine-tuning models (Zhang et al., 2019; Liu et al., 2019a; Peters et al., 2019), on the other hand, use the objectives based on external resources to fine-tune the encoder's parameters, pretrained via distributional LM objectives. If the amount of fine-tuning data is substantial, however, this approach may lead to catastrophic forgetting of the distributional knowledge obtained in pretraining (Goodfellow et al., 2014; Kirkpatrick et al., 2017).
In this work, similar to the concurrent work of Wang et al. (2020), we turn to the recently proposed adapter-based fine-tuning paradigm (Rebuffi et al., 2018; Houlsby et al., 2019), which remedies the shortcomings of both joint pretraining and standard post-hoc fine-tuning. Adapter-based training injects additional parameters into the encoder and tunes only their values: the original transformer parameters are kept fixed. Because of this, adapter training preserves the distributional information obtained in LM pretraining, without the need for any distributional (re-)training. While Wang et al. (2020) inject factual knowledge from Wikidata (Vrandečić and Krötzsch, 2014) into BERT, in this work we investigate two resources that are commonly assumed to contain general-purpose and common sense knowledge¹: ConceptNet (Liu and Singh, 2004; Speer et al., 2017) and the Open Mind Common Sense (OMCS) corpus (Singh et al., 2002), from which the ConceptNet graph was (semi-)automatically extracted. For our first model, dubbed CN-ADAPT, we first create a synthetic corpus by randomly traversing the ConceptNet graph and then learn adapter parameters with masked language modelling (MLM) training (Devlin et al., 2019) on that synthetic corpus. For our second model, named OM-ADAPT, we learn the adapter parameters via MLM training directly on the OMCS corpus.
We evaluate both models on the GLUE benchmark, where we observe limited improvements over BERT on a subset of GLUE tasks. However, a more detailed inspection reveals large improvements over the base BERT model (up to 20 Matthews correlation points) on natural language inference (NLI) subsets labeled as requiring World Knowledge or knowledge about Named Entities. Investigating further, we relate this result to the fact that ConceptNet and OMCS contain much more of what the downstream diagnostics consider factual world knowledge than of what is judged to be common sense knowledge. Our findings pinpoint the need, within the emerging body of work on injecting knowledge into pretrained transformers, for more detailed analyses of the compatibility between (1) the types of knowledge contained in external resources and (2) the types of knowledge that benefit concrete downstream tasks.

Knowledge Injection Models
In this work, we primarily set out to investigate whether injecting specific types of knowledge (given in the external resource) benefits downstream inference that clearly requires those exact types of knowledge. Because of this, we use arguably the most straightforward mechanisms for injecting the ConceptNet and OMCS information into BERT, and leave the exploration of potentially more effective knowledge injection objectives for future work. We inject the external information into the adapter parameters of the adapter-augmented BERT (Houlsby et al., 2019) via BERT's natural objective, masked language modelling (MLM). OMCS, already a corpus in natural language, is directly subjectable to MLM training; we only filtered out non-English sentences. To subject ConceptNet to MLM training, we first need to transform it into a synthetic corpus.

¹ Our results in §3.2 scrutinize this assumption.
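Both adapter variants are trained with this MLM objective. The following is a minimal sketch of BERT-style input corruption; it is simplified (real BERT additionally keeps 10% of selected tokens unchanged and replaces 10% with random vocabulary tokens, whereas this helper masks every selected position), and the function name is ours:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Simplified MLM corruption: replace ~15% of tokens with [MASK].

    Returns the corrupted input and a parallel label list that holds the
    original token at masked positions and None elsewhere (not scored).
    """
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)   # the model must predict the original token
        else:
            inputs.append(tok)
            labels.append(None)  # position not scored by the MLM loss
    return inputs, labels
```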
Unwrapping ConceptNet. Following established previous work (Perozzi et al., 2014; Ristoski and Paulheim, 2016), we induce a synthetic corpus from ConceptNet by randomly traversing its graph. We convert relation strings into natural-language phrases (e.g., synonyms to "is a synonym of") and duplicate the object node of a triple, using it as the subject of the next sentence. For example, from the path "alcoholism [causes] stigma [usedInContext] christianity [partOf] religion", we create the text "alcoholism causes stigma. stigma is used in the context of christianity. christianity is part of religion.". We set the walk length to 30 relations and sample the starting and neighboring nodes from uniform distributions. In total, we performed 2,268,485 walks, resulting in a corpus of 34,560,307 synthetic sentences.
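The corpus-induction procedure above can be sketched as follows. The toy graph reproduces only the paper's running example, and the relation-to-phrase mapping and function names are our illustrative choices, not the authors' exact code:

```python
import random

# Tiny illustrative graph: node -> list of (relation, object) edges.
GRAPH = {
    "alcoholism": [("causes", "stigma")],
    "stigma": [("usedInContext", "christianity")],
    "christianity": [("partOf", "religion")],
    "religion": [],
}

# Hypothetical mapping from relation labels to natural-language phrases.
RELATION_PHRASES = {
    "causes": "causes",
    "usedInContext": "is used in the context of",
    "partOf": "is part of",
}

def random_walk_sentences(graph, start, max_hops, rng=random):
    """Random walk over the graph, verbalizing each triple as a sentence.

    The object of each triple is duplicated as the subject of the next
    sentence, as described in the paper.
    """
    sentences, node = [], start
    for _ in range(max_hops):
        edges = graph.get(node)
        if not edges:  # dead end: stop the walk early
            break
        relation, obj = rng.choice(edges)
        sentences.append(f"{node} {RELATION_PHRASES[relation]} {obj}.")
        node = obj
    return sentences
```

On this toy graph the walk is deterministic and reproduces the example text from the paper; on the full ConceptNet graph, start and neighbor nodes would be sampled uniformly, with walks of up to 30 relations.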
Adapter-Based Training. We follow Houlsby et al. (2019) and adopt the adapter-based architecture for which they report solid performance across the board. In each of BERT's transformer layers, we insert two bottleneck adapters: one after the multi-head attention sub-layer and another after the feed-forward sub-layer. Let X ∈ ℝ^(T×H) be the sequence of contextualized vectors (of size H) for an input of T tokens in some transformer layer, fed into a bottleneck adapter. The bottleneck adapter, consisting of two feed-forward layers and a residual connection, yields the following output:

Adapter(X) = X + f(X W_d + b_d) W_u + b_u,

where W_d (with bias b_d) and W_u (with bias b_u) are the adapter's parameters, that is, the weights of the linear down-projection and up-projection sub-layers, and f is a non-linear activation function. The matrix W_d ∈ ℝ^(H×m) compresses the vectors in X to the adapter size m < H, and the matrix W_u ∈ ℝ^(m×H) projects the activated down-projections back to the transformer's hidden size H. The ratio H/m determines how many times fewer parameters we optimize with adapter-based training compared to standard fine-tuning of all of the transformer's parameters.
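As a concrete illustration, the adapter computation can be sketched in NumPy. This is a minimal sketch: the tanh approximation of GELU and the toy initialization are our choices, not the paper's exact implementation:

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (Hendrycks and Gimpel, 2016).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def adapter(X, W_d, b_d, W_u, b_u):
    """Bottleneck adapter: down-project, activate, up-project, add residual."""
    return X + gelu(X @ W_d + b_d) @ W_u + b_u

# Toy dimensions matching the paper's setup: hidden size H = 768,
# adapter bottleneck m = 64, a sequence of T = 4 tokens.
rng = np.random.default_rng(0)
T, H, m = 4, 768, 64
X = rng.standard_normal((T, H))
W_d, b_d = 0.02 * rng.standard_normal((H, m)), np.zeros(m)
W_u, b_u = 0.02 * rng.standard_normal((m, H)), np.zeros(H)
out = adapter(X, W_d, b_d, W_u, b_u)
assert out.shape == (T, H)  # the adapter preserves the hidden size
```

Note that with the up-projection initialized to zero, the adapter reduces to the identity via the residual connection, which is what makes inserting adapters into a pretrained network safe at the start of training.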

Evaluation
We first briefly describe the downstream tasks and training details, and then proceed with the discussion of results obtained with our adapter models.

Experimental Setup.
Downstream Tasks. We evaluate BERT and our two adapter-based models, CN-ADAPT and OM-ADAPT, with injected knowledge from ConceptNet and OMCS, respectively, on tasks from the GLUE benchmark (Wang et al., 2018).

Training Details. We inject our adapters into a BERT Base model (12 transformer layers with 12 attention heads each; H = 768) pretrained on lowercased corpora. Following Houlsby et al. (2019), we set the size of all adapters to m = 64 and use GELU (Hendrycks and Gimpel, 2016) as the adapter activation f. We train the adapter parameters with the Adam algorithm (Kingma and Ba, 2015), with the initial learning rate set to 1e-4, 10,000 warm-up steps, and a weight decay factor of 0.01. In downstream fine-tuning, we train in batches of size 16 and limit input sequences to T = 128 WordPiece tokens. For each task, we find the optimal hyperparameter configuration from the following grid: learning rate l ∈ {2e-5, 3e-5}, number of epochs n ∈ {3, 4}.
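The per-task grid search described above can be sketched as follows. The function and variable names are ours, and the scoring callback stands in for an actual GLUE fine-tuning run:

```python
from itertools import product

# Fine-tuning grid from the setup above: l in {2e-5, 3e-5}, n in {3, 4}.
GRID = {"learning_rate": [2e-5, 3e-5], "num_epochs": [3, 4]}

def grid_configs(grid):
    """Enumerate all hyperparameter combinations as dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

def select_best(configs, score_fn):
    """Pick the configuration maximizing the dev-set score.

    `score_fn` would fine-tune the model with the given config and
    return its score on the task's development set.
    """
    return max(configs, key=score_fn)

configs = grid_configs(GRID)  # 2 learning rates x 2 epoch counts = 4 runs
```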

Results and Analysis
GLUE Results. Table 1 shows the performance of CN-ADAPT and OM-ADAPT in comparison with BERT Base on the GLUE evaluation tasks. We show results for two snapshots of OM-ADAPT, after 25K and 100K update steps, and for two snapshots of CN-ADAPT, after 50K and 100K steps of adapter training. Overall, none of our adapter-based models with injected external knowledge from ConceptNet or OMCS yields significant improvements over BERT Base on GLUE. However, we observe substantial improvements (of around 3 points) on RTE and on the Diagnostics NLI dataset (Diag), which encompasses inference instances that require specific types of knowledge.
Since our adapter models draw specifically on the conceptual knowledge encoded in ConceptNet and OMCS, we expect the positive impact of the injected external knowledge (assuming effective injection) to be most observable on test instances that target the same types of conceptual knowledge. To investigate this further, we measure model performance across the different categories of the Diagnostic NLI dataset. This allows us to tease apart the inference instances that truly test the efficacy of our knowledge injection methods. We show the results obtained on the different categories of the Diagnostic NLI dataset in Table 2. The improvements of our adapter-based models over BERT Base on these phenomenon-specific subsections of the Diagnostics NLI dataset are generally much more pronounced: e.g., OM-ADAPT (25K) yields a 7% improvement on inference that requires factual or common sense knowledge (KNO), whereas CN-ADAPT (100K) yields a 6% boost on inference that depends on lexico-semantic knowledge (LS). These results suggest (1) that ConceptNet and OMCS do contain the specific types of knowledge required for these inference categories and (2) that we managed to inject that knowledge into BERT by training adapters on these resources.
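The category-level breakdown above amounts to grouping diagnostic predictions by their knowledge-type label and scoring each group, e.g. with the Matthews correlation coefficient used throughout the Diagnostics evaluation. A sketch using Gorodkin's multiclass generalization of MCC (helper names are ours):

```python
from collections import Counter
from math import sqrt

def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (Gorodkin, 2004)."""
    s = len(y_true)                               # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correct predictions
    t_counts, p_counts = Counter(y_true), Counter(y_pred)
    labels = set(y_true) | set(y_pred)
    cov = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    denom = sqrt((s * s - sum(v * v for v in p_counts.values()))
                 * (s * s - sum(v * v for v in t_counts.values())))
    return cov / denom if denom else 0.0

def per_category_mcc(examples):
    """examples: (category, gold, prediction) triples -> {category: MCC}."""
    by_cat = {}
    for cat, gold, pred in examples:
        golds, preds = by_cat.setdefault(cat, ([], []))
        golds.append(gold)
        preds.append(pred)
    return {cat: mcc(g, p) for cat, (g, p) in by_cat.items()}
```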
Fine-Grained Knowledge Type Analysis. In our final analysis, we "zoom in" on our models' performance on three fine-grained categories of the Diagnostics NLI dataset: inference instances that require Common Sense Knowledge (CS), World Knowledge (World), and knowledge about Named Entities (NE), respectively. The results for these fine-grained categories are given in Table 3. They show an interesting pattern: our adapter-based knowledge-injection models massively outperform BERT Base (by up to 15 and 21 MCC points, respectively) on NLI instances labeled as requiring World Knowledge or knowledge about Named Entities. In contrast, we see drops in performance on instances labeled as requiring common sense knowledge. This initially came as a surprise, given the common belief that OMCS and ConceptNet contain so-called common sense knowledge. On closer inspection, however, many of the CS inference instances require complex, high-level reasoning, understanding of metaphorical and idiomatic meaning, and far-reaching connections. We display NLI Diagnostics examples from the World Knowledge and Common Sense categories in Table 4. In such cases, explicit conceptual links often do not suffice for a correct inference, and much of the required knowledge is not explicitly encoded in the external resources. Consider, e.g., the following CS NLI instance: [premise: My jokes fully reveal my character; hypothesis: If everyone believed my jokes, they'd know exactly who I was; label: entailment]. While ConceptNet and OMCS may associate character with personality or personality with identity, the knowledge that the phrase who I was may refer to identity is beyond the explicit knowledge present in these resources.
This sheds light on the results in Table 3: when the knowledge required to tackle the inference problem at hand is available in the external resource, our adapter-based knowledge-injected models significantly outperform the baseline transformer; otherwise, the benefits of knowledge injection are negligible or non-existent. The promising results on the World Knowledge and Named Entities portions of the Diagnostics dataset suggest that our method does successfully inject external information into the pretrained transformer, and that the presence in the external resources of the knowledge required by the task is an obvious prerequisite.

Table 4: NLI Diagnostics examples (premise / hypothesis) from the World Knowledge and Common Sense categories, with the ConceptNet link (if any) connecting premise and hypothesis.

World:
- The sides came to an agreement after their meeting in Stockholm. / The sides came to an agreement after their meeting in Sweden. (stockholm [partOf] sweden)
- Musk decided to offer up his personal Tesla roadster. / Musk decided to offer up his personal car. (roadster [isA] car)
- The Sydney area has been inhabited by indigenous Australians for at least 30,000 years. / The Sydney area has been inhabited by Aboriginal people for at least 30,000 years. (indigenous [synonymOf] aboriginal)

Common Sense:
- My jokes fully reveal my character. / If everyone believed my jokes, they'd know exactly who I was.
- The systems thus produced are incremental: dialogues are processed word-by-word, shown previously to be essential in supporting natural, spontaneous dialogue. / The systems thus produced support the capability to interrupt an interlocutor mid-sentence.
- He deceitfully proclaimed: "This is all I ever really wanted." / He was satisfied.

Conclusion
We presented two simple strategies for injecting external knowledge from ConceptNet and the OMCS corpus, respectively, into BERT via bottleneck adapters. The additional adapter parameters store the external knowledge, while the rich distributional knowledge acquired in BERT's pretraining is preserved in the original, frozen transformer parameters. We demonstrated the effectiveness of these models in language understanding settings that require precisely the type of knowledge found in ConceptNet and OMCS, in which our adapter-based models outperform BERT by up to 20 performance points. Our findings stress the importance of detailed analyses that compare (a) the types of knowledge found in the external resources being injected against (b) the types of knowledge that a concrete downstream reasoning task requires. We hope this work motivates further research in the direction of fine-grained knowledge typing, both of the explicit knowledge in external resources and of the implicit knowledge stored in pretrained transformers.