BIG MOOD: Relating Transformers to Explicit Commonsense Knowledge

We introduce a simple yet effective method of integrating contextual embeddings with commonsense graph embeddings, dubbed BERT Infused Graphs: Matching Over Other embeDdings. First, we introduce a preprocessing method to improve the speed of querying knowledge bases. Then, we develop a method of creating knowledge embeddings from each knowledge base. We introduce a method of aligning tokens between two misaligned tokenization methods. Finally, we contribute a method of contextualizing BERT after combining with knowledge base embeddings. We also show BERT's tendency to correct lower accuracy question types. Our model achieves a higher accuracy than BERT, and we score fifth on the official leaderboard of the shared task and score the highest without any additional language model pretraining.


Introduction
Recently, wide-scale pre-training and deep contextual representations have taken the world by storm. Peters et al. (2018) underscored the importance of bidirectional contextual representations by using traditional neural networks trained on a large corpus of text. Devlin et al. (2018) used transformers (Vaswani et al., 2017) and word masking to pre-train on another large corpus of data, reporting human-level performance on one commonsense dataset (Zellers et al., 2018). A Transformer-XL based model achieves state-of-the-art results on RACE (Lai et al., 2017).
The key to the success of many of these models is their ability to train on extremely large datasets. BERT (Devlin et al., 2018), for example, trains on the BooksCorpus (Zhu et al., 2015) and English Wikipedia, for a combined 3,200M words. Other iterations, such as RoBERTa (Liu et al., 2019), increased the amount of data used during pre-training. Training large-scale models on these massive datasets has drawbacks, such as significantly increased carbon pollution and harm to the environment (Schwartz et al., 2019; Strubell et al., 2019).
We present a methodology for combining queries from commonsense knowledge bases with contextual embeddings: BIG MOOD (BERT Infused Graphs: Matching Over Other embeDdings), abbreviated for its relationship to human knowledge awareness. Our methodology achieves an increase in accuracy without significant additional fine-tuning or pre-training. Instead, it learns a separate representation from commonsense graphical knowledge bases, and augments the BERT representation with this learned explicit representation. We introduce several methods of combining and querying knowledge base embeddings to introduce them to the BERT embedding layers.

Related Work

Knowledge Graphs
Significant research has been put into representing human knowledge in various ways (Lenat and Guha, 1989; Auer et al., 2007; Chambers and Jurafsky, 2008). ConceptNet (Speer and Havasi, 2013) contains various aspects of commonsense knowledge through a knowledge graph. The knowledge is collected from crowdsourced resources (Meyer and Gurevych, 2012; Havasi et al., 2010; von Ahn et al., 2006) and expert-created resources (Miller, 1992; Breen, 2004). WebChild (Tandon et al., 2017) is a collection of commonsense knowledge automatically extracted from web contents. The database is constructed similarly to ConceptNet, and intended to cover concepts that ConceptNet does not cover. ATOMIC (Sap et al., 2018) focuses on inferential If-Then relations, built for everyday commonsense reasoning.

Table 1: Example of a prompt from the shared task dataset, an everyday commonsense reasoning dataset. Questions often require script knowledge that extends beyond referencing the text.
Passage: I had decided that I wanted to visit my friend Paul whom lives quite a distance away. With this and my fear of air travel in mind I decided to take a train. After researching and finding one online I was well on my way to going to see my friend Paul. I drive to the station and decide that I am going to purchase a round trip ticket as this would be cheaper than just buying both tickets separately. Whenever my train arrives I have to get in line as they process our tickets. After all this is done I decide to take a seat by the window. I sit and fall asleep a bit as I ride on the train for hours. After a couple hours we finally reach the destination and I get off the train, excited to see my friend.
When did they wait for their train?
a) before buying the ticket
b) after buying a ticket

Knowledge Integration
Knowledge graphs have been applied in various natural language processing applications, such as reading comprehension (Lin et al., 2017; Yang and Mitchell, 2017) and machine translation (Zhang et al., 2017). ERNIE: Enhanced Representation through Knowledge Integration (Sun et al., 2019) appends knowledge to the input of the model and learns via knowledge masking, as well as entity-level masking and phrase-level masking. TriAN (Wang, 2018), the top public model on the MC-Script (Ostermann et al., 2018) shared task, uses ConceptNet embeddings to highlight relationships between the question, text, and answer.

Model
We present our model for this shared task. Our model has three major components: language model adaptation, knowledge graph embeddings, and attention for classification.

Data Preprocessing
Before model usage, we preprocess the data in two ways to make it easier for the model to understand. For language modeling, we create training data similar to that used for BERT (Devlin et al., 2018). For knowledge graph use, we preprocess the language to build a vocabulary of commonsense objects and relationships and to match as many related commonsense objects as possible.

Language Model Preprocessing
We preprocess each passage for training. We repeat this process for each training epoch, since it allows for the densest pretraining framework.
Devlin et al. (2018) introduced a framework that pretrains transformers (Vaswani et al., 2017) on masked token prediction, commonly known as a cloze task. First, we tokenize the text into WordPiece tokens (Wu et al., 2016). Then, we add the special tokens [CLS] and [SEP] to each datum: [CLS] at the beginning of each datum, and [SEP] to separate the question from the answer. Then, we randomly mask 15% of all WordPiece tokens. Unlike Devlin et al. (2018), we run the randomization script once per training epoch. Otherwise, we follow the procedure in Devlin et al. (2018). 80% of the time, we replace the word with the [MASK] token, to be filled in through cloze task prediction. 10% of the time, we replace the word with a random word. 10% of the time, we keep the word unchanged.
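The masking step can be illustrated with a minimal sketch. It assumes that each non-special token is selected independently with probability 0.15 (an approximation of masking 15% of tokens), and the names mask_tokens, vocab and the per-epoch seeding are hypothetical, not taken from our code.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels). labels[i] holds the original token at
    every selected position and None everywhere else."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never masked; other tokens are selected at random.
        if tok in (CLS, SEP) or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:            # 80%: replace with [MASK]
            masked[i] = MASK
        elif roll < 0.9:          # 10%: replace with a random word
            masked[i] = rng.choice(vocab)
        # remaining 10%: keep the word unchanged
    return masked, labels

question = "when did they wait for their train".split()
answer = "after buying a ticket".split()
tokens = [CLS] + question + [SEP] + answer
vocab = sorted(set(question + answer))

# Re-run the randomization once per training epoch (assumed seeding scheme).
for epoch in range(2):
    masked, labels = mask_tokens(tokens, vocab, random.Random(epoch))
```

Because the randomization is rerun each epoch, the model sees a different set of masked positions for the same prompt in every pass over the small dataset.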
Combined with the above cloze task, we process the data for next sentence prediction. We do this process after the cloze task masking, similar to Devlin et al. (2018). For each datum, we randomly pick either a sentence labeled correctly as the next sentence 50% of the time, or a random sentence 50% of the time. We ensure that the random sentence is not the next sentence.
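A minimal sketch of the next sentence sampling follows. The function name make_nsp_example is hypothetical, and excluding the current sentence itself from the random candidates is an added assumption; the text only requires that the random sentence is not the true next sentence.

```python
import random

def make_nsp_example(sentences, i, rng):
    """Pair sentences[i] with its true successor (label 1) half of the time,
    and with a random sentence that is not the successor (label 0) otherwise.
    Requires i < len(sentences) - 1 and at least three sentences."""
    if rng.random() < 0.5:
        return sentences[i], sentences[i + 1], 1
    # Assumption: the current sentence is also excluded from the candidates.
    candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
    return sentences[i], sentences[rng.choice(candidates)], 0

sentences = [
    "I drive to the station.",
    "I purchase a round trip ticket.",
    "I get in line as they process our tickets.",
    "I take a seat by the window.",
]
first, second, label = make_nsp_example(sentences, 1, random.Random(0))
```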

Knowledge Graph Processing
We preprocess the shared task data together with the knowledge graphs. This procedure serves three purposes: it reduces the number of items in the knowledge graph, it speeds up fine-tuning (the knowledge graphs are extremely large), and it ensures that we match as many relevant types of knowledge graph edges as possible.
First, we create an index of (start, end, edge) relationships that match vocabulary within the shared task prompt. For each (start, end, edge), we check to see if there are any matching prompts in which start is present in the text and end is present in the text. If so, we store the (start, end, edge), and note the edge as a relationship. We also index the relationship (edge), giving an index for each unique relationship.

[Figure caption; its opening is missing from the source] ... (Vaswani et al., 2017). Since the queries work on whole words only, one knowledge base embedding may be integrated with one or more language embeddings. Several self-attention encoding layers are used.
For longer sequences, we allow matches between any trigram, and store an index for each trigram matched. In addition, we stem words beforehand, to ensure that different word endings do not affect the result of the matches. We use the Porter Stemmer (Porter, 1980) to stem each word in both the text and the knowledge graph. Note that we use the stemming only to match different words, and do not keep the stemmed words for later use in the process, so as to keep comparability between embedding types. We also stem words in the knowledge bases, to allow for comparison. Algorithm 1 shows our process for matching sequences.
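The single-word case of this indexing can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Algorithm 1: toy_stem is a hypothetical stand-in for the Porter stemmer that strips only a few suffixes, the edges are made-up examples, and trigram matching for longer sequences is left out.

```python
def toy_stem(word):
    """Hypothetical stand-in for the Porter stemmer (strips a few suffixes)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_edge_index(edges, prompt_tokens, stem=toy_stem):
    """Keep the (start, end, edge) triples whose stemmed start and end both
    occur in the stemmed prompt, and give each unique edge label an index."""
    prompt_stems = {stem(tok) for tok in prompt_tokens}
    kept, relation_ids = [], {}
    for start, end, edge in edges:
        if stem(start) in prompt_stems and stem(end) in prompt_stems:
            # Stems are used for matching only; the original words are stored.
            kept.append((start, end, edge))
            relation_ids.setdefault(edge, len(relation_ids))
    return kept, relation_ids

edges = [
    ("train", "station", "AtLocation"),
    ("ticket", "trains", "UsedFor"),
    ("plane", "airport", "AtLocation"),
]
prompt = "I drive to the station and buy tickets for the train".split()
kept, relation_ids = build_edge_index(edges, prompt)
# kept holds the first two triples; relation_ids is {"AtLocation": 0, "UsedFor": 1}
```

Filtering the graph against the prompt vocabulary once, ahead of time, is what keeps the later per-token lookups small.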

Knowledge Graph Usage
We query each of the three knowledge bases to create one embedding layer per knowledge graph, with an entry for each word. Here, we describe our procedure for querying each knowledge graph. We stem words beforehand, to allow for matches that are agnostic of word suffixes (Merkhofer et al., 2018).

ConceptNet
ConceptNet (Speer and Havasi, 2013) represents everyday words and phrases, with edges representing the commonsense relationships between them. We first preprocess ConceptNet, keeping only the vocabulary present in the shared task. Then, for each edge, we store a tuple (agent, dependent, relationship) that describes the commonsense relationship mentioned in the knowledge graph.
During fine-tuning, we check the text for any (agent, dependent) pairs that are present. If any word in the text is an agent, and the dependent is present in the text, we add that relationship index as input into the embedding layer. (For agents that span more than one word, such as the phrase "apple pie", we apply the index to the first word, as long as the entire phrase is found in the text.) We randomly generate a length 10 embedding for each relationship, and if more than one relationship is matched, we randomly pick one.
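The lookup described above can be sketched as follows. The helper names, the uniform initialization range, and the example edges are assumptions made for illustration; stemming is omitted to keep the example short.

```python
import random

EMBED_DIM = 10  # embedding length stated in the text

def init_relation_embeddings(num_relations, rng, dim=EMBED_DIM):
    """Randomly initialize one vector per relationship (the range is assumed)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_relations)]

def find_phrase(tokens, phrase):
    """Start positions at which the word sequence `phrase` occurs in tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def relation_features(tokens, edges, relation_ids, table, rng):
    """Return one relationship vector (or None) per token."""
    matches = [[] for _ in tokens]
    for agent, dependent, relationship in edges:
        if not find_phrase(tokens, dependent.split()):
            continue  # the dependent must also be present in the text
        for pos in find_phrase(tokens, agent.split()):
            # A multi-word agent is indexed at its first word.
            matches[pos].append(relation_ids[relationship])
    # If several relationships match one token, pick one at random.
    return [table[rng.choice(m)] if m else None for m in matches]

rng = random.Random(0)
relation_ids = {"IsA": 0, "AtLocation": 1}
table = init_relation_embeddings(len(relation_ids), rng)
tokens = "she baked an apple pie for dessert".split()
edges = [("apple pie", "dessert", "IsA"), ("apple pie", "oven", "AtLocation")]
features = relation_features(tokens, edges, relation_ids, table, rng)
# Only position 3 ("apple") receives a vector, the one for "IsA".
```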

WebChild
WebChild (Tandon et al., 2017) is a large collection of commonsense knowledge collected from various sources on the web. The format is similar to ConceptNet, which allows us to follow a similar process. WebChild instances are split into the categories part-whole, comparative, property, activity, and spatial. For each category, we capture the (agent, dependent, relationship) tuple, which is usually defined through properties such as x_disambi, y_disambi, and sub-relation, but is slightly different for each category. We ignore the WordNet (Miller, 1992) sense annotation (some categories contain subjects such as bike#n#1) and take only the stemmed word. For fine-tuning, we follow the same procedure as ConceptNet, creating an additional 10-length embedding for each word.

ATOMIC
ATOMIC (Sap et al., 2018) is a resource that focuses on inferential knowledge via If-Then relations. ATOMIC separates its relationships into nine different types (xNeed, xIntent, xAttr, xEffect, xReact, xWant, oEffect, oReact, oWant). For each of the nine categories, for each datum in the given category, we search our text for relationships that match the defined If-Then relationship. Since each relationship is nearly a full sentence, we count any trigram shared between the given datum and the text as a match. Then, we append an index in [0, 8] to the embedding layer of the first word in the selected trigram based on the type of relationship matched. For fine-tuning, we follow the same procedure as ConceptNet and WebChild, creating an additional 10-length embedding for each word.
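A minimal sketch of the trigram matching is given below. The ordering of ATOMIC_TYPES (and therefore which type maps to which index), the helper names, and the example entries are assumptions; stemming is again omitted.

```python
# The nine ATOMIC relation types; the index order is an assumption.
ATOMIC_TYPES = ["xNeed", "xIntent", "xAttr", "xEffect", "xReact",
                "xWant", "oEffect", "oReact", "oWant"]

def first_trigram_positions(tokens):
    """Map each trigram to the position of its first occurrence."""
    positions = {}
    for i in range(len(tokens) - 2):
        positions.setdefault(tuple(tokens[i:i + 3]), i)
    return positions

def match_atomic(text_tokens, atomic_entries):
    """atomic_entries is a list of (entry_tokens, relation_type) pairs.
    Returns {token_position: type_index}, tagging the first word of the
    first trigram that an entry shares with the text."""
    text_positions = first_trigram_positions(text_tokens)
    tagged = {}
    for entry_tokens, relation_type in atomic_entries:
        for trigram in first_trigram_positions(entry_tokens):
            if trigram in text_positions:
                tagged.setdefault(text_positions[trigram],
                                  ATOMIC_TYPES.index(relation_type))
                break  # one matched trigram per entry is enough
    return tagged

text = "i decide to take a train to see my friend".split()
entries = [
    ("personx wants to take a train".split(), "xWant"),
    ("personx eats an apple".split(), "xEffect"),
]
tagged = match_atomic(text, entries)
# The shared trigram ("to", "take", "a") starts at position 2 in the text.
```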

Architecture
Our modeling procedure consists of three parts. First, we query each knowledge graph, allowing us to create embeddings for each specific graph. Then, we describe our word-level knowledge fusion procedure, creating augmented embeddings for each word. Finally, we describe our fine-tuning procedure for the shared task dataset. We modify pytorch-transformers (https://github.com/huggingface/pytorch-transformers).

Language Model Fine-Tuning
In contrast to Devlin et al. (2018), we perform language model fine-tuning in addition to classification fine-tuning. We find that this generally provides better results, and allows for more stable accuracy since the shared task involves a small dataset. For each prompt, we use the previously preprocessed data to create tasks for our model to predict. We do this before token realignment, so it happens before any extra knowledge graph embeddings are added to the model architecture. For masked tokens, we predict the token through bidirectional context, the same as Devlin et al. (2018). For next sentence prediction, we use the unbiased method introduced above, as in Devlin et al. (2018).

Token Realignment
We perform a word-level fusion to incorporate knowledge embeddings into the BERT model. First, we collect word embeddings from BERT. We sum the last four layers of BERT together, as suggested by "The Illustrated BERT, ELMo, and co.". We fuse these embeddings with the embeddings gathered from querying each of the three databases. For each word, we take the dyadic product, or linear fusion, of the contextual BERT embeddings with the concatenation of the three graph embeddings. When there is no related embedding (if the word did not match any edges during querying, or if the word is a BERT-specific token such as [CLS]), we do not do any dyadic fusion. Finally, to get a single linear layer, we concatenate each dimension of the result of the dyadic fusion with the original BERT embedding. Algorithm 2 shows a detailed explanation of our token realignment process.
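The fusion for a single token can be sketched with tiny vectors. The function names are hypothetical, and filling the fused part with zeros for unmatched tokens is an assumption made here so that every token yields a vector of the same length; the text above only states that no dyadic fusion is done in that case.

```python
def sum_last_four(layers):
    """Element-wise sum of the last four layer outputs for one token."""
    return [sum(values) for values in zip(*layers[-4:])]

def fuse(bert_vec, kb_vecs, kb_dim):
    """Flattened dyadic (outer) product of the BERT vector with the
    concatenated graph vectors, followed by the original BERT vector."""
    if kb_vecs is None:
        # Assumption: unmatched tokens get zeros in place of the fused part.
        fused = [0.0] * (len(bert_vec) * kb_dim)
    else:
        kb = [x for vec in kb_vecs for x in vec]   # concatenate the three graphs
        fused = [b * k for b in bert_vec for k in kb]
    return fused + list(bert_vec)

# Toy sizes: a 2-dimensional "BERT" vector and 1-dimensional graph vectors.
layers = [[9.0, 9.0], [0.25, 0.5], [0.25, 0.5], [0.25, 0.5], [0.25, 0.5]]
bert_vec = sum_last_four(layers)                   # [1.0, 2.0]
matched = fuse(bert_vec, [[0.5], [1.0], [2.0]], kb_dim=3)
unmatched = fuse(bert_vec, None, kb_dim=3)
```

With the sizes used in this paper (three graph embeddings of length 10), each BERT dimension is paired with 30 graph dimensions, so the fused part is 30 times the length of the BERT embedding.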

Re-Attention
To get a final result, we perform a few more steps. First, we apply a single layer of self-attention over the text, allowing each of the word-level embeddings to interact with one another. For this self-attention, we follow the same process as in Vaswani et al. (2017). We compare each token with every other token and perform token-level fusion to learn an attention embedding layer. Then, we use the sequence embedding for classification. We add a simple linear layer over the sequence embedding for classification, and apply a softmax over the given choices. Note that we do not freeze any weights along the process, allowing the transformer and perceptron to be fine-tuned during this process. We also allow the knowledge embeddings to be modified through this back-propagation. Hyperparameters are noted in Section 4.1. We also ablate our use of this extra attention layer, showing that it is important to learn comparisons between knowledge embeddings. For BERT baselines, we use the process in Devlin et al. (2018), and use the [CLS] token, without attention, for classification.

Algorithm 2: Pseudocode for the token realignment algorithm, a method of finding token alignments between two different sequences.
token_realignment(seq_1, seq_2):
    alignment_dict = dict
    seq_1_i = 0
    seq_2_i = 0
    while seq_1_i < len(...
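Only the opening lines of Algorithm 2 are available here, so the following is a sketch of one way such a realignment could work rather than the algorithm itself. It reuses the variable names from the surviving fragment, and it assumes that seq_1 holds whole words, that seq_2 holds lowercased WordPiece tokens with "##" continuation markers, and that special tokens have been removed.

```python
def token_realignment(seq_1, seq_2):
    """Map each whole-word index in seq_1 to the WordPiece indices in seq_2
    whose concatenation spells that word."""
    alignment_dict = {}
    seq_1_i, seq_2_i = 0, 0
    while seq_1_i < len(seq_1) and seq_2_i < len(seq_2):
        target = seq_1[seq_1_i].lower()
        built, span = "", []
        # Consume pieces until they cover the whole word.
        while seq_2_i < len(seq_2) and len(built) < len(target):
            piece = seq_2[seq_2_i]
            built += piece[2:] if piece.startswith("##") else piece
            span.append(seq_2_i)
            seq_2_i += 1
        if built != target:
            raise ValueError(f"cannot align {seq_1[seq_1_i]!r} with {built!r}")
        alignment_dict[seq_1_i] = span
        seq_1_i += 1
    return alignment_dict

words = ["I", "researched", "trains"]
pieces = ["i", "research", "##ed", "trains"]
alignment = token_realignment(words, pieces)
# {0: [0], 1: [1, 2], 2: [3]}
```

An alignment of this kind lets one whole-word knowledge embedding be attached to one or more sub-word BERT embeddings.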

Hyperparameter Tuning
For hyperparameter tuning with BERT, we find that grid search is the best method. We tune various hyperparameters, including batch size, learning rate, warmup, and epoch count (for hyperparameter details, see appendix). Figure 2 shows the results of several hyperparameter settings on BERT with our additional knowledge bases. We find that B. MOOD seems to correct its deficiencies as it gets closer to the maxima. Interestingly, B. MOOD seems to be naturally good at "What" questions, which commonly require commonsense inference. This could be explained by the effect of the commonsense knowledge graphs, showing that it is picking up on commonsense attributes. However, for "Where" questions, which require more information from the text, B. MOOD needs to learn and thus experiences a greater gain as the accuracy gets closer to its maxima.

Figure 2: Example of B. MOOD accuracy across categories during hyperparameter tuning. Values to the right are closer to the maxima.

We also compare to TriAN (Wang, 2018), the previous state of the art.

Table 2: Results with B. MOOD on the task dev and test sets. "with all KB" describes results using all ConceptNet, WebChild, and ATOMIC embeddings. "Human" and "Regression Baseline" accuracy is from the shared task paper (Ostermann et al., 2018). [Table body missing from the source; the only surviving values are 83.3 and 80.7.]

For the knowledge graph embeddings, we use a size of 10 for each knowledge graph, combining for a size 30 knowledge graph embedding. We randomly initialize each embedding, and if there is more than one embedding for a token, we pick one at random (Wang, 2018). For BERT fine-tuning, we use a maximum sequence length of 450, a train batch size of 32, four epochs, a learning rate of 1e-5, and a 20% warmup.
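For reference, the reported settings can be collected into a configuration, together with a sketch of how a 20% warmup could shape the learning rate. The dictionary keys and the function learning_rate_at are hypothetical, and the schedule shape (linear warmup followed by linear decay) is an assumption; the text only states the warmup proportion.

```python
CONFIG = {
    "max_seq_length": 450,
    "train_batch_size": 32,
    "num_epochs": 4,
    "learning_rate": 1e-5,
    "warmup_proportion": 0.2,
    "kb_embedding_dim": 10,      # per knowledge graph
    "num_knowledge_graphs": 3,   # ConceptNet, WebChild, ATOMIC
}

def learning_rate_at(step, total_steps,
                     base_lr=CONFIG["learning_rate"],
                     warmup=CONFIG["warmup_proportion"]):
    """Assumed schedule: linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = warmup * total_steps
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Combined knowledge graph embedding size: 3 graphs x 10 dimensions = 30.
combined_kb_dim = CONFIG["kb_embedding_dim"] * CONFIG["num_knowledge_graphs"]
```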

Results
We show our results and give analysis for MOOD. We show that each of the knowledge bases helps the accuracy of our model, and our strongest model involves the union of all three knowledge bases. ConceptNet gives the largest increase, likely because it has the most matches with the prompts, since ConceptNet covers everyday concepts that are relatively more common. WebChild also gives a boost, but not as large as ConceptNet. ATOMIC gives the smallest boost, likely because 1) ATOMIC queries are the longest, and thus least likely to match, and 2) there is not as much inferential commonsense present.
We also note that the base B. MOOD accuracy is higher than the base accuracy of TriAN (Wang, 2018), the previous state of the art. By appending similar knowledge embeddings, we find that we can bring the TriAN accuracy up to 77.8%, which is more comparable with MOOD. This shows that the additional knowledge bases (ATOMIC, WebChild) contribute to the overall accuracy even without the contextual embeddings. However, we find that the knowledge bases combined with TriAN still do not provide an improvement above that of MOOD, and thus the knowledge bases alone are not enough to capture the necessary information. Instead, the knowledge graphs must be used in combination with contextual embeddings for the most effective model. This shows that BERT may lack the complete amount of information needed to understand this dataset. We also show that the attention layer is needed to understand the knowledge graphs alongside BERT, showing the importance of learning the different knowledge base embeddings within the text. This highlights that using the knowledge base embeddings is helpful, and that comparisons between different sections of text are also helpful for reading comprehension tasks.

Conclusion
We introduce a method of fine-tuning with graphical embeddings alongside contextual embeddings, MOOD. Our method uses three different knowledge bases, and introduces ways of improving both learning speed and knowledge embedding effectiveness. First, we preprocess the dataset, showing that both language model preprocessing and knowledge graph preprocessing are important to the final result. Then, we tune our language model on the shared task, stabilizing the hyperparameter search. We create knowledge graph embeddings and concatenate the embeddings via token realignment. Then, we introduce a final layer of attention that learns both contextual and explicit graph embeddings through contextualization. We show the effect of various knowledge bases, and show our accuracy across various question types. Our model places fifth on the task leaderboard and outperforms BERT across all question types. We hope that this investigation motivates and furthers additional research in combining commonsense knowledge awareness with transformers.