On-The-Fly Information Retrieval Augmentation for Language Models

Here we experiment with the use of information retrieval as an augmentation for pre-trained language models. The text corpus used in information retrieval can be viewed as a form of episodic memory which grows over time. By augmenting GPT 2.0 with information retrieval we achieve a zero-shot 15% relative reduction in perplexity on the Gigaword corpus without any re-training. We also validate our IR augmentation on an event co-reference task.


Introduction
We are interested in exploring the value of long-term episodic memory in language modeling. For example, a language model can be used in January to assign a probability distribution over the statements that will appear in the newspaper in March. But one month later, in February, the distribution over the predictions for March should be updated to take into account factual developments since the previous prediction. Long-term episodic memory should be taken into account when assigning a probability to a statement.
Here we take a simple approach in which a pretrained GPT language model (Radford et al., 2018a, 2019) is zero-shot augmented with an episodic memory consisting simply of a corpus of past news articles. Conceptually the past news articles are viewed as additional training data which can be legitimately accessed when evaluating on future text. In our most basic experiment we calculate the probability of a future article by first calculating the probability of its first k sentences using the pre-trained GPT model. We then use the first k sentences as a query in an information retrieval system to extract a relevant past article. We then insert the past article following the first k sentences when calculating the probability of the remainder of the future article using the same pre-trained GPT model. This is a zero-shot augmentation in the sense that there is no additional training or fine-tuning of the pre-trained model. Our results show that this augmentation significantly reduces perplexity. We also present various other experiments, including results on fine-tuning the model in the presence of the memory and the effect of this memory on event co-reference.

Related Work
Various language models have utilized external knowledge or long contexts (Paperno et al., 2016; Yang and Mitchell, 2017; Peng et al., 2019; Khandelwal et al., 2018; Ghosh et al., 2016; Lau et al., 2017; Grave et al., 2016; Parthasarathi and Pineau, 2018). But these papers do not address the question of whether additional context or external knowledge is useful as a zero-shot augmentation of large scale pre-trained NLP models.
The value of external knowledge has previously been demonstrated for NLP tasks such as natural language inference (Yang et al., 2019), language generation (Parthasarathi and Pineau, 2018), knowledge base completion (Toutanova et al., 2015; Das et al., 2017) and question answering (Sun et al., 2018, 2019; Dhingra et al., 2017). However, all of those prior works assume a small model trained from scratch.
As large-scale pre-trained models have become more powerful, it is not immediately clear whether external resources can still add value. The only work we know of on using external resources in modern large-scale models is Yang et al. (2019), where a human-curated external lexical resource is used to improve BERT.
Our approach bears some resemblance to neural cache models (Grave et al., 2016). However, neural cache models store past hidden states as memory and access them through a dot product with the current hidden states. This is different from retrieving knowledge from a corpus-sized memory.
Our approach is also somewhat related to memory networks (Weston et al., 2014). Memory networks have a memory module which can be learnt jointly with other components. They have shown success in applications such as machine reading comprehension (Kumar et al., 2016a,b; Shi et al., 2016) and visual question answering (Na et al., 2017; Ma et al., 2018; Su et al., 2018). Significant progress in memory networks has been achieved in both architecture (Chandar et al., 2016; Miller et al., 2016; Gulcehre et al., 2017) and model scale (Rae et al., 2016; Lample et al., 2019).
Several papers have formulated, and experimented with, scalable memory networks — memory networks that employ some method of efficiently reading and writing to very large neural memories. This is done with approximate nearest neighbor methods in Rae et al. (2016) and with product keys in Lample et al. (2019). These large memories are used to provide additional model capacity, where the memory contents are trained over a large data set using gradient descent, just as one would train the parameters of a very large network. It is shown in Lample et al. (2019) that it is possible to insert a large memory as a layer in a transformer architecture, resulting in a model where the same number of parameters and the same performance can be achieved with half the layers and with much faster training time than a standard transformer architecture. Here, however, we are proposing zero-shot augmentation with an external data source used as an episodic memory.
The use of key-value memories in Miller et al. (2016) is particularly similar to our model. Key-value memories were used there to treat a corpus of Wikipedia movie pages as a memory for answering questions about movies. As in our system, articles were extracted using word-based information retrieval. Each article was encoded as a vector which was then given to a question answering architecture. This was shown to improve on automated knowledge base extraction from the same corpus but was still not competitive with human-curated knowledge graphs for movies. Here we give the text of the retrieved article directly to the language model architecture and focus on augmenting large scale language models.

Model
We use the pre-trained transformer GPT 2.0 (Radford et al., 2019). Let W_w and W_p be the subword and position embeddings respectively, and let M denote the total number of layers. For the token s_t at time step t, the m-th layer's hidden state h_t^m is given by:

h_t^0 = W_w[s_t] + W_p[t]
h_t^m = TB(h_{<=t}^{m-1}),  m = 1, ..., M

where TB stands for Transformer Block. We use the last layer's hidden state h_t^M as the representation H_t for the token at time step t. We augment GPT 2.0 with a large episodic memory component; the overall architecture is shown in Figure 1. For a sequence S with T tokens, let S_1, ..., S_p be the tokens of the first k sentences, and let C be a sequence (article) retrieved from memory using the first k sentences as the query. The vector H_t is:

H_t = GPT(S_1, ..., S_t)                        if t <= p
H_t = GPT(S_1, ..., S_p, C, S_{p+1}, ..., S_t)  if t > p

That is to say, for the first k sentences, we directly feed them to GPT to obtain their representations. For the remaining sentences, the representations are conditioned on both the first k sentences and the retrieved context C.
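The conditioning scheme above can be sketched in a few lines: building the augmented input amounts to splicing the retrieved article C between the first-k-sentence prefix and the remainder of the document, while only the original document's tokens are scored by the language model. The function and variable names below are our own illustration, not from the paper.

```python
def build_augmented_input(prefix_ids, context_ids, remainder_ids):
    """Concatenate the first-k-sentence prefix S_1..S_p, the retrieved
    article C, and the remainder S_{p+1}..S_T into one GPT input.

    The retrieved context conditions the remainder but contributes no
    probability terms itself, so we also return a mask marking which
    positions belong to the test document and should be scored.
    """
    full = prefix_ids + context_ids + remainder_ids
    scored = ([True] * len(prefix_ids)
              + [False] * len(context_ids)
              + [True] * len(remainder_ids))
    return full, scored
```

In a real run, `full` would be fed to the pre-trained GPT 2.0 model unchanged (zero-shot), and the document log-probability would be accumulated only at the `scored` positions.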

Experiments
We focus on two tasks: document-level language modelling and event co-reference. In both tasks we take a document as input and use its first k sentences to query the memory. To calculate the perplexity of a document, we compute its log-probability by summing the byte-level log-probabilities (i.e., multiplying the byte-level probabilities), then divide the log-probability by the actual word count of the query document. We use Gigaword (Parker et al., 2011) as both our language modeling test set and as our external memory. Gigaword contains news from different sources such as the NY Times and XinHua News. For language modelling we use the NY Times portion because it is written by native English speakers. Since GPT 2.0 is trained on Common Crawl, which contains news collections starting from 2008, we use Gigaword articles collected prior to 2008 to avoid testing on GPT-2 training data. For the pre-trained language model we use GPT 2.0 (Radford et al., 2019), which comes in three pre-trained sizes: GPT Small, Medium and Large.
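This word-normalized perplexity can be sketched as follows, assuming the per-byte log-probabilities have already been obtained from the language model (the function name is ours):

```python
import math

def word_level_perplexity(byte_logprobs, word_count):
    """Word-normalized perplexity from byte-level log-probabilities.

    Summing log-probabilities is equivalent to multiplying the byte-level
    probabilities; normalizing by the document's word count makes the
    number comparable across different byte/subword tokenizations.
    """
    total_logprob = sum(byte_logprobs)  # log P(document)
    return math.exp(-total_logprob / word_count)
```

For example, four bytes each with probability 0.5 over a two-word document give a word-level perplexity of 4.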
For information retrieval we use Lucene due to its simplicity. Given a query document we first do sentence and word tokenization and then use the first k sentences to retrieve the top 20 documents under the default TF-IDF distance metric provided by Lucene. Since too-distant document pairs are uninformative and too-related document pairs tend to be duplicates of the test article, we further filter these top-ranked documents by time stamp, news source and cosine similarity. More specifically, we choose the highest-ranked retrieved document that simultaneously satisfies the following three conditions: it comes from a different news source; it appears earlier than, but within a two-week window of, the test document; and its bag-of-words cosine similarity to the test document is no larger than 0.6α, where α is the largest bag-of-words cosine similarity between the test article and any retrieved article. To support fine-tuning experiments we constructed a corpus of pairs of a query article and a cached retrieved document. (Our implementation uses pytorch-transformers: https://github.com/huggingface/pytorch-transformers.) We split the dataset into train/dev/test by the query document's time stamp; the train/dev/test sizes are 79,622 / 16,927 / 8,045. For zero-shot experiments we use the test set of 8,045 articles. We experiment with k ∈ {1, 2, 5}.
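The three filtering conditions can be sketched as follows, assuming the candidates arrive in Lucene rank order with their metadata pre-computed; the dictionary keys and function name are illustrative, not from the paper:

```python
def select_context(candidates, query_source):
    """Choose the highest-ranked retrieved article that passes all three
    filters. Each candidate dict carries its news 'source', how many days
    'days_earlier' it appeared before the test article, and its
    bag-of-words 'cosine' similarity to the test article.
    """
    if not candidates:
        return None
    # alpha: largest cosine similarity among all retrieved articles.
    alpha = max(c["cosine"] for c in candidates)
    for c in candidates:  # assumed to be in Lucene rank order
        if (c["source"] != query_source          # different news source
                and 0 < c["days_earlier"] <= 14  # earlier, within two weeks
                and c["cosine"] <= 0.6 * alpha): # not a near-duplicate
            return c
    return None
```

Note that the 0.6α threshold is relative to the retrieved pool, so a pool containing one near-duplicate (high α) still admits moderately similar articles.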
To check the quality of the query-retrieved pairs, we randomly sample 100 pairs from the dev set and compute the bag-of-words cosine similarity between the two documents; the mean cosine similarity is 0.15. We also manually inspect them: we ask two NLP researchers to independently annotate each query-retrieved pair as "BAD" or "OK", where a pair is "BAD" if the two documents are near-duplicates or totally unrelated, and "OK" otherwise. Among the 100 pairs, 83 are "OK" and 17 are "BAD" due to irrelevance. The Cohen's kappa coefficient between the two annotations is 0.94.
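For reference, the agreement statistic is standard Cohen's kappa over the two annotators' labels; a minimal implementation (ours, not from the paper) looks like:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items
    (here the labels would be "OK" / "BAD")."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items the annotators label the same.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```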

Language modelling
For language modeling we try zero-shot memory augmentation, fine-tuned memory augmentation, and training a small memory-augmented network from scratch. When training, we use the Adam optimizer from GPT 1.0 (Radford et al., 2018b) with learning rate 0.001, weight decay 0.01, and warm-up proportion 0.1. For other parameters, we use the default values from GPT 2.0. Fine-tuning on Gigaword takes less than one day on a single GPU.
Zero-shot and fine-tuning results Following Radford et al. (2019), we first evaluate our model on Gigaword in the zero-shot setting and then fine-tune the model. The results are given in Table 2, which reports perplexity for the zero-shot (top 3 rows) and fine-tuning (last row) settings with different values of k used to retrieve the context (woc: without retrieved context).
From Table 2, we see that with additional context retrieved from episodic memory, all GPT model sizes obtain significantly lower perplexity than the original GPT 2.0. Fine-tuning the model with context reduces perplexity further; we fine-tune only GPT Small due to GPU memory constraints. Preliminary analysis indicates that most of the perplexity reduction comes at content words and semantically rich words whose prediction requires broader context. This is consistent with the phenomena found in Khandelwal et al. (2018). We further find that smaller k leads to slightly worse retrieval quality; however, more of the remaining sentences then benefit from the retrieved context. Since Gigaword contains newswire, the first several sentences usually serve as informative summaries, so overall smaller k results in lower perplexity.
Train from scratch We also investigate training this form of memory-augmented model from scratch on our query-retrieved pairs. For these experiments we train smaller transformers; the results are given in Table 3. From Table 3, we see that additional context still helps and that we can get decent perplexity even with quite small models.

When context is irrelevant We also evaluate our method on Wikitext-2/103, in which the retrieved context is irrelevant due to the domain difference between Wikipedia and Gigaword. In this case, we use the top-ranked document from Gigaword as the reference.

Event Co-reference
Intuitively, episodic memory is useful because it contains information about the particular events mentioned in the test document. With this in mind we evaluate our approach on the event co-reference dataset ECB+ (Cybulska and Vossen, 2014). ECB+ contains 982 documents clustered into 43 topics, and has two evaluation settings: coreferring mentions occurring within a single document (within document) or across a document collection (cross document). For the event co-reference pipeline, we follow the joint modeling method of Barhom et al. (2019), which jointly represents entity and event mentions with various features and learns a pairwise mention/entity scorer for coreference classification. We augment their mention features with the mention's vector representation extracted from either GPT 2.0 or our zero-shot augmented GPT 2.0. For event co-reference, we use the whole test document to retrieve the context from Gigaword. The results are given in Table 5, which compares against the following baselines: a 2018 system that adds a clustering-oriented regularization term; CV: Cybulska and Vossen (2015), which adds a feature calculated from an "event template"; and JM: Barhom et al. (2019). ♣: we also feed the retrieved context to GPT to get the representation.

Conclusion
In this paper we propose a method to augment a pre-trained NLP model with a large episodic memory. Unlike previous work, we use information retrieval to handle a large external corpus of text and feed retrieved documents directly to the language model. Evaluation results on language modelling and event co-reference show the promise of our method. To the best of our knowledge, this is the first work that augments pre-trained NLP models with a large episodic memory. In principle, the memory-augmented GPT-2 can be used as a variant of GPT-2 for any downstream task, such as the GLUE tasks, although we have not experimented with that here.