Repeat before Forgetting: Spaced Repetition for Efficient and Effective Training of Neural Networks

We present a novel approach for training artificial neural networks. Our approach is inspired by broad evidence in psychology that shows human learners can learn efficiently and effectively by increasing intervals of time between subsequent reviews of previously learned materials (spaced repetition). We investigate the analogy between training neural models and psychological findings on human memory, and develop an efficient and effective algorithm to train neural models. The core part of our algorithm is a cognitively-motivated scheduler according to which training instances and their “reviews” are spaced over time. Our algorithm uses only 34-50% of data per epoch, is 2.9-4.8 times faster than standard training, and outperforms competing state-of-the-art baselines. Our code is available at scholar.harvard.edu/hadi/RbF/.


Introduction
Deep neural models are known to be computationally expensive to train even with fast hardware (Sutskever et al., 2014; Wu et al., 2016). For example, it takes three weeks to train a deep neural machine translation system on 100 Graphics Processing Units (GPUs) (Wu et al., 2016). Furthermore, a large amount of data is usually required to train effective neural models (Goodfellow et al., 2016; Hirschberg and Manning, 2015). Bengio et al. (2009) and Kumar et al. (2010) developed training paradigms which are inspired by the learning principle that humans can learn more effectively when training starts with easier concepts and gradually proceeds with more difficult concepts. Since these approaches are motivated by a "starting small" strategy, they are called curriculum or self-paced learning.
In this paper, we present a novel training paradigm which is inspired by the broad evidence in psychology that shows human ability to retain information improves with repeated exposure and exponentially decays with delay since last exposure (Cepeda et al., 2006; Averell and Heathcote, 2011). Spaced repetition was presented in psychology (Dempster, 1989) and forms the building block of many educational devices, including flashcards, in which small pieces of information are repeatedly presented to a learner on a schedule determined by a spaced repetition algorithm. Such algorithms show that human learners can learn efficiently and effectively by increasing intervals of time between subsequent reviews of previously learned materials (Dempster, 1989; Novikoff et al., 2012).
We investigate the analogy between training neural models and psychological findings on human memory, and develop a spaced repetition algorithm (named Repeat before Forgetting, RbF) to efficiently and effectively train neural models. The core part of our algorithm is a scheduler that ensures a given neural network spends more time working on difficult training instances and less time on easier ones. Our scheduler is inspired by factors that affect human memory retention, namely, difficulty of learning materials, delay since their last review, and strength of memory. The scheduler uses these factors to lengthen or shorten review intervals with respect to individual learners and training instances. We evaluate schedulers based on their scheduling accuracy, i.e., accuracy in estimating network memory retention with respect to previously-seen instances, as well as their effect on the efficiency and effectiveness of downstream neural networks.²
The contributions of this paper are: (1) we show that memory retention in neural networks is affected by the same (known) factors that affect memory retention in humans, (2) we present a novel training paradigm for neural networks based on spaced repetition, and (3) our approach can be applied without modification to any neural network.
Our best RbF algorithm uses 34-50% of training data per epoch while producing similar results to state-of-the-art systems on three tasks, namely sentiment classification, image categorization, and arithmetic addition.³ It also runs 2.9-4.8 times faster than standard training, and outperforms competing state-of-the-art baselines.

Neural and Brain Memory Models
Research in psychology describes the following memory model for human learning: the probability that a human recalls a previously-seen item (e.g., the Korean translation of a given English word) depends on the difficulty of the item, delay since last review of the item, and the strength of the human memory. The relation between these indicators and memory retention has the following functional form (Reddy et al., 2016; Ebbinghaus, 1913):
$$\Pr(\text{recall}) = \exp\left(-\frac{\text{difficulty} \times \text{delay}}{\text{strength}}\right)$$
An accurate memory model enables estimating the time by which an item might be forgotten by a learner so that a review can be scheduled for the learner before that time.
We investigate the analogy between the above memory model and the memory model of artificial neural networks. Our intuition is that if the probability that a network recalls an item (e.g., correctly predicts its category) depends on the same factors (difficulty of the item, delay since last review of the item, or strength of the network), then we can develop spaced repetition algorithms to efficiently and effectively train neural networks.
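The memory model above can be illustrated with a minimal sketch. It assumes the exponential forgetting form exp(−difficulty × delay / strength) associated with the cited work; the function names `recall_probability` and `latest_review_delay` are hypothetical and chosen only for this illustration.

```python
import math


def recall_probability(difficulty: float, delay: float, strength: float) -> float:
    """Assumed exponential forgetting curve: exp(-difficulty * delay / strength)."""
    if strength <= 0:
        raise ValueError("strength must be positive")
    return math.exp(-difficulty * delay / strength)


def latest_review_delay(difficulty: float, strength: float, min_recall: float) -> float:
    """Largest delay for which the assumed recall probability stays >= min_recall."""
    if not 0.0 < min_recall < 1.0:
        raise ValueError("min_recall must lie strictly between 0 and 1")
    if difficulty <= 0:
        return math.inf  # an item with no difficulty is never forgotten under this model
    # Solve exp(-difficulty * delay / strength) = min_recall for delay.
    return -strength * math.log(min_recall) / difficulty
```

Under this form, recall drops as difficulty or delay grows and rises with strength, which is the qualitative behaviour the following experiments test for neural networks.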

Recall Indicators
We design a set of preliminary experiments to directly evaluate the effect of the aforementioned factors (recall indicators) on memory retention in neural networks. For this purpose, we use a set of training instances that are partially made available to the network during training. This scheme will allow us to intrinsically examine the effect of recall indicators on memory retention in isolation from external effects such as size of training data, number of training epochs, etc.
Figure 1: Effect of recall indicators on network retention. Training data is uniformly at random divided into three disjoint sets A, B, and C that respectively contain 80%, 10%, and 10% of the data. Network retention is computed against set B instances at recall point.
³ We obtained similar results on QA tasks (Weston et al., 2016) but they are excluded due to space limit.
We first define the following concepts to ease understanding the experiments (see Figure 1):
• First and last review points (fRev and lRev) of a training instance are the first and last epochs in which the instance is used to train the network, respectively.
• Recall point (Rec) is the epoch in which network retention is computed against some training instances; network retention is the probability that a neural network recalls (i.e., correctly classifies) a previously-seen training instance.
• Delay since last review of a training instance is the difference between the recall point and the last review point of the training instance.
Given training data and a neural network, we uniformly at random divide the data into three disjoint sets: a base set A, a review set B, and a replacement set C that respectively contain 80%, 10%, and 10% of the data. As depicted in Figure 1, instances of A are used for training at every epoch, while those in B and C are partially used for training. The network initially starts to train with {A ∪ C} instances. Then, starting from the first review point, we inject the review set B and remove C, training with {A ∪ B} instances at every epoch until the last review point. The network will then continue training with {A ∪ C} instances until the recall point. At this point, network retention is computed against set B instances, with delay defined as the number of epochs since last review point. The intuition behind using review and replacement sets, B and C respectively, is to avoid external effects (e.g. size of data or network generalization and learning capability) for our intrinsic evaluation purpose.
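The data schedule described above can be sketched as follows. This is a minimal illustration only: the helper names (`split_abc`, `training_set_for_epoch`, `delay_since_last_review`) are hypothetical, instances are represented by integer indices, and the review window is assumed to include both fRev and lRev.

```python
import random


def split_abc(n_instances: int, seed: int = 0):
    """Uniformly at random split indices into A (80%), B (10%), and C (10%)."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    n_b = n_instances // 10
    n_c = n_instances // 10
    set_b = idx[:n_b]
    set_c = idx[n_b:n_b + n_c]
    set_a = idx[n_b + n_c:]
    return set_a, set_b, set_c


def training_set_for_epoch(epoch, f_rev, l_rev, set_a, set_b, set_c):
    """Train on A ∪ B inside the review window [f_rev, l_rev], on A ∪ C otherwise."""
    in_review_window = f_rev <= epoch <= l_rev
    return set_a + (set_b if in_review_window else set_c)


def delay_since_last_review(rec: int, l_rev: int) -> int:
    """Number of epochs between the last review point and the recall point."""
    return rec - l_rev
```

Because B and C have the same size, the network sees the same number of instances at every epoch, which is what isolates the recall indicators from the effect of training-set size.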
To conduct these experiments, we identify different neural models designed for different tasks.⁴ For each network, we fix the recall point to either the epoch in which the network is fully trained (i.e., obtains its best performance based on standard or "rote" training in which all instances are used for training at every iteration), or partially trained (i.e., obtains half of its best performance based on rote training). We report average results across these networks for each experiment.

Delay since Last Review
As aforementioned, delay since last review of a training instance is the difference between the recall point (Rec) and the last review point (lRev) of the training instance. We evaluate the effect of delay on network retention (against set B instances) by keeping the recall point fixed while moving the sliding window in Figure 1. Figures 2(a) and 2(b) show average network retention across networks for the fully and partially trained recall points respectively. The results show an inverse relationship between network retention and delay since last review in neural networks.

Item Difficulty
We define difficulty of training instances by the loss values generated by a network for the instances. Figure 2(c) shows the difficulty of set B instances at the last review point against average network retention on these instances at recall point. We normalize loss values to unit vectors (to make them comparable across networks) and then average them across networks for both fully and partially trained recall points. As the results show, network retention decreases as item difficulty increases.

Network Strength
We define strength of a network by its performance on validation data. To understand the effect of network strength on its retention, we use the same experimental setup as before except that we keep the delay (difference between recall point and last review point) fixed while gradually increasing the recall point; this will make the networks stronger by training them for more epochs. Then, at every recall point, we record network retention on set B instances and network accuracy on validation data. Average results across networks for two sets of 10 consecutive recall points (before fully and partially trained recall points) are shown in Figure 2(d). As the results show, network retention increases as memory strength increases.
The above experiments show that memory retention in neural networks is affected by the same factors that affect memory retention in humans: (a) neural networks forget training examples after a certain period of intervening training data, (b) the period of recall is shorter for more difficult examples, and (c) recall improves as networks achieve better overall performance. We conclude that delay since last review, item difficulty (loss values of training instances), and memory strength (network performance on validation data) are key indicators that affect network retention and propose to design spaced repetition algorithms that take such indicators into account in training neural networks.
Algorithm 1. Leitner System
Input: H: training data, V: validation data, k: number of iterations, n: number of queues
Output: trained model
  Q = [q_0, q_1, ..., q_{n−1}], with q_0 = H and all other queues empty
  For epoch = 1 to k:
    current_batch = []
    For i = 0 to n − 1:
      If epoch % 2^i == 0:
        current_batch = current_batch + q_i
    End for
    pmos, dmos, model = train(current_batch, V)
    update_queue(Q, pmos, dmos)
  End for
  return model

Spaced Repetition
We present two spaced repetition-based algorithms: a modified version of the Leitner system developed in Reddy et al. (2016), and our Repeat before Forgetting (RbF) model.

Leitner System
Suppose we have $n$ queues $\{q_0, q_1, \ldots, q_{n-1}\}$. The Leitner system initially places all training instances in the first queue, $q_0$. As Algorithm 1 shows, at each training iteration, the Leitner scheduler chooses some queues to train a downstream neural network. Only instances in the selected queues will be used for training the network. During training, if an instance from $q_i$ is recalled (e.g., correctly classified) by the network, the instance will be "promoted" to $q_{i+1}$; otherwise it will be "demoted" to the first queue, $q_0$.⁵ The Leitner scheduler reviews instances of $q_i$ every $2^i$ iterations. Therefore, instances in lower queues (difficult/forgotten instances) are reviewed more frequently than those in higher queues (easy/recalled ones). Figure 3 (bottom) provides examples of queues and their processing epochs. Note that the overhead imposed on training by the Leitner system is O(|current batch|) at every epoch for moving instances between queues.
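The queueing behaviour described above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the class name `LeitnerScheduler` is hypothetical, epochs are assumed to be numbered from 1, and an instance recalled while already in the last queue is assumed to stay there.

```python
class LeitnerScheduler:
    """Tracks which queue each training instance (an integer index) is in."""

    def __init__(self, n_instances: int, n_queues: int = 5):
        self.n_queues = n_queues
        # All instances start in the first queue, q_0.
        self.queue_of = [0] * n_instances

    def current_batch(self, epoch: int):
        """Instances of queue q_i are reviewed whenever epoch % 2**i == 0."""
        active = {i for i in range(self.n_queues) if epoch % (2 ** i) == 0}
        return [h for h, q in enumerate(self.queue_of) if q in active]

    def update(self, batch, recalled):
        """Promote recalled instances by one queue; demote the rest to q_0."""
        for h, ok in zip(batch, recalled):
            if ok:
                # Assumption: promotion is capped at the last queue.
                self.queue_of[h] = min(self.queue_of[h] + 1, self.n_queues - 1)
            else:
                self.queue_of[h] = 0
```

A training loop would call `current_batch(epoch)`, train on those instances, and pass the per-instance recall outcomes to `update`. Only the instances in the batch are touched by `update`, which matches the O(|current batch|) overhead noted above.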

RbF Memory Models
The challenge in developing memory models is to estimate the time by which a training instance should be reviewed before it is forgotten by the network. Accurate estimation of the review time leads to efficient and effective training. However, a heuristic scheduler such as the Leitner system is suboptimal as its hard review schedules (i.e., only $2^i$-iteration delays) may lead to early or late reviews.
We develop flexible schedulers that take recall indicators into account in the scheduling process. Our schedulers lengthen or shorten inter-repetition intervals with respect to individual training instances. In particular, we propose using density kernel functions to estimate the latest epoch in which a given training instance can be recalled. We aim to investigate how much improvement (in terms of efficiency and effectiveness) can be achieved using more flexible schedulers that utilize the recall indicators.
We propose considering density kernels as schedulers that favor (i.e., more confidently delay) less difficult training instances in stronger networks. As a kernel we can use any non-increasing function of the following quantity:
$$x_i = \frac{d_i \times t_i}{s_e}$$
where $d_i$ indicates the loss of the network for a training instance $h_i \in H$, $t_i$ indicates the number of epochs to the next review of $h_i$, and $s_e$ indicates the performance of the network on validation data at epoch $e$. We investigate the Gaussian, Laplace, Linear, Cosine, Quadratic, and Secant kernels, each of which has a learning parameter $\tau$. Figure 4 depicts these kernels with $\tau = 1$. As we will discuss in the next section, we use these kernels to optimize delay with respect to item difficulty and network strength for each training instance.
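To make the idea of a kernel-based memory model concrete, the sketch below gives one plausible form for each of the six named kernels. These functional forms are assumptions for illustration (common textbook shapes, each equal to 1 at x = 0 and non-increasing for x ≥ 0) and may differ from the exact definitions used in the paper. The argument `x` is taken to be the quantity d_i × t_i / s_e, and `tau` is assumed positive.

```python
import math


def scaled_delay(d_i: float, t_i: float, s_e: float) -> float:
    """Kernel input: instance loss times delay, divided by network strength."""
    return d_i * t_i / s_e


def gaussian(x: float, tau: float) -> float:
    return math.exp(-tau * x * x)


def laplace(x: float, tau: float) -> float:
    return math.exp(-tau * x)


def linear(x: float, tau: float) -> float:
    return max(0.0, 1.0 - tau * x)


def cosine(x: float, tau: float) -> float:
    # Assumed raised-cosine shape on [0, 1/tau), zero afterwards.
    return 0.5 * math.cos(tau * math.pi * x) + 0.5 if x < 1.0 / tau else 0.0


def quadratic(x: float, tau: float) -> float:
    return max(0.0, 1.0 - tau * x * x)


def secant(x: float, tau: float) -> float:
    # 2 / (exp(-a) + exp(a)) equals 1 / cosh(a); a is clamped to avoid overflow.
    a = min(tau * x * x, 700.0)
    return 1.0 / math.cosh(a)
```

Each function can be read as a predicted recall confidence: it starts at 1 when there is no delay and falls as the delay or the loss grows, or as the network gets weaker.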

RbF Algorithm
Our Repeat before Forgetting (RbF) model is a spaced repetition algorithm that takes into account the previously validated recall indicators to train neural networks (see Algorithm 2). RbF divides training instances into current and delayed batches based on their delay values at each iteration. Instances in the current batch are those whose recall RbF is less confident about, and they are therefore reviewed (used to re-train the network) at the current iteration. On the other hand, instances in the delayed batch are those that are likely to be recalled by the network in the future and are therefore not reviewed at the current epoch. At each iteration, the RbF scheduler estimates the optimum delay (number of epochs to next review) for each training instance in the current batch. RbF makes such item-specific estimations as follows: given the difficulty of a training instance $d_i$, the memory strength of the neural network at epoch $e$, $s_e$, and an RbF memory model $f$ (see Section 3.2.1), the RbF scheduler estimates the maximum delay $\hat{t}_i$ for the instance such that it can be recalled with a confidence greater than the given threshold $\eta \in (0, 1)$ at time $e + \hat{t}_i$. As described before, $d_i$ and $s_e$ can be represented by the current loss of the network for the instance and the current performance of the network on validation data respectively. Therefore, the maximum delay between the current (epoch $e$) and next reviews of the instance can be estimated as follows:
$$\hat{t}_i = \arg\min_{t_i} \left( f(d_i, t_i, s_e, \hat{\tau}) - \eta \right)^2 \quad \text{s.t.} \quad 1 \le t_i \le k - e \qquad (9)$$
where $\hat{\tau}$ is the optimum value for the learning parameter obtained from validation data, see Equation (10). In principle, reviewing instances could be delayed for any number of epochs; in practice, however, delay is bounded both below and above (e.g., by queues in the Leitner system). Thus, we assume that, at each epoch $e$, instances could be delayed for at least one iteration and at most $k - e$ iterations, where $k$ is the total number of training epochs. We also note that $\hat{t}_i$ is a lower bound of the maximum delay as $s_e$ is expected to increase and $d_i$ is expected to decrease as the network trains in the next iterations.
Algorithm 2 (lines 7-9):
  7  t_i = t_i − 1  ∀ h_i ∈ delayed batch
  8  End for
  9  return model
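The delay estimation described above can be sketched as a search for the largest whole-epoch delay whose predicted recall still meets the confidence threshold η, bounded to [1, k − e]. The sketch makes several assumptions that are not confirmed by the text: the raised-cosine form of the kernel, the rule that instances with a remaining delay of at most 1 form the current batch, and the helper names `estimate_delay` and `update_delays`.

```python
import math


def cosine_kernel(x: float, tau: float) -> float:
    """Assumed raised-cosine memory model: 1 at x = 0, 0 for x >= 1/tau."""
    return 0.5 * math.cos(tau * math.pi * x) + 0.5 if x < 1.0 / tau else 0.0


def estimate_delay(d_i, s_e, tau, eta, e, k, kernel=cosine_kernel):
    """Largest t in [1, k - e] with kernel(d_i * t / s_e, tau) >= eta (at least 1)."""
    max_delay = max(1, k - e)
    best = 1  # instances are always delayed for at least one iteration
    for t in range(1, max_delay + 1):
        if kernel(d_i * t / s_e, tau) >= eta:
            best = t
        else:
            break  # the kernel is non-increasing, so larger delays also fail
    return best


def update_delays(delays, losses, s_e, tau, eta, e, k):
    """Re-estimate delays for reviewed instances; count down the delayed ones."""
    new_delays = {}
    for h, t in delays.items():
        if t <= 1:  # assumption: these instances were in the current batch
            new_delays[h] = estimate_delay(losses[h], s_e, tau, eta, e, k)
        else:       # delayed batch: one epoch closer to the next review
            new_delays[h] = t - 1
    return new_delays
```

For example, with loss 0.1, validation accuracy 0.9, τ = 1 and η = 0.5, `estimate_delay` returns 4 when 20 epochs remain, while a high-loss instance (loss 5.0 at strength 0.5) is scheduled for review at the very next epoch.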
Algorithm 2 shows the outline of the proposed RbF model. We estimate the optimum value of $\tau$ (line 5 of Algorithm 2) for RbF memory models using validation data. In particular, RbF uses the loss values of validation instances and strength of the network obtained at the previous epoch to estimate network retention for validation instances at the current epoch (therefore $t_i = 1$ for every validation instance). The parameter $\tau$ for each memory model is computed as follows:
$$\hat{\tau} = \arg\min_{\tau} \sum_{h_j \in V} \left( f(d_j, 1, s_{e-1}, \tau) - a_j \right)^2 \qquad (10)$$
where $a_j \in (0, 1)$ is the current accuracy of the model for the validation instance $h_j$. RbF then predicts the delay for current batch instances and reduces the delay for those in the delayed batch by one epoch. The overhead of RbF is O(|H|) to compute delays and O(|V|) to compute $\hat{\tau}$. Note that (9) and (10) [...]

Experiments
Table 1 describes the tasks, datasets, and models that we consider in our experiments. It also reports the training epochs for which the models produce their best performance on validation data (based on rote training). We note that the Addition dataset is randomly generated and contains numbers with at most 4 digits.⁶ We consider three schedulers as baselines: a slightly modified version of the Leitner scheduler (Lit) developed in Reddy et al. (2016) for human learners (see Footnote 5), curriculum learning (CL) in which training instances are scheduled with respect to their easiness (Jiang et al., 2015), and the uniform scheduler of rote training (Rote) in which all instances are used for training at every epoch. For Lit, we experimented with different numbers of queues, n ∈ {3, 5, 7}, and set n = 5 in the experiments as this value led to the best performance of this scheduler across all datasets.
Curriculum learning starts training with easy instances and gradually introduces more complex instances for training. Since easiness information is not readily available in most datasets, previous approaches have used heuristic techniques (Spitkovsky et al., 2010; Basu and Christensen, 2013) or optimization algorithms (Jiang et al., 2015, 2014) to quantify easiness of training instances. These approaches consider an instance as easy if its loss is smaller than a threshold (λ). We adopt this technique as follows: at each iteration $e$, we divide the entire training data into easy and hard sets using an iteration-specific $\lambda_e$ and the loss values of instances, obtained from the current partially-trained network. All easy instances in conjunction with an $\alpha_e \in [0, 1]$ fraction of the easiest hard instances (those with the smallest loss values greater than $\lambda_e$) are used for training at iteration $e$. We set each $\lambda_e$ to the average loss of training instances that are correctly classified by the current partially-trained network. Furthermore, at each iteration $e$, we set $\alpha_e = e/k$ to gradually introduce complex instances at every new iteration.⁷ Note that we treat all instances as easy at $e = 0$. Performance values reported in experiments are averaged over 10 runs of systems and the confidence parameter η is always set to 0.5 unless otherwise stated.
Figure 6: Accuracy of schedulers in predicting network retention. For these experiments recall confidence is set to its default value, η = 0.5.
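The curriculum learning baseline described above can be sketched as a per-iteration selection step. The function name `curriculum_batch` is hypothetical, and three details are assumptions of this sketch: instances whose loss equals the threshold count as hard, the number of admitted hard instances is rounded down, and all instances are treated as easy when no instance is classified correctly.

```python
def curriculum_batch(losses, correct, e, k):
    """Indices used for training at iteration e (out of k) under the CL baseline."""
    n = len(losses)
    if e == 0:
        return list(range(n))  # all instances are treated as easy at e = 0
    correct_losses = [l for l, c in zip(losses, correct) if c]
    if not correct_losses:
        return list(range(n))  # assumption: no threshold can be computed yet
    # Threshold: average loss of the correctly classified instances.
    lam = sum(correct_losses) / len(correct_losses)
    easy = [i for i in range(n) if losses[i] < lam]
    # Hard instances sorted from easiest (smallest loss) to hardest.
    hard = sorted((i for i in range(n) if losses[i] >= lam), key=lambda i: losses[i])
    alpha = e / k  # fraction of the easiest hard instances to admit
    n_hard = int(alpha * len(hard))
    return sorted(easy + hard[:n_hard])
```

The sort over hard instances at every iteration is the step that makes this baseline comparatively expensive per epoch.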

Evaluation of Memory Models
In these experiments, we evaluate memory schedulers with respect to their accuracy in predicting network retention for delayed instances. Since curriculum learning does not estimate delay for training instances, we only consider Leitner and RbF schedulers in these experiments.
For this evaluation, if a scheduler predicts a delay $t$ for a training instance $h$ at epoch $e$, we evaluate network retention with respect to $h$ at epoch $e + t$. If the network recalls (correctly classifies) the instance at epoch $e + t$, the scheduler has correctly predicted network retention for $h$; otherwise, it has made a wrong prediction. We use this binary outcome to evaluate the accuracy of each scheduler. Note that the performance of schedulers on instances that have not been delayed is not a major concern. Although failing to delay an item adversely affects efficiency, it makes the network stronger by providing more instances to train from. Therefore, we consider a good scheduler as the one that accurately delays more items.
Figure 6 depicts the average accuracy of schedulers in predicting networks' retention versus the average fraction of training instances that they delayed per epoch. As the results show, all schedulers delay a substantial number of instances per epoch. In particular, Cos and Qua outperform Lit in both predicting network retention and delaying items, delaying around 50% of training instances per epoch. Gau and Sec show comparable accuracy to Lit but delay more instances. On the other hand, Lap, which has been found effective in psychology, and Lin are less accurate in predicting network retention. This is because of the tradeoff between delaying more instances and creating stronger networks. Since these schedulers are more flexible in delaying a greater number of instances, they might not provide networks with enough data to fully train.
Figure 7 shows the performance of RbF schedulers with respect to the recall confidence parameter η, see Equation (9). As the results show, schedulers have poor performance with smaller values of η. This is because smaller values of η make schedulers very flexible in delaying instances. However, the performance of schedulers is not dramatically low even with very small ηs. Our further analyses on the delay patterns show that although a smaller η leads to more delayed instances, the delays are significantly shorter. Therefore, most delayed instances will be "reviewed" shortly in the next epochs. These bulk reviews make the network stronger and help it to recall most delayed instances in future iterations.
⁷ k is the total number of iterations.
On the other hand, greater ηs lead to more accurate schedulers at the cost of using more training data. In fact, we found that larger ηs do not delay most training instances in the first few iterations. However, once the network obtains a reasonably high performance, schedulers start delaying instances for longer durations. We will further study this effect in the next section.

Efficiency and Effectiveness
We compare RbF against Leitner and curriculum learning in terms of efficiency of training and effectiveness of trained models. We define effectiveness as the accuracy of a trained network on balanced test data, and efficiency as (a) the fraction of instances used for training per epoch, and (b) the time required for training the networks. For RbF schedulers, we set η to 0.5 and also consider the best performing kernel, Cosine, with η = 0.9 based on the results in Figure 7.
The results in Table 2 show that all training paradigms have comparable effectiveness (Accuracy) to that of rote training (Rote). Our RbF schedulers use less data per epoch (34-50% of data) and run considerably faster than Rote (2.90-4.78 times faster for η = 0.5). The results also show that Lit is slightly less accurate but runs 2.87 times faster than Rote; note that, as a scheduler, Lit is less accurate than RbF models, see Figures 6 and 7.
In addition, CL leads to comparable performance to RbF but is considerably slower than other schedulers. This is because this scheduler has to identify easier instances and sort the harder ones to sample training data at each iteration. Overall, the performance of Lit, CL, Cos η = .5, and Cos η = .9 is only 2.76, 1.90, 1.88, and 0.67 lower in absolute terms than that of Rote, respectively. Considering the achieved efficiency, these differences are negligible (see the overall gain in Table 2). Figure 8 reports detailed efficiency and effectiveness results across datasets and networks. For clear illustration, we report accuracy at iterations $2^i$ $\forall i$, in which Lit is trained on the entire data, and consider Cos η = .5 as the RbF scheduler. In terms of efficiency (first row of Figure 8), CL starts with a small set of easier instances and gradually increases the amount of training data by adding slightly harder instances into its training set. On the other hand, Lit and RbF start big and gradually delay reviewing (easy) instances that the networks have learned. The difference between these two training paradigms is apparent in Figures 8(a)-8(c).
The results also show that the efficiency of a training paradigm depends on the initial effectiveness of the downstream neural network. For CL to be efficient, the neural network needs to initially have low performance (accuracy) so that the scheduler works on a smaller set of easy instances. For example, in the case of Addition, Figures 8(b) and 8(e), the initial network accuracy is only 35%; therefore most instances are expected to be initially treated as hard instances and not be used for training. On the other hand, CL shows a considerably lower efficiency for networks with a slightly higher initial accuracy, e.g., in the case of IMDb or CIFAR10 where the initial network accuracy is above 56%.
In contrast to CL, Lit and RbF are more efficient when the network has a relatively higher initial performance. A higher initial performance helps the schedulers to more confidently delay "reviewing" most instances and therefore train with a much smaller set of instances. For example, since the initial network accuracy in IMDb or CIFAR10 is above 56%, Lit and RbF are considerably more efficient from the beginning of the training process. However, in the case of low initial performance, Lit and RbF tend to avoid delaying instances at lower iterations, which leads to poor efficiency at the beginning. This is the case for the Addition dataset in which instances are gradually delayed by these two schedulers even at epoch 8 when the performance of the network reaches above 65%, see Figures 8(e) and 8(b). However, Lit gains its true efficiency after iteration 12, see Figure 8(b), while RbF still gradually improves the efficiency. This might be because of the lower bound delays that RbF estimates, see Equation (9). Furthermore, the effectiveness results in Figure 8 (bottom) show that all schedulers produce comparable accuracy to the Rote scheduler throughout the training process, not just at specific iterations. This indicates that these training paradigms can achieve the same generalizability as standard training much faster, see Figures 8(b) and 8(e).

Robustness against Overtraining
We investigate the effect of spaced repetition on overtraining. The optimal number of training epochs required to train fastText on the IMDb dataset is 8 epochs (see Table 1). In this experiment, we run fastText on IMDb for a greater number of iterations to investigate the robustness of different schedulers against overtraining. The results in Figure 9 show that Lit and RbF (Cos η = 0.5) are more robust against overtraining. In fact, the performance of Lit and RbF further improves at epoch 16 while CL and Rote overfit at epoch 16 (note that CL and Rote also require considerably more time to reach higher iterations). We attribute the robustness of Lit and RbF to the scheduling mechanism, which helps the networks to avoid retraining with easy instances. On the other hand, overtraining affects Lit and RbF at higher training iterations (compare the performance of each scheduler at epochs 8 and 32). This might be because these training paradigms overfit the network by paying too much training attention to very hard instances, which might introduce noise to the model.

Related Work
Ebbinghaus (1913, 2013), and recently Murre and Dros (2015), studied the hypothesis of the exponential nature of forgetting, i.e., how information is lost over time when there is no attempt to retain it. Previous research identified three critical indicators that affect the probability of recall: repeated exposure to learning materials, elapsed time since their last review (Ebbinghaus, 1913; Wixted, 1990; Dempster, 1989), and more recently item difficulty (Reddy et al., 2016). We based our investigation on these findings and validated that these indicators indeed affect memory retention in neural networks. We then developed training paradigms that utilize the above indicators to train networks.
Bengio et al. (2009) and Kumar et al. (2010) also developed cognitively-motivated training paradigms which are inspired by the principle that learning can be more effective when training starts with easier concepts and gradually proceeds with more difficult ones. Our idea is motivated by the spaced repetition principle, which indicates that learning improves with repeated exposure and decays with delay since last exposure (Ebbinghaus, 1913; Dempster, 1989). Based on this principle, we developed schedulers that space the reviews of training instances over time for efficient and effective training of neural networks.

Conclusion and Future Work
We developed a cognitively-motivated training paradigm (scheduler) that spaces instances over time for efficient and effective training of neural networks. Our scheduler uses only a small fraction of training data per epoch but still effectively trains neural networks. It achieves this by estimating the time (number of epochs) by which training could be delayed for each instance. Our work was inspired by three recall indicators that affect memory retention in humans, namely difficulty of learning materials, delay since their last review, and memory strength of the learner, which we validated in the context of neural networks.
There are several avenues for future work. One is to study the extent to which our RbF model and its kernels could be combined with curriculum learning or the Leitner system, either to predict the easiness of novel training instances to inform curriculum learning or to incorporate Leitner's queueing mechanism into the RbF model. Other directions include extending RbF to dynamically learn the recall confidence parameter with respect to network behavior, or developing more flexible delay functions with theoretical analysis of their lower and upper bounds.