Continual Lifelong Learning in Natural Language Processing: A Survey

Continual learning (CL) aims to enable information systems to learn from a continuous data stream across time. However, it is difficult for existing deep learning architectures to learn a new task without largely forgetting previously acquired knowledge. Furthermore, CL is particularly challenging for language learning, as natural language is ambiguous: it is discrete, compositional, and its meaning is context-dependent. In this work, we look at the problem of CL through the lens of various NLP tasks. Our survey discusses major challenges in CL and current methods applied in neural network models. We also provide a critical review of the existing CL evaluation methods and datasets in NLP. Finally, we present our outlook on future research directions.


Introduction
Human beings learn by building on their memories and applying past knowledge to understand new concepts. Unlike humans, existing neural networks (NNs) mostly learn in isolation and can be used effectively only for a limited time. Models become less accurate over time, for instance, due to the changing distribution of data - the phenomenon known as concept drift (Schlimmer and Granger, 1986; Widmer and Kubat, 1993). With the advent of deep learning, the problem of continual learning (CL) in Natural Language Processing (NLP) is becoming even more pressing, as current approaches are not able to effectively retain previously learned knowledge and adapt to new information at the same time.
Throughout the years, numerous methods have been proposed to address the challenge known as catastrophic forgetting (CF) or catastrophic interference (McCloskey and Cohen, 1989). Naïve approaches to mitigating the problem, such as retraining the model from scratch to adapt to a new task (or a new data distribution), are costly and time-consuming. This is reinforced by the problems of capacity saturation and model expansion. Concretely, a parametric model, while learning data samples with different distributions or progressing through a sequence of tasks, eventually reaches a point at which no more knowledge can be stored, i.e. its representational capacity approaches the limit (Sodhani et al., 2020; Aljundi et al., 2019). At this point, either the model's capacity is expanded, or selective forgetting - which likely incurs performance degradation - is applied. The latter choice may result either in deteriorated prediction accuracy on new tasks (or data distributions) or in forgetting previously acquired knowledge. This constraint is underpinned by a defining characteristic of CL known as the stability-plasticity dilemma: the model's attempt to strike a balance between its stability (the ability to retain prior knowledge) and its plasticity (the ability to adapt to new knowledge).
CL in the NLP domain, as opposed to computer vision or robotics, is still nascent (Greco et al., 2019; Sun et al., 2020). The differences are reflected in the small number of proposed methods aiming to alleviate the aforementioned issues and the evaluation benchmarks. To the best of our knowledge, apart from the work of Chen and Liu (2018), our paper is the only study summarizing the research progress related to continual, lifelong learning in NLP.

Learning Paradigms
In this section, we discuss principles of CL and related machine learning (ML) paradigms, as well as contemporary approaches to mitigate CF.

Continual Learning
Continual learning (Ring, 1994) is a machine learning paradigm whose objective is to adaptively learn across time by leveraging previously learned tasks to improve generalization for future tasks. Hence, CL studies the problem of sequential learning from a continuous stream of data, drawn from a potentially non-stationary distribution, and reusing gained knowledge throughout the lifetime while avoiding CF.
More formally, the goal is to sequentially learn a model $f: \mathcal{X} \times \mathcal{T} \rightarrow \mathcal{Y}$ over a large number of tasks $\mathcal{T}$. The model is trained on examples $(x_i, y_i)$, where $x_i \in \mathcal{X}_{t_i}$ is an input feature vector, $y_i \in \mathcal{Y}_{t_i}$ is a target vector (e.g. a class label), and $t_i \in \mathcal{T}$ denotes a task descriptor (in the simplest case $t_i = i$, with $i \in \mathbb{Z}$). The objective is to maximize the performance of $f$ (parameterized by $\theta$) on task $T_i$, while minimizing CF for tasks $T_1, \dots, T_{i-1}$.
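To make this setting concrete, the following minimal Python sketch illustrates the sequential protocol and the accuracy matrix $a_{j,i}$ used by the evaluation metrics discussed later; `train_fn` and `eval_fn` are hypothetical placeholders for any task-specific training and evaluation routines.

```python
def continual_training(model, tasks, train_fn, eval_fn):
    """Sequentially train `model` on `tasks` and record the accuracy matrix.

    acc[j][i] holds the performance on the held-out set of task i after the
    model has finished training on task j (tasks are visited once, in order).
    `train_fn` and `eval_fn` are hypothetical, user-supplied callables.
    """
    acc = []
    for j, task in enumerate(tasks):
        train_fn(model, task["train"])  # only current-task data is visible here
        acc.append([eval_fn(model, t["test"]) for t in tasks[: j + 1]])
    return acc
```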
Although the above-mentioned definitions of CL may seem fairly general, there are certain desired properties, which are summarized in Table 1.

Table 1: Desired properties of CL systems.

Knowledge retention: The model is not prone to catastrophic forgetting.
Forward transfer: The model learns a new task while reusing knowledge acquired from previous tasks.
Backward transfer: The model achieves improved performance on previous tasks after learning a new task.
On-line learning: The model learns from a continuous data stream.
No task boundaries: The model learns without requiring clear task or data boundaries.
Fixed model capacity: Memory size is constant regardless of the number of tasks and the length of the data stream.

In practice, current CL systems often relax at least one of the requirements listed in Table 1. Most methods still follow the off-line learning paradigm: models are trained using batches of data shuffled in such a way as to satisfy the independent and identically distributed (i.i.d.) assumption. Consequently, many models are trained solely in a supervised fashion with large labeled datasets, and thus they are not exposed to more challenging situations involving few-shot, unsupervised, or self-supervised learning. Additionally, existing approaches often fail to restrict themselves to a single pass over the data, which entails longer training times. Moreover, the number of tasks as well as their identities are frequently known to the system from the outset.

Related Machine Learning Paradigms
Traditionally, many ML models are designed to be trained for merely a single task. However, it has been proven that transferring knowledge learned from one task and applying it to another task is a powerful mechanism for NNs. In many respects, CL bears some resemblance to other dominant learning approaches. Therefore, in this section, we draw connections between various ML paradigms. We provide an overview of the approaches, and in particular, we shed light on the shared principles as well as on the aspects that make CL different from other ML paradigms (see Table 2).
In principle, we assume that the ability of a model to generalize can be considered one of its most important characteristics. Importantly, if tasks are related, then knowledge transfer between tasks should lead to better generalization and faster learning (Lopez-Paz and Ranzato, 2017; Sodhani et al., 2020). Therefore, we compare the paradigms taking into account how well they are able to leverage an inductive bias. Specifically, positive backward transfer improves the performance on old tasks, while negative backward transfer degrades it.

Table 2: Related ML paradigms (paradigm, definition, properties, and related works).

Transfer learning: Transferring knowledge from a source task/domain to a target task/domain to improve the performance of the target task.

Multi-task learning: Learning multiple related tasks jointly, using parameter sharing, to improve the generalization of all the tasks. Properties: + positive transfer; - negative transfer; - task boundaries; - off-line learning. Related works: (Caruana, 1997).

Metalearning: Learning to learn, i.e. learning generic knowledge, given a small set of training examples and numerous tasks, and quickly adapting to a new task. Properties: + forward transfer; - no backward transfer; - no knowledge retention; - off-line learning. Related works: (Thrun and Pratt, 1998).

Approaches to Continual Learning
The majority of existing CL approaches tend to apply a single model structure to all tasks (Li et al., 2019) and control CF by various scheduling schemes. We distinguish three main families of methods - rehearsal, regularization, and architectural - as well as a few hybrid categories. Importantly, the number of models originating purely from the NLP domain is quite limited.
Rehearsal methods rely on retaining some training examples from prior tasks, so that they can later be replayed while learning the task at hand. Rebuffi et al. (2017b) proposed the best-known method for incremental class learning, the iCaRL model. However, as training samples are kept for each task and are periodically replayed during training, the computing and memory requirements of the model increase proportionally to the number of tasks. To reduce storage, it is advised to use either latent replay (Pellegrini et al., 2019) or pseudo-rehearsal (Robins, 1995) methods.
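To illustrate the general rehearsal recipe (a hedged sketch of the family, not iCaRL's specific herding-based exemplar selection), the snippet below keeps a small reservoir-sampled memory of past examples that can be mixed into each training batch; the class name and capacity are illustrative.

```python
import random

class ReplayBuffer:
    """Reservoir-style memory of past (x, y) pairs, capped at `capacity`."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            # Reservoir sampling keeps each example seen so far with equal probability.
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.data[idx] = example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

# During training on a new task, each mini-batch is augmented with replayed examples:
#   batch = current_task_batch + buffer.sample(k)
# and the freshly seen examples are added back to the buffer.
```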
Pseudo-rehearsal methods are a sub-group of rehearsal methods. Instead of using training samples from memory, pseudo-rehearsal models generate examples by modeling the probability distribution of previous task samples. Notable approaches include a generative autoencoder (FearNet, Kemker and Kanan, 2018) and a model based on Generative Adversarial Networks (DGR, Shin et al., 2017).
Regularization methods are single-model approaches that rely on a fixed model capacity with an additional loss term that aids knowledge consolidation while learning subsequent tasks or data distributions. For instance, Elastic Weight Consolidation (EWC, Kirkpatrick et al., 2016) reduces forgetting by regularizing the loss; in other words, it slows down the learning of parameters important for previous tasks.
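As a rough illustration of this family, the PyTorch-style sketch below follows the generic EWC recipe: after finishing a task, a diagonal Fisher estimate and a copy of the parameters are stored, and a quadratic penalty discourages changes to important parameters while learning the next task. This is a simplified sketch of the published idea, not the authors' reference implementation; the helper names are our own.

```python
import torch

def estimate_fisher(model, data_loader, loss_fn):
    """Diagonal (empirical) Fisher approximation: average squared gradients of the loss."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(data_loader), 1) for n, f in fisher.items()}

def ewc_penalty(model, old_params, fisher, lam):
    """Quadratic penalty keeping parameters close to their previous-task values."""
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty

# Total loss on the new task: loss = task_loss + ewc_penalty(model, old_params, fisher, lam)
```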
Memory methods are a special case of regularization methods that can be divided into two groups: synaptic regularization (Zenke et al., 2017; Kirkpatrick et al., 2016; Chaudhry et al., 2018) and episodic memory (Li and Hoiem, 2016; Jung et al., 2016; Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019b). The former methods focus on reducing interference with consolidated knowledge by adjusting learning rates in a way that prevents changes to previously learned model parameters. The latter store training samples from previously seen data, which are later rehearsed to allow learning new classes. Gradient Episodic Memory (GEM, Lopez-Paz and Ranzato, 2017) is a prominent example of the latter group.

Knowledge distillation methods bear a close resemblance to episodic memory methods, but unlike GEM they keep the predictions on past tasks invariant (Rebuffi et al., 2017b; Lopez-Paz and Ranzato, 2017). In particular, knowledge distillation is a class of methods that alleviates CF by relying on knowledge transfer from a large network (teacher) to a new, smaller network (student) (Hinton et al., 2015). The underlying idea is that the student model learns to reproduce the predictions of the teacher model. As demonstrated in Kim and Rush (2016) and Wei et al. (2019), knowledge distillation approaches can prove especially suitable for neural machine translation models, which are mostly large, and hence a reduction in size is beneficial.
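For completeness, the sketch below shows the standard distillation loss of Hinton et al. (2015) that such methods build on: the student matches the teacher's temperature-softened output distribution in addition to the usual supervised objective. The temperature and mixing weight are illustrative hyperparameters.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Combine the usual cross-entropy with a KL term that matches the teacher."""
    ce = F.cross_entropy(student_logits, labels)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL between softened distributions, scaled by T^2 as in Hinton et al. (2015).
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd
```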
Architectural methods prevent forgetting by applying modular changes to the network's architecture and introducing task-specific parameters. Typically, previous task parameters are kept fixed.

Evaluation
Even though CL is now experiencing a surge in the number of proposed new methods, there is no unified approach when it comes to their evaluation using benchmark datasets and metrics (Parisi et al., 2019). And as we will show in this section, this is especially true in the NLP domain. There is a scarcity of datasets and benchmark evaluation schemes available specifically for CL in NLP.

Protocols
Researchers typically evaluate both the plasticity (generalization) and the stability (consistency) of a model. Various protocols and methodologies for CL method evaluation have been devised throughout the years (e.g. Kemker et al., 2017; Serra et al., 2018; Sodhani et al., 2020; Pfülb and Gepperth, 2019; Chaudhry et al., 2019a); however, many of them suffer from deficiencies such as small datasets or a limited number of evaluated methods. Furthermore, as observed by Chaudhry et al. (2019a), the prevalent learning protocol followed in many CL research efforts stems from supervised learning, where many passes over the data of each task are performed. The authors argued that in a CL setting this approach is flawed: the more passes the model makes over the data of a given task, the more it forgets previously acquired knowledge.
In a similar vein, it has been contended that NLP models are predominantly evaluated with respect to their performance on a held-out test set, which is measured after the training is done for a given task. Therefore, Chaudhry et al. (2019a) introduced a learning protocol that, according to the authors, is more suitable for CL as it satisfies the constraint of a single pass over the data, which is motivated by the need for a faster learning process. Another recent approach, proposed by d'Autume et al. (2019), relies on a sequentially presented stream of examples derived from various datasets in one pass, without revealing dataset boundaries or identities to the model.

Benchmarks and Metrics
For years the NLP domain has lagged behind computer vision and other ML areas (e.g. Kirkpatrick et al., 2016; Zenke et al., 2017; Lomonaco and Maltoni, 2017; Rebuffi et al., 2017b) when it comes to the availability of standard CL-related benchmarks (Greco et al., 2019; Wang et al., 2019b). However, the situation has slightly improved recently with the introduction of a handful of multi-task benchmarks. In particular, the GLUE (Greco et al., 2019) and SUPERGLUE (Wang et al., 2019a) benchmarks track performance on eleven and ten language understanding tasks respectively, using existing NLP datasets. Along the same line, McCann et al. (2018) presented the Natural Language Decathlon (DECANLP) benchmark for evaluating the performance of models across ten NLP tasks. The decathlon score (decaScore) is an additive combination of various metrics specific to each of the ten selected tasks (i.e. the normalized F1 metric, BLEU and ROUGE scores, among others). Similar to DECANLP, the recently proposed Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark (Hu et al., 2020) also uses a diverse set of NLP tasks and task-specific measures to evaluate the performance of cross-lingual transfer learning. XTREME consists of nine tasks derived from four different categories and uses zero-shot cross-lingual transfer, with English as the source language for evaluation.
In principle, CL models should not only be evaluated against traditional performance metrics (such as model accuracy); it is also important to measure their ability to reuse prior knowledge. Similarly, evaluating how quickly models learn new tasks is essential in the CL setting. Although CF is crucial to address in CL systems, there is no consensus on how to measure it (Pfülb and Gepperth, 2019). Arguably, the two most popular and general metrics addressing this issue are Average Accuracy and the Forgetting Measure (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2018, 2019b). The former evaluates the average accuracy, while the latter measures forgetting after the model has been trained continually on all the given task mini-batches. Concretely, we aim to measure test performance for each of the $T$ tasks, letting $a_{j,i}$ be the performance of the model on the held-out test set of task $t_i$ after the model is trained on task $t_j$. Later, Chaudhry et al. (2019a) proposed a third metric, Learning Curve Area (LCA), that measures how quickly a model is able to learn. The three metrics are defined as follows:

• Average Accuracy: $A_T \in [0, 1]$ (Chaudhry et al., 2018). The average accuracy after incremental training from the first task up to task $T$ is given as:

$$A_T = \frac{1}{T} \sum_{i=1}^{T} a_{T,i}$$

• Forgetting Measure: $F_T \in [-1, 1]$ (Chaudhry et al., 2018). The average forgetting measure after incremental training from the first task up to task $T$ is defined as:

$$F_T = \frac{1}{T-1} \sum_{i=1}^{T-1} f_i^T$$

where $f_i^j$ is the forgetting on task $t_i$ after the model is trained up to task $t_j$, computed as:

$$f_i^j = \max_{l \in \{1, \dots, j-1\}} a_{l,i} - a_{j,i}$$

• Learning Curve Area: $\mathrm{LCA} \in [0, 1]$ (Chaudhry et al., 2019a). LCA is the area under the $Z_b$ curve, which captures the learner's performance on all $T$ tasks. $Z_b$ is the average accuracy over the $T$ tasks after observing the $b$-th mini-batch, and the area is defined as:

$$\mathrm{LCA}_\beta = \frac{1}{\beta + 1} \sum_{b=0}^{\beta} Z_b$$

where $b$ denotes the mini-batch number.
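Given the accuracy matrix $a_{j,i}$ above, Average Accuracy and the Forgetting Measure can be computed directly; the small numpy sketch below follows the definitions from Chaudhry et al. (2018) (LCA additionally requires per-mini-batch accuracies and is omitted here).

```python
import numpy as np

def average_accuracy(a):
    """a[j, i]: accuracy on task i after training up to task j (zero-indexed).
    A_T is the mean accuracy over all tasks after training on the last task."""
    T = a.shape[0]
    return a[T - 1, :T].mean()

def forgetting_measure(a):
    """F_T averages, over the first T-1 tasks, the gap between the best accuracy
    ever reached on a task and its accuracy after training on the last task."""
    T = a.shape[0]
    gaps = [a[: T - 1, i].max() - a[T - 1, i] for i in range(T - 1)]
    return float(np.mean(gaps))

# Example with T = 3 tasks (entry a[j, i] is only meaningful for i <= j):
a = np.array([[0.80, 0.00, 0.00],
              [0.75, 0.85, 0.00],
              [0.70, 0.80, 0.90]])
print(average_accuracy(a))    # (0.70 + 0.80 + 0.90) / 3 = 0.80
print(forgetting_measure(a))  # ((0.80 - 0.70) + (0.85 - 0.80)) / 2 = 0.075
```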
Similarly, Kemker et al. (2017) proposed three metrics for evaluating CF, i.e. metrics that evaluate the ability of a model to retain previously acquired knowledge and how well it acquires new information. In the NLP domain, a metric based on online (prequential) encoding (Blier and Ollivier, 2018) has been introduced to measure how quickly an existing model adapts to a new task. Specifically, the metric, called online codelength $\ell(D)$, accumulates the code length needed to encode the labels of a dataset $D$ as the model is trained on increasingly large subsets of the data, where $|y|$ denotes the number of possible labels (classes) in the dataset $D$ and $\theta_{D_i}$ stands for the model parameters trained on a particular subset of the dataset. Similar to LCA (Chaudhry et al., 2019a), online codelength is also related to the area under the learning curve.
While most CL methods consider settings without human-in-the-loop, some allow a human domain expert to provide the model with empirical knowledge about the task at hand. For instance, Prokopalo et al. (2020) introduced the evaluation of human assisted learning across time by leveraging user-defined model adaptation policies for NLP and speech tasks, such as machine translation and speaker diarization.

Evaluation Datasets
The most widely adopted CL benchmark datasets are image corpora such as PERMUTED MNIST (Kirkpatrick et al., 2016), CUB-200 (Welinder et al., 2010; Wah et al., 2011), or split CIFAR-10/100 (Lopez-Paz and Ranzato, 2017). Benchmark corpora have also been proposed for objects (CORE50, Lomonaco and Maltoni, 2017) and sound (AUDIOSET, Gemmeke et al., 2017). However, none of the well-established standard datasets used in the CL field is related to NLP. Therefore, due to the scarcity of NLP-curated datasets, some of the above-mentioned datasets have also been utilized for NLP scenarios.

Table 3: NLP-specific CL datasets (name, details, and related works).

XCOPA - Cross-lingual Choice of Plausible Alternatives: a typologically diverse multilingual dataset for causal commonsense reasoning, created by translating and reannotating the English COPA dataset; covers 11 languages from distinct families (Edoardo M. Ponti and Korhonen, 2020).

WEBTEXT: a dataset of millions of webpages suitable for learning language models without supervision; 45 million links scraped from Reddit, 40 GB dataset (Radford et al., 2019).

C4 - Colossal Clean Crawled Corpus: a dataset constructed from Common Crawl's web crawl corpus that serves as a source of unlabeled text data; 17 GB dataset (Raffel et al., 2020).

LIFELONG FEWREL - Lifelong Few-Shot Relation Classification Dataset: sentence-relation pairs derived from Wikipedia, distributed over 10 disjoint clusters representing different tasks (Wang et al., 2019b; Obamuyide and Vlachos, 2019).

LIFELONG SIMPLE QUESTIONS: single-relation questions divided into 20 disjoint clusters (i.e. resulting in 20 tasks) (Wang et al., 2019b).

Similarly, in the absence of NLP benchmark corpora, the majority of papers use adapted versions of popular NLP datasets. One such example is domain adaptation, where researchers frequently use different standard NLP corpora as in-domain and out-of-domain datasets. Farquhar and Gal (2018) stressed that prior research often presented incomplete evaluations and utilized dedicated CL datasets or environments that cannot be considered general, one-size-fits-all benchmarks. As the scholars argued, such benchmarks are useful only in narrow cases, limited to their respective subdomains. The number of NLP-specific CL datasets is still very limited, even though there have lately been a few notable attempts to create such corpora (summarized in Table 3).
Importantly, as Parisi et al. (2019) contended, as the complexity of the evaluation dataset increases, the overall performance of the model often decreases. The scholars attributed this to the fact that the majority of methods are tailored to work only for less complex scenarios, as they are not robust and flexible enough to alleviate CF in less controlled experimental conditions. In a similar vein, it has been stressed that the recent tendency to construct datasets that are easy to solve without requiring generalization or abstraction is an impediment toward general linguistic intelligence. Hence, we advocate further research on establishing challenging evaluation datasets and metrics for CL in NLP that make it possible to capture how well models generalize to new, unseen tasks.

Continual Learning in NLP Tasks
Natural language processing covers a diverse assortment of tasks. Despite the variety of NLP tasks and methods, there are some common themes. On the syntactic level, sentences in any domain or task follow the same syntax rules. Furthermore, regardless of task or domain, there are words and phrases that have almost the same meaning. Therefore, sharing syntactic and semantic knowledge across NLP tasks should be feasible. In this section, we explore how CL methods are used in the most popular NLP tasks.

Word and Sentence Representations
Distributed word vector representations underlie many NLP applications. Although high-quality word embeddings can considerably boost performance in downstream tasks, they cannot be considered a silver bullet, as they suffer from inherent limitations. Typically, word embeddings are trained on large general-purpose corpora, as the size of in-domain corpora is in most cases not sufficient. This comes at a cost, since embeddings trained on general-purpose corpora are often not suitable for domain-specific downstream tasks, and as a result, the overall performance suffers. In a CL setting, this also implies that vocabulary may change with respect to two dimensions: time and domain. There is an established consensus that the meaning of words changes over time due to complicated linguistic and social processes (e.g. Kutuzov et al., 2018; Shoemark et al., 2019). Hence, it is important to detect and accommodate shifts in meaning and data distribution, while protecting previously learned representations from CF.
In general, a CL scenario for word and sentence embeddings has not received much attention so far, except for a handful of works. To tackle this problem, for example, a meta-learning method has been proposed that leverages knowledge from past multi-domain corpora to generate improved embeddings for a new domain. A sentence encoder updated over time using matrix conceptors has also been introduced to continually learn corpus-dependent features. Importantly, Wang et al. (2019b) argued that when an NN model is trained on a new task, the embedding vector space undergoes undesired changes, and as a result the embeddings become unsuitable for previous tasks. To mitigate the problem of embedding space distortion, they proposed to align sentence embeddings using anchoring (a simplified sketch of this idea follows below). Recently, a research line at the intersection of word embeddings and language modeling, termed contextual embeddings, has emerged and demonstrates state-of-the-art results across numerous NLP tasks. In the next section, we will look closely at how this approach to learning embeddings is geared towards CL.
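As a rough illustration of the alignment idea mentioned above, the sketch below estimates a linear map from the new embedding space back to the old one using a set of stored anchor sentences and least squares; this is a simplification for illustration, not the exact anchoring procedure of Wang et al. (2019b).

```python
import numpy as np

def fit_alignment(anchors_new, anchors_old):
    """Least-squares linear map W such that anchors_new @ W ≈ anchors_old.

    anchors_new, anchors_old: (n_anchors, dim) embeddings of the same anchor
    sentences produced by the model after and before training on the new task.
    """
    W, *_ = np.linalg.lstsq(anchors_new, anchors_old, rcond=None)
    return W

def align(embeddings_new, W):
    """Project embeddings from the new space into the (approximate) old space."""
    return embeddings_new @ W

# Usage sketch: after training on a new task, re-embed the stored anchor sentences,
# fit W, and apply `align` so that representations remain usable for earlier tasks.
```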

Language Modeling
Contextual representations learned via unsupervised pre-trained language models (LMs), such as ULMFIT (Howard and Ruder, 2018), ELMO (Peters et al., 2018) or BERT (Devlin et al., 2019), make it possible to attain strong performance on a wide range of supervised NLP tasks. Specifically, thanks to inductive transfer, complex task-specific architectures have become less necessary. In consequence, the process of training many neural-based NLP systems boils down to two steps: (1) an NN-based language model is trained on large unlabeled text data; (2) this pre-trained language representation model is then reused in supervised downstream tasks. In principle, a large LM trained on a sufficiently large and diverse corpus is able to perform well across many datasets and domains (Radford et al., 2019). Furthermore, Gururangan et al. (2020) studied the effects of task adaptation as well as domain adaptation on the transferability of adapted pre-trained LMs across domains and tasks. The authors concluded that continued domain- and task-adaptive pre-training of LMs leads to performance gains in downstream NLP tasks.
Research interest in LM-based methods for CL in NLP has recently spiked. d'Autume et al. (2019) proposed an episodic memory-based model, MBPA++, that augments the encoder-decoder architecture. In order to learn continually, MBPA++ also performs sparse experience replay and local adaptation. The authors claimed that MBPA++ trains faster than A-GEM and takes no longer to train than a plain encoder-decoder model. While this is made possible by sparse experience replay, MBPA++ requires extra memory. In a similar vein, LAMOL (Sun et al., 2020) is based on language modeling. Unlike MBPA++, this method does not use any extra memory. LAMOL mitigates CF by means of pseudo-sample generation, as the model is trained on a mix of new-task data and pseudo samples of old tasks.

Question Answering
Question answering (QA) is considered a traditional NLP task, encompassing reading comprehension as well as information and relation extraction, among others. Conceptually, it is also closely related to conversational agents, such as chatbots and dialogue agents. Hence, not only in research settings but even more so in real-life scenarios (e.g. in conversation), it is immensely important for such systems to continuously extract and accumulate new knowledge (Chen and Liu, 2018). It is believed that a good dialogue agent should be able not only to interact with users by responding and asking questions, but also to learn from both kinds of interaction (Li et al., 2017b).
Although question answering is a stand-alone NLP task, some researchers (e.g. Kumar et al., 2016; McCann et al., 2018) proposed to view other NLP tasks through the lens of QA. In the context of CL, both d'Autume et al. (2019) and Sun et al. (2020) reported experimental results on a QA task. Research in dialogue agents that are able to continually learn is a very active area (e.g. Gasic et al., 2014; Su et al., 2016). Findings of Li et al. (2017a) indicate that a conversational model initially trained with fixed data can improve itself when it learns from interactions with humans in an on-line fashion. Interestingly, information and relation extraction were an early subject of research interest in CL. Information extraction is considered one of the first research areas to embrace the goal of never-ending learning. The semi-supervised NELL (Carlson et al., 2010) and unsupervised ALICE (Banko and Etzioni, 2007) systems, which iteratively extract information and build general domain knowledge, were at the forefront of CL in NLP. In the case of relation extraction, Wang et al. (2019b) introduced an embedding alignment method to enable CL for relation extraction models. Also, Obamuyide and Vlachos (2019) proposed to extend the work of Wang et al. (2019b) by framing lifelong relation extraction as a meta-learning problem, without the costly need for learning additional parameters.

Sentiment Analysis and Text Classification
Sentiment analysis (SA) is a popular choice for evaluating models on text classification. Arguably, the most pressing problem of current approaches to SA is their poor performance on new domains. Therefore, various domain adaptation methods have been proposed to improve the performance of SA models in the multi-domain scenario (consult Barnes et al., 2018). This issue is of utmost importance if one thinks about CL in sentiment classification. One of the earliest approaches to CL for SA was proposed by Chen et al. (2015). According to Chen and Liu (2018), CL can enable SA models to adapt to a large number of domains, since many new domains may already be covered by other past domains. Additionally, SA systems should become more accurate not only in classification but also in the discovery of word polarities in specific domains. Research on aspect-level opinions has been conducted as well. Shu et al. (2016) presented an unsupervised CL approach to classify opinion targets into entities and aspects. Furthermore, Shu et al. (2017) proposed a method based on conditional random fields to improve supervised aspect extraction across time. Experiments on text classification in the CL setting were performed by d'Autume et al. (2019) and Sun et al. (2020).

Machine Translation
The approach introduced by Luong and Manning (2015) laid the groundwork for subsequent studies in adapting neural machine translation (NMT). More specifically, the authors explored adaptation through continued training, where an NMT model trained using large corpora in one domain can later initialize a new NMT model for another domain. Their findings suggested that fine-tuning an NMT model trained on out-of-domain data using a small in-domain parallel corpus boosts performance. Likewise, other works (e.g. Freitag and Al-Onaizan, 2016; Chu et al., 2017) supported this claim. It has been pointed out that, due to over-fitting, some of the knowledge learned from the out-of-domain corpus is forgotten during fine-tuning. Hence, such domain adaptation techniques are prone to CF. NMT models experience difficulties when dealing with data from diverse domains, hence we argue that fine-tuning alone is not a sufficient solution. As dominant fine-tuning approaches require training and maintaining a separate model for each language or domain, light-weight task-specific adapter modules have been proposed to support parameter-efficient adaptation (a generic sketch of such an adapter is given below). We further argue that NMT - as opposed to phrase-based MT - rarely incorporates translation memory, and so it is inherently harder for NMT models to adapt using active or interactive learning. However, some attempts have been made (e.g. Peris and Casacuberta, 2018; Kaiser et al., 2017). In a similar vein, approaches incorporating bandit learners, which implicitly involve domain adaptation and on-line learning, have been proposed for MT systems (Sokolov et al., 2017). We share the viewpoint of Farajian et al. (2017) that NMT models ultimately should be able to adapt on-the-fly to on-line streams of diverse data (i.e. language, domain), and thus CL for NMT is essential.
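Adapter modules of the kind mentioned above are typically small bottleneck layers inserted into a frozen pre-trained network, with only the adapter parameters trained per language or domain. The PyTorch sketch below shows a generic bottleneck adapter; the sizes, placement, and normalization are illustrative and vary across published variants.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, plus a residual."""

    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, hidden_states):
        # The residual connection leaves the frozen base model's representation
        # largely intact when the adapter is close to identity.
        return hidden_states + self.up(torch.relu(self.down(self.norm(hidden_states))))

# Per-domain adaptation: freeze the base NMT model's parameters and train only the
# adapter parameters inserted for each new language pair or domain.
```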
While domain adaptation methods are widely used in the context of adapting NMT models, there have also been other attempts. Multilingual NMT (Dong et al., 2015; Firat et al., 2016; Ha et al., 2016; Johnson et al., 2017; Tan et al., 2019) can be framed as a multi-task learning problem. Multilingual NMT aims to use a single model to translate between multiple languages. Such systems are beneficial not only because they can handle multiple translation directions with a single model, and thus reduce training and maintenance costs, but also because joint training with high-resource languages can improve performance on low- and zero-resource languages (Arivazhagan et al., 2019). To eliminate the need for retraining the entire NMT system, Escolano et al. (2020) proposed a language-specific encoder-decoder architecture, where languages are mapped into a shared space, and either the encoder or the decoder is frozen when training on a new language.
Another related research line is curriculum learning. Most approaches concentrate on the selection of training samples according to their relevance to the translation task at hand. Different methods have been applied; for example, van der Wees et al. (2017) gradually refine the training data through dynamic data selection, while an on-line knowledge distillation approach has been proposed in which the best checkpoints are utilized as the teacher model. Lately, it has also been demonstrated that label prediction continual learning leveraging compositionality brings improvements in NMT.

Research Gaps and Future Directions
Although there is a growing number of task-specific approaches to CL in NLP, the body of research remains rather scant (Sun et al., 2020; Greco et al., 2019). While the majority of current NLP methods are task-specific, we believe task-agnostic approaches will become much more prevalent. Contemporary methods are limited along three dimensions: data, model architectures, and hardware.
In the real world, we often deal with partial-information data. Moreover, data is drawn from non-i.i.d. distributions and is subject to agents' interventions or environmental changes. Although attempts exist where a model learns from a stream of examples without knowing which dataset and distribution they originate from (e.g. d'Autume et al., 2019), such approaches are rare. Furthermore, learning from very few examples (e.g. via few-shot transfer learning) (Liu, 2020) is a major challenge for current models, and even more so out-of-distribution generalization (Bengio, 2019). In particular, sequence-to-sequence models, widely used in NLP, still struggle with systematic generalization (Lake and Baroni, 2018; Bahdanau et al., 2019), being unable to learn general rules and reason about high-level language concepts. Recent work on counterfactual language representations by Feder et al. (2020) is a promising step in that direction. The non-stationary learning problem can be alleviated by understanding and inferring causal relations from data (e.g. Osawa et al., 2019) - which remains an outstanding challenge (Pearl, 2009) - and by coming up with combinations that are unlikely to be present in training distributions (Bengio, 2019). Namely, language is compositional; hence, a model can dynamically manipulate semantic concepts and recombine them in novel situations (Lake et al., 2015), later supported by language-based abductive reasoning (e.g. Bhagavatula et al., 2020).
On a model level, a combination of CL with Bayesian principles should make it possible to better identify the importance of each parameter of an NN and aid parameter pruning and quantization (e.g. Ebrahimi et al., 2020; Golkar et al., 2019). We believe that not only should parameter informativeness be uncertainty-guided, but also the periodic replay of previous memories should be informed by causality. Furthermore, it is important to focus on reducing model capacity and computing requirements. Even though the over-parametrization of NNs is pervasive (Neyshabur et al., 2018), many current CL approaches promote the expansion of parameter space. We envision further research efforts focused on compression methods, such as knowledge distillation, low-rank factorization, and model pruning. Importantly, while CL allows for continuous adaptation, we believe that integrating CL with meta-learning has the potential to further unlock generalization capabilities in NLP. As meta-learning is able to learn efficiently from limited samples, such a CL model would adapt more quickly in dynamic environments (e.g. Ritter et al., 2018; Al-Shedivat et al., 2018). This would be especially beneficial for NLP systems operating in low-resource language and domain settings.
Finally, further research aiming at developing comprehensive benchmarks for CL in NLP would be an important addition to the existing studies. On the one hand, we observe a proliferation of multi-task benchmarks (e.g. McCann et al., 2018; Wang et al., 2019a). On the other hand, the CL paradigm and evaluation of CL systems call for more robust approaches than traditional performance metrics (e.g. accuracy, F1 measure) and multi-task evaluation schemes with clearly defined data and task boundaries.

Conclusion
In this work, we provided a comprehensive overview of existing research on CL in NLP. We presented a classification of ML paradigms and methods for alleviating CF, and discussed how they are applied to various NLP tasks. We also summarized available benchmark datasets and evaluation approaches. Finally, we identified research gaps and outlined directions for future research endeavors. We hope this survey sparks interest in CL in NLP and inspires researchers to view linguistic intelligence in a more holistic way.