A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios

Deep neural networks and huge language models are becoming omnipresent in natural language applications. As they are known for requiring large amounts of training data, there is a growing body of work to improve their performance in low-resource settings. Motivated by the recent fundamental changes towards neural models and the popular pre-train and fine-tune paradigm, we survey promising approaches for low-resource natural language processing. After a discussion of the different dimensions of data availability, we give a structured overview of methods that enable learning when training data is sparse. This includes mechanisms to create additional labeled data, such as data augmentation and distant supervision, as well as transfer learning settings that reduce the need for target supervision. A goal of our survey is to explain how these methods differ in their requirements, as understanding them is essential for choosing a technique suited for a specific low-resource setting. Further key aspects of this work are to highlight open issues and to outline promising directions for future research.


Introduction
Most of today's research in natural language processing (NLP) is concerned with the processing of 10 to 20 high-resource languages, with a special focus on English, and thus ignores thousands of languages with billions of speakers (Bender, 2019). The rise of data-hungry deep learning systems increased the performance of NLP for high-resource languages, but the shortage of large-scale data in less-resourced languages makes their processing a challenging problem. Therefore, Ruder (2019) named NLP for low-resource scenarios one of the four biggest open problems in NLP today.
The umbrella term low-resource covers a spectrum of scenarios with varying resource conditions. It includes work on threatened languages, such as Yongning Na, a Sino-Tibetan language with 40k speakers and only 3k written, unlabeled sentences (Adams et al., 2017). Other languages are widely spoken but seldom addressed by NLP research. More than 310 languages exist with at least one million L1-speakers each (Eberhard et al., 2019). Similarly, Wikipedia exists for 300 languages. Supporting technological developments for low-resource languages can help to increase the participation of the speakers' communities in a digital world. Note, however, that tackling low-resource settings is also crucial when dealing with popular NLP languages, as low-resource settings do not only concern languages but also non-standard domains and tasks, for which, even in English, only little training data is available. Thus, the term "language" in this paper also includes domain-specific language.
This importance of low-resource scenarios and the significant changes in NLP in the last years have led to active research on resource-lean settings and a wide variety of techniques have been proposed. They all share the motivation of overcoming the lack of labeled data by leveraging further sources. However, these works differ greatly on the sources they rely on, e.g., unlabeled data, manual heuristics or cross-lingual alignments. Understanding the requirements of these methods is essential for choosing a technique suited for a specific low-resource setting. Thus, one key goal of this survey is to highlight the underlying assumptions these techniques take regarding the low-resource setup.
In this work, we (1) give a broad and structured overview of current efforts on low-resource NLP, (2) analyse the different aspects of low-resource settings, (3) highlight the necessary resources and data assumptions as guidance for practitioners and (4) discuss open issues and promising future directions. Table 1 gives an overview of the surveyed techniques along with the requirements a practitioner needs to take into consideration.

Related Surveys
Recent surveys cover low-resource machine translation and unsupervised domain adaptation (Ramponi and Plank, 2020). Thus, we do not investigate these topics further in this paper, but focus instead on general methods for low-resource, supervised natural language processing, including data augmentation, distant supervision and transfer learning. This is also in contrast to the task-specific survey by Magueresse et al. (2020), who review highly influential work for several extraction tasks but provide only little overview of recent approaches. In Table 2 in the appendix, we list past surveys that discuss a specific method or low-resource language family for those readers who seek a more specialized follow-up.

Aspects of "Low-Resource"
To visualize the variety of resource-lean scenarios, Figure 1 shows exemplarily which NLP tasks were addressed in six different languages, from basic to higher-level tasks. While it is possible to build English NLP systems for many higher-level applications, low-resource languages lack the data foundation for this. Additionally, even if it is possible to create basic systems for tasks such as tokenization and named entity recognition for all tested low-resource languages, the training data is typically of lower quality compared to the English datasets, or very limited in size. Note that the figure does not incorporate data quality or system performance; more details on the selection of tasks and languages are given in Section B of the appendix. The figure also shows that the four American and African languages with between 1.5 and 60 million speakers have been addressed less than Estonian, with 1 million speakers. This indicates the unused potential to reach millions of speakers who currently have no access to higher-level NLP applications. Joshi et al. (2020) further study the availability of resources for languages around the world.

Dimensions of Resource Availability
Many techniques presented in the literature depend on certain assumptions about the low-resource scenario. These have to be adequately defined to evaluate their applicability for a specific setting and to avoid confusion when comparing different approaches. We propose to categorize low-resource settings along the following three dimensions:
(i) The availability of task-specific labels in the target language (or target domain) is the most prominent dimension in the context of supervised learning. Labels are usually created through manual annotation, which can be both time- and cost-intensive. Not having access to adequate experts to perform the annotation can also be an issue for some languages and domains.
(ii) The availability of unlabeled language-or domain-specific text is another factor, especially as most modern NLP approaches are based on some form of input embeddings trained on unlabeled texts.
(iii) Most of the ideas surveyed in the next sections assume the availability of auxiliary data which can have many forms. Transfer learning might leverage task-specific labels in a different language or domain. Distant supervision utilizes external sources of information, such as knowledge bases or gazetteers. Some approaches require other NLP tools in the target language like machine translation to generate training data. It is essential to consider this as results from one low-resource scenario might not be transferable to another one if the assumptions on the auxiliary data are broken.

How Low is Low-Resource?
On the dimension of task-specific labels, different thresholds are used to define low-resource. For part-of-speech (POS) tagging, Garrette and Baldridge (2013) limit the time of the annotators to 2 hours, resulting in up to 1-2k tokens. Kann et al. (2020) study languages that have less than 10k labeled tokens in the Universal Dependencies project (Nivre et al., 2020), and Loubser and Puttkammer (2020) report that most available datasets for South African languages have 40-60k labeled tokens. The threshold is also task-dependent, and more complex tasks might increase the resource requirements. For text generation, some works frame their setting as low-resource with 350k labeled training instances. Similar to the task, the resource requirements can also depend on the language. Plank et al. (2016) find that task performance varies between language families given the same amount of limited training data.
Given the lack of a hard threshold for low-resource settings, we see it as a spectrum of resource availability. We therefore also argue that more work should evaluate low-resource techniques across different levels of data availability for better comparison between approaches. For instance, Plank et al. (2016) and Melamud et al. (2019) show that for very small datasets non-neural methods outperform more modern approaches, while the latter obtain better performance in resource-lean scenarios once a few hundred labeled instances are available.

Generating Additional Labeled Data
Faced with the lack of task-specific labels, a variety of approaches have been developed to find alternative forms of labeled data as substitutes for gold-standard supervision. This is usually done through some form of expert insight in combination with automation. We group the ideas into two main categories: data augmentation, which uses existing task-specific instances to create more of them (§ 4.1), and distant supervision, which labels unlabeled data (§ 4.2), including cross-lingual projections (§ 4.3). Additional sections cover learning with noisy labels (§ 4.4) and involving non-experts (§ 4.5).

Data Augmentation
New instances can be obtained from existing ones by modifying the features with transformations that do not change the label. This is a popular approach in the computer vision community where, e.g., rotating an image does not change the classification of the image's content. For text, on the token level, this can be done by replacing words with equivalents, such as synonyms (Wei and Zou, 2019), entities of the same type (Raiman and Miller, 2017; Dai and Adel, 2020) or words that share the same morphology (Gulordava et al., 2018; Vania et al., 2019). Such replacements can also be guided by a language model that takes context into consideration (Fadaee et al., 2017; Kobayashi, 2018).
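As a minimal sketch of such token-level replacement (the tiny synonym lexicon and all function names here are purely illustrative and not taken from any cited work; in practice the lexicon would come from a resource like WordNet or embedding nearest neighbours):

```python
import random

# Illustrative toy synonym lexicon; a real system would use a much
# larger resource such as WordNet or embedding-based neighbours.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
}

def synonym_augment(tokens, p=0.3, rng=None):
    """Return a new token list where eligible words are randomly
    replaced by a synonym, leaving the sentence label unchanged."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        candidates = SYNONYMS.get(tok.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out

augmented = synonym_augment(["the", "quick", "fox", "is", "happy"], p=1.0)
```

With `p=1.0`, every word covered by the lexicon is replaced; lower values of `p` trade diversity against staying closer to the original data.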
To go beyond the token level and add more diversity to the augmented sentences, data augmentation can also be performed on sentence parts. Operations that (depending on the task) do not change the label include the manipulation of parts of the dependency tree (Şahin and Steedman, 2018; Vania et al., 2019; Dehouck and Gómez-Rodríguez, 2020), the simplification of sentences by removing sentence parts (Şahin and Steedman, 2018) and the inversion of the subject-object relation (Min et al., 2020). For whole sentences, paraphrasing through back-translation can be used. This is a popular approach in machine translation where target sentences are back-translated into source sentences (Bojar and Tamchyna, 2011; Hoang et al., 2018). An important aspect here is that errors on the source side/features do not seem to have a large negative effect on the generated target text the model needs to predict. Back-translation is therefore also used in other text generation tasks like abstractive summarization (Parida and Motlicek, 2019) and table-to-text generation (Ma et al., 2019), and it has been leveraged for text classification as well (Xie et al., 2020; Hegde and Patil, 2020). This setting assumes, however, the availability of a translation system. Alternatively, a language model can be used for augmenting text classification datasets (Kumar et al., 2020; Anaby-Tavor et al., 2020). It is trained conditioned on a label, i.e., on the subset of the task-specific data with this label, and then generates additional sentences that fit this label. Ding et al. (2020) extend this idea to token-level tasks.
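The back-translation idea for classification data can be sketched as a round trip through a pivot language; the two translation functions below are stand-ins (hard-coded lookups) for a real MT system, and all names are illustrative rather than from any cited work:

```python
def translate_en_to_de(sentence: str) -> str:
    # Stand-in for a real English-to-German MT system (e.g., an API
    # call or a seq2seq model); here a hard-coded toy lookup.
    lookup = {"the movie was great": "der Film war großartig"}
    return lookup[sentence]

def translate_de_to_en(sentence: str) -> str:
    # Stand-in for the reverse translation direction.
    lookup = {"der Film war großartig": "the film was fantastic"}
    return lookup[sentence]

def back_translate(example):
    """Paraphrase a labeled example by a round trip through a pivot
    language; the class label is assumed translation-invariant."""
    text, label = example
    paraphrase = translate_de_to_en(translate_en_to_de(text))
    return paraphrase, label

new_text, new_label = back_translate(("the movie was great", "positive"))
```

The round trip yields a surface-level paraphrase ("the film was fantastic") that keeps the original "positive" label, which is exactly the augmentation effect the cited classification work exploits.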
Adversarial methods are often used to find weaknesses in machine learning models (Jin et al., 2020; Garg and Ramakrishnan, 2020). They can, however, also be utilized to augment NLP datasets (Yasunaga et al., 2018; Morris et al., 2020). Instead of manually crafted transformation rules, these methods learn how to apply small perturbations to the input data that do not change the meaning of the text (according to a specific score). This approach is often applied on the level of vector representations. Grundkiewicz et al. (2019) reverse the augmentation setting by applying transformations that flip the (binary) label: they introduce errors into correct sentences to obtain new training data for a grammar correction task.
Open Issues: While data augmentation is ubiquitous in the computer vision community and while most of the above-presented approaches are task-independent, it has not found such widespread use in natural language processing. A reason might be that several of the approaches require an in-depth understanding of the language. There is not yet a unified framework that allows applying data augmentation across tasks and languages. Recently, Longpre et al. (2020) hypothesised that data augmentation provides the same benefits as pre-training in transformer models. However, we argue that data augmentation might be better suited to leverage the insights of linguistic or domain experts in low-resource settings when unlabeled data or hardware resources are limited.

Distant & Weak Supervision
In contrast to data augmentation, distant or weak supervision uses unlabeled text and keeps it unmodified. The corresponding labels are obtained through a (semi-)automatic process from an external source of information. For named entity recognition (NER), for example, a list of location names might be obtained from a dictionary, and tokens in the text that match entries in the list are automatically labeled as locations. Distant supervision was introduced by Mintz et al. (2009) for relation extraction (RE), with extensions to multi-instance (Riedel et al., 2010) and multi-label learning (Surdeanu et al., 2012). It is still a popular approach for information extraction tasks like NER and RE where the external information can be obtained from knowledge bases, gazetteers, dictionaries and other forms of structured knowledge sources (Luo et al., 2017; Hedderich and Klakow, 2018; Deng and Sun, 2019; Alt et al., 2019; Ye et al., 2019; Lange et al., 2019a; Nooralahzadeh et al., 2019; Le and Titov, 2019; Cao et al., 2019; Lison et al., 2020; Hedderich et al., 2021a). The automatic annotation ranges from simple string matching to complex pipelines including classifiers and manual steps (Norman et al., 2019). Distant supervision using information from external knowledge sources can be seen as a subset of the more general approach of labeling rules, which also encompasses other ideas like regex rules or simple programming functions (Ratner et al., 2017; Zheng et al., 2019; Adelani et al., 2020; Hedderich et al., 2020; Lison et al., 2020; Ren et al., 2020; Karamanolakis et al., 2021).
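The NER example above, in its simplest string-matching form, can be sketched as follows (a deliberately minimal, single-token matcher; real pipelines handle multi-token entities, ambiguity and casing, and all names here are illustrative):

```python
def distant_ner_labels(tokens, gazetteer, entity_type="LOC"):
    """Assign BIO tags by exact string matching against a gazetteer;
    all unmatched tokens receive the 'O' tag. Multi-token entities
    and ambiguous matches are ignored for simplicity."""
    tags = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in gazetteer:
            tags[i] = "B-" + entity_type
    return tags

# Toy gazetteer of location names, e.g., extracted from a knowledge base.
gazetteer = {"Berlin", "Paris"}
tags = distant_ner_labels(["Anna", "flew", "to", "Berlin", "."], gazetteer)
```

Note that "Anna" is silently tagged "O" because the gazetteer contains no person names; such false negatives are exactly the kind of label noise discussed in Section 4.4.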
While distant supervision is popular for information extraction tasks like NER and RE, it is less prevalent in other areas of NLP. Nevertheless, distant supervision has also been successfully employed for other tasks, e.g., to build a discourse-structure dataset using guidance from sentiment annotations. For topic classification, heuristics can be used in combination with inputs from other classifiers like NER (Bach et al., 2019) or from entity lists (Hedderich et al., 2020). For some classification tasks, the labels can be rephrased with simple rules into sentences. A pre-trained language model then judges which label sentence most likely follows the unlabeled input (Opitz, 2019). An unlabeled review, for instance, might be continued with "It was great/bad" for obtaining binary sentiment labels.
Open Issues: The popularity of distant supervision for NER and RE might be due to these tasks being particularly suited for it. There, auxiliary data like entity lists is readily available, and distant supervision often achieves reasonable results with simple surface-form rules. It is an open question whether a task needs to have specific properties to be suitable for this approach. The existing work on other tasks and the popularity in other fields like image classification (Xiao et al., 2015; Li et al., 2017; Lee et al., 2018; Mahajan et al., 2018) suggest, however, that distant supervision could be leveraged for more NLP tasks in the future. Distant supervision methods heavily rely on auxiliary data. In a low-resource setting, it might be difficult to obtain not only labeled data but also such auxiliary data. Kann et al. (2020) find a large gap between the performance on high-resource and low-resource languages for POS tagging, pointing to the lack of high-coverage and error-free dictionaries for weak supervision in low-resource languages. This emphasizes the need for evaluating such methods in a realistic setting instead of just simulating restricted access to labeled data in a high-resource language.
While distant supervision allows obtaining labeled data more quickly than manually annotating every instance of a dataset, it still requires human interaction to create automatic annotation techniques or to provide labeling rules. This time and effort could also be spent on annotating more gold-label data, either naively or through an active learning scheme. Unfortunately, distant supervision papers rarely provide information on how long the creation took, making it difficult to compare these approaches. Taking the human expert into focus connects this research direction with human-computer interaction and human-in-the-loop setups (Klie et al., 2018; Qian et al., 2020).

Cross-Lingual Annotation Projections
For cross-lingual projections, a task-specific classifier is trained in a high-resource language. Using parallel corpora, the unlabeled low-resource data is then aligned to its equivalent in the high-resource language, where labels can be obtained using the aforementioned classifier. These labels (on the high-resource text) can then be projected back to the text in the low-resource language based on the alignment between tokens in the parallel texts (Yarowsky et al., 2001). This approach can therefore be seen as a form of distant supervision specific to obtaining labeled data for low-resource languages. Cross-lingual projections have been applied in low-resource settings for tasks such as POS tagging and parsing (Täckström et al., 2013; Wisniewski et al., 2014; Plank and Agić, 2018; Eskander et al., 2020). Sources for parallel text can be the OPUS project (Tiedemann, 2012), Bible corpora (Mayer and Cysouw, 2014; Christodoulopoulos and Steedman, 2015) or the recent JW300 corpus (Agić and Vulić, 2019). Instead of using parallel corpora, existing high-resource labeled datasets can also be machine-translated into the low-resource language (Khalil et al., 2019; Zhang et al., 2019a; Fei et al., 2020; Amjad et al., 2020). Cross-lingual projections have even been used with English as a target language for detecting linguistic phenomena like modal sense and telicity that are easier to identify in a different language (Zhou et al., 2015; Marasović et al., 2016; Friedrich and Gateva, 2017).
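The projection step itself reduces to copying tags along word-alignment links; a minimal sketch (the tags, alignment pairs and function names are illustrative, and real projections must additionally handle one-to-many alignments and BIO consistency):

```python
def project_labels(src_tags, alignment, tgt_len):
    """Project token-level labels from a high-resource source sentence
    to an aligned low-resource target sentence; target tokens without
    an alignment link keep the default 'O' tag."""
    tgt_tags = ["O"] * tgt_len
    for src_i, tgt_i in alignment:
        tgt_tags[tgt_i] = src_tags[src_i]
    return tgt_tags

# Source-side tags from a high-resource tagger, and a word alignment
# given as (source_index, target_index) pairs from the parallel corpus.
src_tags = ["B-PER", "O", "O", "B-LOC"]
alignment = [(0, 0), (1, 2), (3, 3)]
tgt_tags = project_labels(src_tags, alignment, tgt_len=4)
```

Target token 1 has no alignment link and stays "O", illustrating how alignment gaps directly become (potentially noisy) labels on the low-resource side.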
Open issues: Cross-lingual projections set high requirements on the auxiliary data, needing both labels in a high-resource language and means to project them into a low-resource language. Especially the latter can be an issue, as machine translation by itself might be problematic for a specific low-resource language. A further limitation of the available parallel corpora is their restriction to certain domains, such as political proceedings or religious texts. Mayhew et al. (2017), Fang and Cohn (2017) and Karamanolakis et al. (2020) propose systems with fewer requirements based on word translations, bilingual dictionaries and task-specific seed words, respectively.

Learning with Noisy Labels
The above-presented methods allow obtaining labeled data quicker and cheaper than manual annotations. These labels tend, however, to contain more errors. Even though more training data is available, training directly on this noisily-labeled data can actually hurt the performance. Therefore, many recent approaches for distant supervision use a noise handling method to diminish the negative effects of distant supervision. We categorize these into two ideas: noise filtering and noise modeling.
Noise filtering methods remove instances from the training data that have a high probability of being incorrectly labeled. This often includes training a classifier to make the filtering decision. The filtering can remove instances completely from the training data, e.g., through a probability threshold (Jia et al., 2019), a binary classifier (Adel and Schütze, 2015; Onoe and Durrett, 2019; Huang and Du, 2019) or a reinforcement-based agent (Nooralahzadeh et al., 2019). Alternatively, a soft filtering might be applied that re-weights instances according to their probability of being correctly labeled (Le and Titov, 2019) or an attention measure (Hu et al., 2019).
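The threshold-based variant can be sketched in a few lines (the confidence scores here are hypothetical; in the cited work they would come, e.g., from a classifier trained on a small clean subset):

```python
def filter_noisy(examples, confidence_fn, threshold=0.8):
    """Keep only instances whose (noisy) label a reference model
    assigns high confidence to; a soft-filtering variant would
    instead down-weight low-confidence instances in the loss."""
    return [ex for ex in examples if confidence_fn(ex) >= threshold]

# Hypothetical per-instance confidence scores for the distant labels.
scores = {"s1": 0.95, "s2": 0.40, "s3": 0.85}
kept = filter_noisy(["s1", "s2", "s3"], scores.get)
```

Instance "s2" is dropped; the obvious trade-off is that a high threshold discards correctly labeled data along with the noise, which matters most in exactly the low-resource settings this survey targets.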
The noise in the labels can also be modeled. A common model is a confusion matrix estimating the relationship between clean and noisy labels (Fang and Cohn, 2016; Luo et al., 2017; Hedderich and Klakow, 2018; Paul et al., 2019; Lange et al., 2019a,c; Chen et al., 2019; Wang et al., 2019; Hedderich et al., 2021b). The classifier is no longer trained directly on the noisily-labeled data. Instead, a noise model is appended that shifts the noisy to the (unseen) clean label distribution. This can be interpreted as the original classifier being trained on a "cleaned" version of the noisy labels. In Ye et al. (2019), the prediction is shifted from the noisy to the clean distribution during testing. In Chen et al. (2020a), a group of reinforcement agents relabels noisy instances. Rehbein and Ruppenhofer (2017), Lison et al. (2020) and Ren et al. (2020) leverage several sources of distant supervision and learn how to combine them.
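The confusion-matrix idea amounts to multiplying the classifier's clean-label distribution by a noise channel before comparing it with the noisy labels; a minimal numerical sketch with a hypothetical two-class matrix (all numbers are illustrative):

```python
import numpy as np

# Estimated confusion matrix C, where C[i, j] is the probability of
# observing noisy label j given true label i. Here the (hypothetical)
# distant-supervision process flips class 0 to class 1 20% of the
# time and class 1 to class 0 10% of the time.
C = np.array([[0.8, 0.2],
              [0.1, 0.9]])

def noisy_distribution(clean_probs):
    """Map the classifier's clean-label distribution through the
    noise channel. During training, this output is compared to the
    noisy labels, so the base classifier is pushed towards
    predicting the (unseen) clean distribution."""
    return clean_probs @ C

clean = np.array([0.7, 0.3])
noisy = noisy_distribution(clean)
```

Because each row of `C` sums to one, the transformed vector is still a valid distribution; at test time the noise layer is removed and the base classifier's clean predictions are used directly.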
In NER, the noise in distantly supervised labels tends to consist of false negatives, i.e., mentions of entities that have been missed by the automatic method. Partial annotation learning (Nooralahzadeh et al., 2019; Cao et al., 2019) takes this into account explicitly. Related approaches learn latent variables (Jie et al., 2019), use constrained binary learning (Mayhew et al., 2019) or construct a loss assuming that only unlabeled positive instances exist (Peng et al., 2019).

Non-Expert Support
As an alternative to an automatic annotation process, annotations might also be provided by non-experts. Similar to distant supervision, this results in a trade-off between label quality and availability. For instance, Garrette and Baldridge (2013) obtain labeled data from non-native speakers and without quality control of the manual annotations. This can be taken even further by employing annotators who do not speak the low-resource language (Mayhew and Roth, 2018; Mayhew et al., 2019; Tsygankova et al., 2020).
Nekoto et al. (2020) take the opposite direction, integrating speakers of low-resource languages without formal training into the model development process in an approach of participatory research. This is part of recent work on how to strengthen low-resource language communities and grassroots approaches (Alnajjar et al., 2020; Adelani et al., 2021).

Transfer Learning
While distant supervision and data augmentation generate and extend task-specific training data, transfer learning reduces the need for labeled target data by transferring learned representations and models. A strong focus in recent work on transfer learning in NLP lies in the use of pre-trained language representations that are trained on unlabeled data, like BERT (Devlin et al., 2019). Thus, this section starts with an overview of these methods (§ 5.1) and then discusses how they can be utilized in low-resource scenarios, in particular regarding the usage in domain-specific (§ 5.2) or multilingual low-resource settings (§ 5.3).

Pre-Trained Language Representations
Feature vectors are the core input component of many neural network-based models for NLP tasks. They are numerical representations of words or sentences, as neural architectures do not allow the processing of strings and characters as such. Collobert et al. (2011) showed that training these models with a language-modeling objective on a large-scale corpus results in high-quality word representations that can be reused for other downstream tasks as well. Subword-based embeddings, such as fastText n-gram embeddings (Bojanowski et al., 2017) and byte-pair-encoding embeddings (Heinzerling and Strube, 2018), addressed out-of-vocabulary issues by splitting words into multiple subwords which in combination represent the original word. Embeddings leveraging such subword information have been shown to be beneficial for low-resource sequence labeling tasks, such as named entity recognition and typing, and to outperform word-level embeddings. Jungmaier et al. (2020) added smoothing to word2vec models to correct their bias towards rare words and achieved improvements in particular for low-resource settings. In addition, pre-trained embeddings were published for more than 270 languages for both embedding methods. This enabled the processing of texts in many languages, including multiple low-resource languages found in Wikipedia. More recently, a trend emerged of pre-training large embedding models with a language model objective to create context-aware word representations by predicting the next word or sentence. This includes pre-trained transformer models (Vaswani et al., 2017), such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019b). These methods are particularly helpful for low-resource languages for which large amounts of unlabeled data are available but task-specific labeled data is scarce (Cruz and Cheng, 2019).
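The subword mechanism behind fastText-style embeddings can be sketched in pure Python: a word is decomposed into boundary-marked character n-grams whose vectors are averaged, so even an out-of-vocabulary word receives a representation (the two-dimensional toy n-gram vectors below are hypothetical):

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams with '<' and '>' boundary markers, in the
    style of fastText's subword decomposition."""
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def subword_vector(word, ngram_vectors, dim=2):
    """Compose a word vector as the average of the vectors of its
    known n-grams; unknown words thus still get a representation
    as long as some of their n-grams were seen in training."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return [0.0] * dim
    return [sum(ngram_vectors[g][d] for g in grams) / len(grams)
            for d in range(dim)]

# Hypothetical pre-trained n-gram vectors (normally learned on a corpus).
ngram_vectors = {"<wh": [1.0, 0.0], "whe": [0.0, 1.0]}
vec = subword_vector("where", ngram_vectors)
```

This compositionality is what makes subword embeddings attractive for morphologically rich low-resource languages, where many inflected word forms never occur in the training corpus.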
Open Issues: While pre-trained language models achieve significant performance increases compared to standard word embeddings, it is still questionable whether these methods are suited for real-world low-resource scenarios. For example, all of these models come with large hardware requirements, in particular considering that transformer model sizes keep increasing to boost performance (Raffel et al., 2020). Therefore, these large-scale methods might not be suited for low-resource scenarios where hardware is also low-resource. Biljon et al. (2020) showed that low- to medium-depth transformers perform better than larger models for low-resource languages, and Schick and Schütze (2020) managed to train models with three orders of magnitude fewer parameters that perform on par with large-scale models like GPT-3 on few-shot tasks by reformulating the training task and using ensembling. Melamud et al. (2019) showed that simple bag-of-words approaches are better for text classification when there are only a few dozen training instances or less, while more complex transformer models require more training data. Bhattacharjee et al. (2020) found that cross-view training (Clark et al., 2018) leverages large amounts of unlabeled data better for task-specific applications than the general representations learned by BERT. Moreover, data quality for low-resource languages, even for unlabeled data, might not be comparable to that of high-resource languages. Alabi et al. (2020) found that word embeddings trained on larger amounts of unlabeled data from low-resource languages are not competitive with embeddings trained on smaller but curated data sources.

Domain-Specific Pre-Training
The language of a specialized domain can differ tremendously from what is considered the standard language; thus, many text domains are often less-resourced as well. For example, scientific articles can contain formulas and technical terms which are not observed in news articles. However, the majority of recent language models are pre-trained on general-domain data, such as texts from the news or web domain, which can lead to a so-called "domain gap" when they are applied to a different domain.
One solution to overcome this gap is the adaptation to the target domain by fine-tuning the language model. Gururangan et al. (2020) showed that continuing the training of a model with additional domain-adaptive and task-adaptive pre-training on unlabeled data leads to performance gains in both high- and low-resource settings for numerous English domains and tasks. This is also reflected in the growing number of domain-adapted language models. In the materials science domain, for example, a general-domain BERT model performs well, but the domain-adapted SciBERT performs best. In- and out-of-domain data has also been used to pre-train a domain-specific model and adapt it to low-resource domains. Aharoni and Goldberg (2020) found domain-specific clusters in pre-trained language models and showed how these could be exploited for data selection in domain-sensitive training.
Powerful representations can be achieved by combining high-resource embeddings from the general domain with low-resource embeddings from the target domain (Akbik et al., 2018; Lange et al., 2019b). Kiela et al. (2018) showed that embeddings from different domains can be combined using attention-based meta-embeddings, which create a weighted sum of all embeddings. Lange et al. (2020b) further improved on this by aligning embeddings trained on diverse domains using an adversarial discriminator that distinguishes between the embedding spaces to generate domain-invariant representations.
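The attention-based combination reduces to a softmax-weighted sum of the per-domain vectors; a minimal sketch in which the attention scores are given directly (in the cited work they come from a learned attention function, and the two-dimensional vectors are illustrative):

```python
import math

def attention_meta_embedding(embeddings, scores):
    """Combine word vectors from different domains into a single
    vector via a softmax-weighted sum. `scores` plays the role of
    the learned attention logits; here it is supplied directly."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(embeddings[0])
    combined = [sum(w * emb[d] for w, emb in zip(weights, embeddings))
                for d in range(dim)]
    return combined, weights

# Toy vectors for the same word from a general-domain and a
# target-domain embedding space (equal dimensionality is assumed).
general = [1.0, 0.0]
domain = [0.0, 1.0]
vec, weights = attention_meta_embedding([general, domain], scores=[0.0, 0.0])
```

With equal logits, both spaces contribute equally; during training the attention would learn to up-weight the embedding space that is more reliable for the current word.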

Multilingual Language Models
Analogously to low-resource domains, low-resource languages can also benefit from labeled resources available in other high-resource languages. This usually requires the training of multilingual language representations, either by combining monolingual representations (Lange et al., 2020a) or by training a single model for many languages, such as multilingual BERT (Devlin et al., 2019) or XLM-RoBERTa (Conneau et al., 2020). These models are trained on unlabeled, monolingual corpora from different languages and can be used in cross- and multilingual settings, as many languages are seen during pre-training.
In cross-lingual zero-shot learning, no task-specific labeled data is available in the low-resource target language. Instead, labeled data from a high-resource language is leveraged. A multilingual model can be trained on the target task in a high-resource language and afterwards applied to the unseen target languages, e.g., for named entity recognition. Adding even a minimal amount of target-task and target-language data (in the range of 10 to 100 labeled sentences) has been shown to result in a significant boost in performance for classification in low-resource languages.
The transfer between two languages can be improved by creating a common multilingual embedding space of multiple languages. This is useful for standard word embeddings as well as pre-trained language models, for example, by aligning the languages inside a single multilingual model. This alignment is typically done by computing a mapping between two different embedding spaces such that the words in both embeddings share similar feature vectors after the mapping (Mikolov et al., 2013; Joulin et al., 2018). This makes it possible to use different embeddings inside the same model and helps when two languages do not share the same space inside a single model (Cao et al., 2020). For example, Zhang et al. (2019b) used bilingual representations by creating cross-lingual word embeddings from a small set of parallel sentences between the high-resource language English and three low-resource African languages, Swahili, Tagalog and Somali, to improve document retrieval performance for the African languages.
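One standard instantiation of such a mapping is the orthogonal Procrustes solution: given paired vectors from a small bilingual dictionary, the rotation minimising the distance between the mapped source space and the target space has a closed form via SVD. A minimal sketch on toy data (the rotation setup is purely illustrative):

```python
import numpy as np

def procrustes_mapping(X, Y):
    """Return the orthogonal matrix W minimising ||X @ W - Y||_F,
    i.e., the Procrustes solution W = U @ Vt for U, S, Vt = svd(X.T @ Y),
    as used for aligning bilingual embedding spaces."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy "bilingual dictionary": the target-language vectors are an exact
# rotation of the source-language vectors, which the mapping recovers.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
Y = X @ R
W = procrustes_mapping(X, Y)
```

Restricting `W` to be orthogonal preserves distances and angles within the source space, which is why this constraint is commonly preferred over an unconstrained linear mapping.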
Open Issues: While these multilingual models are a tremendous step towards enabling NLP in many languages, possible claims that these are universal language models do not hold. For example, mBERT covers 104 languages and XLM-R 100, which is a third of all languages in Wikipedia, as outlined earlier. Further, Wu and Dredze (2020) showed that, in particular, low-resource languages are not well-represented in mBERT. Figure 2 shows which language families with at least 1 million speakers are covered by mBERT and XLM-RoBERTa. In particular, African and American languages are not well-represented within the transformer models, even though millions of people speak these languages. This can be problematic, as languages from more distant language families are less suited for transfer learning, as Lauscher et al. (2020) show.

Ideas From Low-Resource Machine Learning in Non-NLP Communities
Training on a limited amount of data is not unique to natural language processing. Other areas, like general machine learning and computer vision, can be a useful source of insights and new ideas. We already presented data augmentation and pre-training. Another example is Meta-Learning (Finn et al., 2017), which is based on multi-task learning. Given a set of auxiliary high-resource tasks and a low-resource target task, meta-learning trains a model to decide how to use the auxiliary tasks in the most beneficial way for the target task. For NLP, this approach has been evaluated on tasks such as sentiment analysis (Yu et al., 2018) and user intent classification.

Discussion and Conclusion
In this survey, we gave a structured overview of recent work in the field of low-resource natural language processing. Beyond the method-specific open issues presented in the previous sections, we see the comparison between approaches as an important point for future work. Guidelines are necessary to support practitioners in choosing the right tool for their task. In this work, we highlighted that it is essential to analyze resource-lean scenarios across the different dimensions of data availability.
This can reveal which techniques are expected to be applicable in a specific low-resource setting. More theoretical and experimental work is necessary to understand how approaches compare to each other and on which factors their effectiveness depends. Longpre et al. (2020), for instance, hypothesized that data augmentation and pre-trained language models yield similar kinds of benefits. Often, however, new techniques are only compared to similar methods and not across the range of low-resource approaches. While a fair comparison is non-trivial given the different requirements on auxiliary data, we see this endeavour as essential to improve the field of low-resource learning in the future. This could also help to understand where the different approaches complement each other and how they can be combined effectively.