Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set

Development sets are impractical to obtain for real low-resource languages, since using all available data for training is often more effective. However, development sets are widely used in research papers that purport to deal with low-resource natural language processing (NLP). Here, we aim to answer the following questions: Does using a development set for early stopping in the low-resource setting influence results as compared to a more realistic alternative, where the number of training epochs is tuned on development languages? And does it lead to overestimation or underestimation of performance? We repeat multiple experiments from recent work on neural models for low-resource NLP and compare results for models obtained by training with and without development sets. On average over languages, absolute accuracy differs by up to 1.4%. However, for some languages and tasks, differences are as big as 18.0% accuracy. Our results highlight the importance of realistic experimental setups in the publication of low-resource NLP research results.


Introduction
Parametric machine learning models are frequently trained by minimizing the loss on the training set T, for model parameters θ and a predefined loss function l, where f(x; θ) denotes the model's prediction for input x:

    L(θ) = Σ_{(x,y) ∈ T} l(f(x; θ), y)    (1)

Gradient-based optimizers minimize this loss L(θ) by updating θ in the direction of the negative gradient −∇L(θ). A low loss characterizes a model which makes accurate predictions for examples in T. However, since T is finite, overfitting the training set might lead to poor generalization performance. One way to avoid fitting Equation 1 too closely is early stopping: a separate development or validation set D is used to end training as soon as the loss on the development set L_D(θ) starts increasing or model performance on D starts decreasing. The best set of parameters θ is used in the final model. This works well when large amounts of data are available to create training, development, and test splits. Recently, however, with the success of pretraining (Peters et al., 2018; Devlin et al., 2019) and multi-task learning (Caruana, 1997; Ruder, 2017; Wang et al., 2019) approaches, neural models are showing promising results on various natural language processing (NLP) tasks also in low-resource or few-shot settings (Johnson et al., 2017; Kann et al., 2017; Yu et al., 2018). Often, the high-resource experimental setup and training procedure are kept unchanged, and the size of the original training set is reduced to simulate limited data. This leads to settings where validation examples may outnumber training examples. Table 1 shows such cases for the tasks of historical text normalization (Bollmann et al., 2018), morphological segmentation, morphological inflection (Makarov and Clematide, 2018; Sharma et al., 2018), argument component identification (Schulz et al., 2018), and transliteration (Upadhyay et al., 2018).

Table 1:
                                  # train    # dev     ES
  Bollmann et al. (2018)          5k         12k-46k   Yes
                                  400-700    100-200   Yes
  Makarov and Clematide (2018)    100        1k        Yes
  Sharma et al. (2018)            100        100       Yes
  Schulz et al. (2018)            1k-21k     9k        N/A
  Upadhyay et al. (2018)          500        1k        Yes
ES = early stopping on the development set. Experiments from papers in bold will be revisited here.
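The early stopping procedure described above can be sketched as follows. This is a minimal illustration, not code from any of the revisited papers; `train_one_epoch`, `dev_accuracy`, and the dict-based parameter set are hypothetical stand-ins for a task-specific trainer, evaluator, and model.

```python
def train_with_early_stopping(model, train_data, dev_data,
                              train_one_epoch, dev_accuracy,
                              max_epochs=50):
    """Train for max_epochs and keep the parameters that score best
    on the development set."""
    best_acc, best_state, best_epoch = float("-inf"), None, 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model, train_data)
        acc = dev_accuracy(model, dev_data)
        if acc > best_acc:
            # Snapshot the parameters; in a real system this would be
            # a checkpoint on disk rather than a dict copy.
            best_acc, best_state, best_epoch = acc, dict(model), epoch
    return best_state, best_epoch
```

Note that this regime needs a development set solely to pick the stopping point, which is exactly the assumption questioned in this paper.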
However, in a real-world setting with limited resources, it is unlikely that such a development set would be available for early stopping, since it would be more effective to use at least part of it for training instead. Here, we investigate how previous results relate to those obtained in a setting that does not assume a development set. Instead of early stopping, we use data from the same task in other languages, the development languages, to decide on the number of training epochs. We are interested in two questions: Does recent work in low-resource NLP overestimate model performance by using an unrealistically precise performance signal to stop training? Or, inversely, is model performance underestimated by overfitting the finite development set?
Our experiments on historical text normalization, morphological inflection, and transliteration, featuring a variety of languages, show that performance does differ between runs with and without early stopping on the development set; whether using the development set leads to better or worse results depends on the task and language. Differences of up to 18% absolute accuracy highlight that a realistic evaluation of models for low-resource NLP is crucial for estimating real-world performance.

Related Work
Realistic evaluation of machine learning. Oliver et al. (2018) investigate how to evaluate semi-supervised training algorithms in a realistic way; they differ from us in that they focus exclusively on semi-supervised learning (SSL) algorithms and do not consider NLP explicitly. However, in line with our conclusion, they report that recent practices for evaluating SSL techniques do not address the question of the algorithms' real-world applicability in a satisfying way. In NLP, several earlier works have explicitly investigated real-world low-resource settings as opposed to artificial proxy settings, e.g., for part-of-speech tagging (Garrette et al., 2013) or machine translation (Irvine and Callison-Burch, 2013). While those mostly focus on truly data-poor languages, we explicitly investigate the effect of the common practice of assuming a relatively large development set for early stopping in the low-resource setting.
Low-resource settings in NLP. Research in the area of neural methods for low-resource NLP has gained popularity in recent years, with a dedicated workshop on the topic appearing in 2018 (Haffari et al., 2018). High-level keywords under which other work on neural networks for data-poor scenarios in NLP can be found are domain adaptation (Daume III, 2007), multi-task learning (Caruana, 1997; Ruder, 2017), few-shot/zero-shot/one-shot learning (Johnson et al., 2017; Finn et al., 2017), transfer learning (Yarowsky et al., 2001), semi-supervised training (Zhu, 2005), and pretraining (Erhan et al., 2010).
While options for early stopping without a development set exist (Mahsereci et al., 2017), they require hyperparameter tuning, which might not be feasible without a development set, and, most importantly, they are not commonly used in low-resource NLP research. Here, we investigate whether current practices might lead to unrealistic results.

Experimental Design
We compare early stopping using development set accuracy (DevSet) with an alternative strategy where the number of training epochs is a hyperparameter tuned on development languages (DevLang). We perform two rounds of training:

• Stopping point selection phase. Models for the development languages are trained with the original early stopping strategy from previous work. The number of training epochs for the target languages is then calculated as the average over the best epochs for all development languages. All development languages also function as target languages; to make this possible, for development languages, we compute the average over the other development languages only.

• Main training phase. We train models for all languages, keeping both the model resulting from the original early stopping strategy (DevSet) and that from the epoch computed in the stopping point selection phase (DevLang).

The stopping point selection phase exclusively serves the purpose of tuning the number of epochs for the DevLang training setup; models obtained in this phase are discarded. The development sets we use in our experiments are those from the original papers, without alterations.
Since both final models obtained in the main training phase result from the same training run, our experimental design enables a direct comparison between the models from both setups.
Example. Assume that, in the stopping point selection phase, we obtain the best development set results for a given task in development languages L1 and L2 after epochs 14 and 18, respectively. In the main training phase, we then train a model for the same task in target language L3 with the original early stopping strategy, but additionally keep the model from epoch 16, the average of 14 and 18. If the best development result for language L3 is obtained after epoch 19, we compare the model from epoch 19 (DevSet) to that from epoch 16 (DevLang).
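The epoch selection in this example can be written down directly. The sketch below is purely illustrative, using the hypothetical language names from the example; it also covers the leave-one-out case where the target language is itself a development language.

```python
def devlang_epoch(best_epochs, target_lang):
    """Number of training epochs for a target language under DevLang:
    the (rounded) average of the best epochs of all development
    languages, excluding the target itself if it is one of them."""
    others = [ep for lang, ep in best_epochs.items() if lang != target_lang]
    return round(sum(others) / len(others))

best_epochs = {"L1": 14, "L2": 18}  # best epochs from the selection phase
devlang_epoch(best_epochs, "L3")    # -> 16, as in the example above
devlang_epoch(best_epochs, "L1")    # -> 18, leave-one-out for a dev language
```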

Tasks, Data, Models
For our study, we select previously published experiments which fulfill the following criteria: (1) datasets exist for at least four languages, and all training sets are of equal size; (2) the original authors use early stopping with a development set; (3) the authors explicitly investigate low-resource settings; and (4) the original code is publicly available, or a standard model is used. Since our main goal is to confirm the effect of the development set and not to compare between tasks, we further limit this study to sequence-to-sequence tasks.

Historical Text Normalization (NORM)
Task. The goal of historical text normalization is to convert old texts into a form that conforms with contemporary spelling conventions. Historical text normalization is a specific case of the general task of text normalization, which additionally encompasses, e.g., correction of spelling mistakes or normalization of social media text.
Data. We experiment on the ten datasets from Bollmann et al. (2018), which represent eight different languages: German (two datasets; Bollmann et al., 2017; Odebrecht et al., 2017); English, Hungarian, Icelandic, and Swedish (Pettersson, 2016); Slovene (two datasets; Ljubešić et al., 2016); and Spanish and Portuguese (Vaamonde, 2015). We treat the two datasets for German and Slovene as different languages. All languages serve both as development languages for all other languages and as target languages.
Model. Our model for this task is an LSTM (Hochreiter and Schmidhuber, 1997) encoder-decoder model with attention (Bahdanau et al., 2015). Both encoder and decoder have a single hidden layer. We use the default model in OpenNMT (Klein et al., 2017) as our implementation and employ the hyperparameters from Bollmann et al. (2018). In the original paper, early stopping is done by training for 50 epochs; the model with the best development accuracy is then applied to the test set.

Morphological Inflection (MORPH)
Task. Morphological inflection consists of mapping the canonical form of a word, the lemma, to an indicated inflected form. This task becomes very complex for morphologically rich languages, where a single lemma can have hundreds or thousands of inflected forms. Recently, morphological inflection has frequently been cast as a sequence-to-sequence task, mapping the characters of the input word, together with the morphological features specifying the target, to the characters of the corresponding inflected form (Cotterell et al., 2018).
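As a concrete illustration of this sequence-to-sequence framing, consider the German lemma "geben" ('to give') inflected for second person singular present. The encoding below is schematic: the exact feature inventory and input layout vary between systems, and this is not the input format of any particular model.

```python
# Lemma characters plus morphological feature tags form the source
# sequence; the characters of the inflected form are the target.
source = list("geben") + ["V", "IND", "PRS", "2", "SG"]
target = list("gibst")  # geben -> gibst (2nd person singular present)
```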
Data. We experiment on the datasets released for a 2018 shared task (Cotterell et al., 2018), which cover 103 languages and feature an explicit low-resource setting. We randomly choose ten development languages: Armenian, Basque, Galician, Georgian, Greenlandic, Icelandic, Kabardian, Kannada, Latin, and Lithuanian.
Model. For MORPH, we experiment with a pointer-generator network architecture (Gu et al., 2016; See et al., 2017). This is a sequence-to-sequence model similar to that for NORM, but it employs separate encoders for characters and features. It is further equipped with a copy mechanism: using attention to decide which element of the input sequence to copy, the model computes a probability for either copying or generating while producing an output. The final probability distribution over the target vocabulary is a combination of both. Hyperparameters are taken from Sharma et al. (2018). For early stopping, we also follow Sharma et al. (2018): all models are trained for at least 300 epochs, and training is continued for another 100 epochs each time there has been improvement on the development set within the last 100 epochs.
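One way to read that stopping rule is as a patience criterion combined with a minimum training length. The sketch below is our interpretation of the description, not code from Sharma et al. (2018).

```python
def should_continue(epoch, best_epoch, min_epochs=300, patience=100):
    """Continue training if we are below the minimum number of epochs,
    or if the best development score improved within the last
    `patience` epochs."""
    if epoch < min_epochs:
        return True
    return epoch - best_epoch < patience
```

For example, a model whose last development-set improvement was at epoch 300 would still be training at epoch 350, but one whose last improvement was at epoch 250 would have stopped by epoch 400.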

Transliteration (TRANSL)
Task. Transliteration is the task of converting names from one script into another while staying as close to the original pronunciation as possible. Unlike in translation, the focus lies on the sound; the target-language meaning is usually ignored.
Data. For our transliteration experiments, we follow Upadhyay et al. (2018). We experiment on datasets from the Named Entities Workshop 2015 (Duan et al., 2015) in Hindi, Kannada, Bengali, Tamil, and Hebrew. For this task, all languages are both development and target languages.
Model. The last featured model is an LSTM sequence-to-sequence model similar to that of Bahdanau et al. (2015), except that it uses hard monotonic attention (Aharoni and Goldberg, 2017): it attends to a single character at a time, and attention moves monotonically over the input. We take hyperparameters and code from Upadhyay et al. (2018). Early stopping is done by training for 20 epochs and applying the model with the best development accuracy to the test data.

Experimental Setup
We run all experiments using the implementations from previous work or OpenNMT as described above. Existing code is only modified where necessary. Most importantly, we add storing of the DevLang model during the main training phase.

Development sets vs. development languages.
We are asking if the use of a development set for early stopping leads to over- or underestimation of realistic model performance. Thus, we show in Table 2 how often we obtain higher accuracy for each of DevLang and DevSet. Additionally, averaged performance over all languages as well as the maximum difference in absolute accuracy are listed in Table 3. For NORM and TRANSL, results for individual languages are shown in Tables 4 and 5, respectively; for detailed results for MORPH, see Appendix A. We see in Table 2 that, for MORPH and NORM, early stopping on large development sets (DevSet) leads to better results than DevLang for 72 and 8 languages, respectively. For 8 and, respectively, 2 languages there is no difference, and a look at the detailed results in Appendix A and Table 4 reveals that, in those cases, we end up training for the same number of epochs for DevLang and DevSet. Only in 23 cases for MORPH, and in none for NORM, do we obtain better results for DevLang. This suggests that, for these two tasks, we frequently overestimate realistic model performance by early stopping on the development set. Indeed, Table 3 confirms this finding and shows that, on average across languages, DevSet models outperform DevLang models. The maximum difference is 18% absolute accuracy, for Azeri on MORPH: for DevSet, we reach 64% accuracy after epoch 217, while, for DevLang, we only obtain 46% accuracy after the predefined epoch 324.
We obtain a different picture for TRANSL: results are equal for 3 languages and better for DevLang for the remaining 2. The equal performance might be explained by the overall smaller number of training epochs in the original regime: stopping at the same epoch under both strategies is more likely. Overall, for TRANSL, performance on the development set seems to be less predictive of final test performance than for the other tasks.
Influence of the final epoch. Since, without a development set, performance decreases on MORPH for most languages, we investigate whether this can be explained by training that is often too short. Therefore, we plot the difference in training duration between DevSet and DevLang.

Discussion And Conclusion
Limitations. We investigate the effect of early stopping on the validation set as compared to a realistic setting without target-language development examples. However, we would like to point out that, in certain situations, standard practices might be sufficient, e.g., for comparing different methods under equal conditions, when absolute performance is not the main focus. Further, we do not claim to show that using a validation set always over- or underestimates real-world performance, since this depends on how representative the validation set is of the target distribution. Our main result is that using a development set can give a poor estimate of real-world performance, and that it is important to be aware of potential performance differences.
Practical take-aways. We replicate experiments from recent low-resource NLP research, once with the original experimental design and once without assuming development sets for early stopping. Since differences in absolute accuracy are up to 18.0%, we conclude that low-resource NLP research should move away from using large development sets for early stopping whenever real-world settings are being considered.