Predicting Performance for Natural Language Processing Tasks

Given the complexity of combinations of tasks, languages, and domains in natural language processing (NLP) research, it is computationally prohibitive to exhaustively test newly proposed models on each possible experimental setting. In this work, we explore the possibility of obtaining plausible judgments of how well an NLP model will perform under an experimental setting, without actually training or testing it. To do so, we build regression models that predict the evaluation score of an NLP experiment given its experimental settings as input. Experimenting on 9 different NLP tasks, we find that our predictors produce meaningful predictions over unseen languages and different modeling architectures, outperforming reasonable baselines as well as human experts. Going further, we outline how our predictor can be used to identify a small subset of representative experiments that should be run in order to obtain plausible predictions for all other experimental settings.


Introduction
Natural language processing (NLP) is an extraordinarily vast field, with a wide variety of models being applied to a multitude of tasks across a plenitude of domains and languages. In order to measure progress in all these scenarios, it is necessary to compare performance on test datasets representing each scenario. However, the cross-product of tasks, languages, and domains creates an explosion of potential application scenarios, and it is infeasible to collect high-quality test sets for each. In addition, even for tasks where we do have a wide variety of test data, e.g. well-resourced tasks such as machine translation (MT), it is still computationally prohibitive, as well as not environmentally friendly (Strubell et al., 2019), to build and test systems on all the languages or domains we are interested in. Because of this, the common practice is to test new methods on a small number of languages or domains, often semi-arbitrarily chosen based on previous work or the experimenters' intuition.
As a result, this practice impedes the NLP community from gaining a comprehensive understanding of newly-proposed models. Table 1 illustrates this with an example from bilingual lexicon induction, a task that aims to find word translation pairs from cross-lingual word embeddings. Almost all of the listed works report evaluation results on a different subset of language pairs. Evaluating on only a small subset raises concerns about the inferences drawn when comparing the merits of these methods: there is no guarantee that performance on English-Spanish (EN-ES, the only common evaluation dataset) is representative of the expected performance of the models over all other language pairs (Anastasopoulos and Neubig, 2020). Such phenomena lead us to ask whether it is possible to make a reasonably accurate estimate of performance on an untested language pair, without actually running the NLP model, thereby bypassing the computational constraint.
Toward that end, drawing on the idea of characterizing an experiment from Lin et al. (2019), we propose a framework, which we call NLPERF, to provide an exploratory solution. We build regression models to predict the performance in a particular experimental setting given past experimental records of the same task, with each record consisting of a characterization of its training dataset and a performance score on the corresponding metric.

[Table 1: An illustration of the comparability issues across methods and multiple evaluation datasets from the Bilingual Lexicon Induction task. Columns: BLI Method, Evaluation Set. Our prediction model can reasonably fill in the blanks, as illustrated in Section 4.]

Concretely, in §2, we start with a partly populated table (such as Table 1) and attempt to infer the missing values with the predictor. We begin by introducing the process of characterizing an NLP experiment for each task in §3. We evaluate the effectiveness and robustness of NLPERF by comparing to multiple baselines and to human experts, and by perturbing a single feature to simulate a grid search over that feature (§4). Evaluations on multiple tasks show that NLPERF is able to outperform all baselines. Notably, on a machine translation (MT) task, the predictions made by the predictor turn out to be more accurate than those of human experts. An effective predictor can be very useful for multiple practical applications. In §5, we show how the predictor can be adopted as a scoring function to find a small subset of experiments that are most representative of a bigger set of experiments. We argue that this will allow researchers to make informed decisions on what datasets to use for training and evaluation when they cannot experiment on all experimental settings. Last, in §6, we show that we can adequately predict the performance of new models even with a minimal number of experimental records.

Problem Formulation
In this section we formalize the problem of predicting performance on supervised NLP tasks. Given an NLP model of architecture $\mathcal{M}$, trained over dataset(s) $\mathcal{D}$ of a specific task involving language(s) $\mathcal{L}$, with a training procedure (optimization algorithm, learning rate scheduling, etc.) $\mathcal{P}$, we can test the model on a test dataset $\mathcal{D}'$ and get a score $S$ on a specific evaluation metric. The resulting score will surely vary depending on all of the above-mentioned factors, and we denote this relation as $g$:

$$S_{\mathcal{M},\mathcal{P},\mathcal{L},\mathcal{D},\mathcal{D}'} = g(\mathcal{M}, \mathcal{P}, \mathcal{L}, \mathcal{D}, \mathcal{D}') . \quad (1)$$

In the ideal scenario, for each test dataset $\mathcal{D}'$ of a specific task, one could enumerate all different settings and find the one that leads to the best performance. As mentioned in §1, however, such a brute-force method is computationally infeasible. Thus, we turn to modeling the process, formulating our problem as a regression task that uses a parametric function $f_\theta$ to approximate the true function $g$:

$$\hat{S}_{\mathcal{M},\mathcal{P},\mathcal{L},\mathcal{D},\mathcal{D}'} = f_\theta(\Phi_\mathcal{M}, \Phi_\mathcal{P}, \Phi_\mathcal{L}, \Phi_\mathcal{D}, \Phi_{\mathcal{D}'}) ,$$

where $\Phi_*$ denotes a set of features for each influencing factor. For the purpose of this study, we mainly focus on the language and dataset features $\Phi_\mathcal{L}$ and $\Phi_\mathcal{D}$, as this already results in a significant search space, and gathering extensive experimental results with fine-grained tuning over model and training hyperparameters is both expensive and relatively complicated. In the cases where we handle multiple models, we use only a single categorical model feature, denoted $\Phi_\mathcal{C}$, for the combination of model architecture and training procedure; we still use the term model to refer to this combination in the rest of the paper. We also omit the test set features, under the assumption that the data distributions of the training and testing data are the same (a fairly reasonable assumption if we ignore possible domain shift). Therefore, for all experiments below, our final prediction function is:

$$\hat{S} = f_\theta(\Phi_\mathcal{L}, \Phi_\mathcal{D} \,[, \Phi_\mathcal{C}]) .$$

In the next section we describe concrete instantiations of this function for several NLP tasks.
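As a minimal illustration of the inputs to $f_\theta$, the sketch below (our own naming and feature values, not the paper's code) flattens language features $\Phi_\mathcal{L}$, dataset features $\Phi_\mathcal{D}$, and the optional categorical model feature $\Phi_\mathcal{C}$ into a single vector for the regressor:

```python
# Minimal sketch of the predictor's input representation (our own naming,
# not the paper's code): language features Phi_L, dataset features Phi_D,
# and an optional categorical model feature Phi_C are flattened into one
# input vector for the regression model f_theta.

def featurize_record(lang_feats, data_feats, model_id=None):
    """Concatenate feature groups; model_id is the single categorical
    feature used in the multi-model setting."""
    x = list(lang_feats) + list(data_feats)
    if model_id is not None:
        x.append(model_id)
    return x

# A hypothetical MT record: six URIEL distances plus dataset statistics.
lang_feats = [0.41, 0.73, 0.55, 0.62, 0.38, 0.57]  # geo/gen/inv/syn/phon/feat
data_feats = [182000, 24.1, 0.31]                  # size, avg. sent. len, TTR
x = featurize_record(lang_feats, data_feats, model_id=3)
```

The regressor then maps `x` to a predicted evaluation score $\hat{S}$.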

NLP Task Instantiations
To build a predictor of NLP task performance, we must 1) select a task, 2) describe its featurization, and 3) train a predictor. We describe the details of these three steps in this section.

Tasks: We test on the following tasks: bilingual lexicon induction (BLI); machine translation trained on aligned Wikipedia data (Wiki-MT), on TED talks (TED-MT), and with cross-lingual transfer for translation into English (TSF-MT); cross-lingual dependency parsing (TSF-Parsing); cross-lingual POS tagging (TSF-POS); cross-lingual entity linking (TSF-EL); morphological analysis (MA); and universal dependency parsing (UD). Basic statistics on the datasets are outlined in Table 2. For the Wiki-MT task, we collect experimental records directly from the paper describing the corresponding datasets (Schwenk et al., 2019). For TED-MT and all the transfer tasks, we use the results of Lin et al. (2019). For BLI, we conduct experiments using published results from three papers, namely Artetxe et al. (2016), Artetxe et al. (2017), and Xu et al. (2018). For MA, we use the results of the SIGMORPHON 2019 shared task 2 (McCarthy et al., 2019). Last, the UD results are taken from the CoNLL 2018 Shared Task on universal dependency parsing (Zeman et al., 2018b).
Featurization For language features, we utilize six distance features from the URIEL Typological Database (Littell et al., 2017), namely geographic, genetic, inventory, syntactic, phonological, and featural distance.
The complete set of dataset features includes the following:

1. Dataset Size: the number of training examples (e.g. parallel sentences for MT).
2. Word/Subword Vocabulary Size: the number of word or subword types in the training corpus.
3. Average Sentence Length: the average number of tokens per training sentence.
4. Word/Subword Overlap: $\frac{|T_1 \cap T_2|}{|T_1| + |T_2|}$, where $T_1$ and $T_2$ denote the vocabularies of any two corpora.
5. Type-Token Ratio (TTR): the ratio between the number of types and the number of tokens (Richards, 1987) in one corpus.
6. Type-Token Ratio Distance: $\left(1 - \frac{\mathrm{TTR}_1}{\mathrm{TTR}_2}\right)^2$, where $\mathrm{TTR}_1$ and $\mathrm{TTR}_2$ denote the TTRs of any two corpora.
7. Single Tag Type: the number of single tag types.
8. Fused Tag Type: the number of fused tag types.
9. Average Tag Length Per Word: the average number of single tags per word.
10. Dependency Arcs Matching WALS Features: the proportion of dependency parsing arcs matching the following WALS features, computed over the training set: subject/object/oblique before/after verb and adjective/numeral before/after noun.

For transfer tasks, we use the same set of dataset features $\Phi_\mathcal{D}$ as Lin et al. (2019), including features 1-6 on both the source and the transfer language side. We also include language distance features between the source and transfer language, as well as between the source and target language. For MT tasks, we use features 1-6 and language distance features, but only between the source and target language. For MA, we use features 1, 2, and 5, plus the morphological-tag-related features 7-9. For UD, we use features 1, 2, 5, and 10. For BLI, we use language distance features and URIEL syntactic features for the source and the target language.

Predictor: Our prediction model is based on gradient boosting trees (Friedman, 2001), implemented with XGBoost (Chen and Guestrin, 2016). This method is widely known as an effective means of solving problems including ranking, classification, and regression. We also experimented with Gaussian processes (Williams and Rasmussen, 1996), but settled on gradient boosted trees because performance was similar and XGBoost's implementation is very efficient through its use of parallelism. We use squared error as the objective function for the regression and adopt a fixed learning rate of 0.1.
To allow the model to fully fit the data, we set the maximum tree depth to 10 and the number of trees to 100, and use the default regularization terms to prevent the model from overfitting.

Can We Predict NLP Performance?
In this section we investigate the effectiveness of NLPERF across different tasks and metrics. Following Lin et al. (2019), we conduct k-fold cross-validation for evaluation. Specifically, we randomly partition the experimental records of ⟨L, D, C, S⟩ tuples into k folds, use k−1 folds to train a prediction model, and evaluate on the remaining fold. Note that this scenario is similar to "filling in the blanks" in Table 1, where we have some experimental records that we can train the model on, and predict the remaining ones.
For evaluation, we calculate the average root mean square error (RMSE) between the predicted scores and the true scores.
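This evaluation protocol can be sketched as follows (our own code, not the paper's, with a trivial training-mean predictor as a stand-in for the regressor):

```python
# Sketch of the k-fold evaluation protocol: partition records into k folds,
# train on k-1 folds, predict the held-out fold, and report the average
# per-fold RMSE.
import math
import random

def kfold_rmse(records, train_and_predict, k=5, seed=0):
    """records: list of (features, score) pairs. train_and_predict(train,
    test_X) returns predicted scores for test_X."""
    records = records[:]
    random.Random(seed).shuffle(records)
    folds = [records[i::k] for i in range(k)]
    fold_rmses = []
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        test = folds[i]
        preds = train_and_predict(train, [x for x, _ in test])
        sq_errs = [(p - s) ** 2 for p, (_, s) in zip(preds, test)]
        fold_rmses.append(math.sqrt(sum(sq_errs) / len(sq_errs)))
    return sum(fold_rmses) / k

def training_mean(train, test_X):
    """Trivial predictor: always output the mean training score."""
    m = sum(s for _, s in train) / len(train)
    return [m] * len(test_X)
```

In the actual experiments, `train_and_predict` would fit the gradient boosted trees on the k−1 training folds.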
Baselines: We compare against a simple mean value baseline, as well as against language-wise and model-wise mean value baselines. The simple mean value baseline outputs the average of the scores $s^{(k,l)}$ (the score of the $l$-th record in fold $k$) from the training folds for all test entries in the left-out fold $i$:

$$\hat{s}^{(i)}_{\text{mean}} = \frac{\sum_{k \ne i} \sum_{l} s^{(k,l)}}{\sum_{k \ne i} n_k} , \quad (2)$$

where $n_k$ denotes the number of records in fold $k$. Note that for tasks involving multiple models, we calculate the RMSE score separately for each model and use the mean RMSE over all models as the final RMSE score.
The language-wise baselines make more informed predictions, taking into account only training instances with the same transfer, source, or target language (depending on the task setting). For example, the source-language mean value baseline $\hat{s}^{(i,j)}_{\text{s-lang}}$ for the $j$-th test instance in fold $i$ outputs the average of the scores $s$ of the training instances that share the same source language feature $\phi_{\text{s-lang}}$, as shown in Equation 3:

$$\hat{s}^{(i,j)}_{\text{s-lang}} = \frac{\sum_{k \ne i} \sum_{l} \delta\!\left(\phi^{(k,l)}_{\text{s-lang}} = \phi^{(i,j)}_{\text{s-lang}}\right) s^{(k,l)}}{\sum_{k \ne i} \sum_{l} \delta\!\left(\phi^{(k,l)}_{\text{s-lang}} = \phi^{(i,j)}_{\text{s-lang}}\right)} , \quad (3)$$

where $\delta$ is the indicator function. Similarly, we define the target- and the transfer-language mean value baselines.
In a similar manner, we also compare against a model-wise mean value baseline for tasks that include experimental records from multiple models. Now, the prediction for the $j$-th test instance in the left-out fold $i$ is the average of the scores obtained on the same dataset (as characterized by the language features $\phi_\mathcal{L}$ and dataset features $\phi_\mathcal{D}$) by all other models:

$$\hat{s}^{(i,j)}_{\text{model}} = \frac{\sum_{k \ne i} \sum_{l} \delta\!\left(\phi^{(k,l)}_{\mathcal{L}} = \phi^{(i,j)}_{\mathcal{L}}\right) \delta\!\left(\phi^{(k,l)}_{\mathcal{D}} = \phi^{(i,j)}_{\mathcal{D}}\right) s^{(k,l)}}{\sum_{k \ne i} \sum_{l} \delta\!\left(\phi^{(k,l)}_{\mathcal{L}} = \phi^{(i,j)}_{\mathcal{L}}\right) \delta\!\left(\phi^{(k,l)}_{\mathcal{D}} = \phi^{(i,j)}_{\mathcal{D}}\right)} , \quad (4)$$

where $\phi^{(i,j)}_{\mathcal{L}}$ and $\phi^{(i,j)}_{\mathcal{D}}$ respectively denote the language and dataset features of the test instance.
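These baselines amount to conditional averages over the training records. A sketch with our own (hypothetical) record layout:

```python
# Sketch of the language-wise and model-wise mean-value baselines (our own
# data layout, not the paper's code). Each record is a tuple:
# (src_lang, dataset_id, model_id, score).

def source_lang_mean(train, src_lang):
    """Average score over training records sharing the source language."""
    scores = [s for lang, _d, _m, s in train if lang == src_lang]
    return sum(scores) / len(scores) if scores else None

def model_wise_mean(train, dataset_id, model_id):
    """Average score obtained on the same dataset by all *other* models."""
    scores = [s for _l, d, m, s in train if d == dataset_id and m != model_id]
    return sum(scores) / len(scores) if scores else None

# Hypothetical training records (language codes and scores are made up).
train = [
    ("tur", "wiki-tur-eng", "A", 20.0),
    ("tur", "ted-tur-eng",  "A", 24.0),
    ("por", "wiki-por-eng", "A", 30.0),
    ("tur", "wiki-tur-eng", "B", 22.0),
]
```

The target- and transfer-language baselines follow the same pattern, filtering on a different field.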
Main Results: For multi-model tasks, we can do either Single-Model (SM) prediction, restricting training and testing of the predictor to a single model, or Multi-Model (MM) prediction, using a categorical model feature. The RMSE scores of NLPERF along with the baselines are shown in Table 3. For all tasks, our single-model predictor estimates the evaluation scores of unseen experiments more accurately than the single-model baselines, confirming our hypothesis that there exists a learnable correlation between experimental settings and the downstream performance of NLP systems. The language-wise baselines are much stronger than the simple mean value baseline but still perform worse than our single-model predictor. Similarly, the model-wise baseline significantly outperforms the mean value baseline, because results from other models reveal much information about the dataset.

[Table 3: RMSE scores of the three baselines and our predictions under the single-model and multi-model settings; missing values correspond to settings not applicable to the task. All results are from k-fold (k = 5) evaluations averaged over 10 random runs.]
Even so, our multi-model predictor still outperforms the model-wise baseline.
The results nicely imply that for a wide range of tasks, our predictor is able to reasonably estimate left-out slots in a partly populated table given results of other experiment records, without actually running the system.
We should note that RMSE scores across different tasks should not be directly compared, mainly because the scale of each evaluation metric is different. For example, a BLEU score (Papineni et al., 2002) for MT experiments typically ranges from 1 to 40, while an accuracy score usually has a much larger range, for example, BLI accuracy ranges from 0.333 to 78.2 and TSF-POS accuracy ranges from 1.84 to 87.98, which consequently makes the RMSE scores of these tasks higher.

Comparison to Expert Human Performance
We constructed a small-scale case study to evaluate whether NLPERF is competitive with the performance of NLP sub-field experts. We focused on the TED-MT task and recruited 10 MT practitioners, all of whom had published at least 3 MT-related papers in ACL-related conferences.
In the first set of questions, the participants were presented with language pairs from one of the k data folds along with the dataset features, and were asked to estimate an eventual BLEU score for each data entry. In the second part of the questionnaire, the participants were tasked with making estimates on the same set of language pairs, but this time they also had access to the features and BLEU scores from all the other folds. (None of the study participants were affiliated with the authors' institutions, nor were they familiar with this paper's content. The interested reader can find an example questionnaire, and make estimations over one of the folds, in Appendix A.) The partition of the folds is consistent between the human study and the training/evaluation for the predictor. While the first sheet is intended to familiarize the participants with the task, the second sheet directly mirrors the training/evaluation setting of our predictor. As shown in Table 4, our participants outperform the mean baseline even without information from other folds, demonstrating their strong prior knowledge of the field. In addition, the participants make more accurate guesses after acquiring more information on experimental records from the other folds. In neither case, though, are the human experts competitive with our predictor. In fact, only one of the participants achieved performance comparable to our predictor.
Feature Perturbation: Another question of interest concerning performance prediction is "how will the model perform when trained on data of a different size?" (Kolachina et al., 2012a). To test NLPERF's extrapolation ability in this regard, we conduct an array of experiments on single language pairs with varying data sizes on the Wiki-MT task. We pick two language pairs, Turkish to English (TR-EN) and Portuguese to English (PT-EN), as our testbed. We sample parallel datasets of different sizes and train MT models on each sampled dataset to obtain the true BLEU scores. In parallel, we collect the features of all sampled datasets and use our predictor (trained over all other language pairs) to obtain predictions. The true and predicted BLEU scores are plotted in Figure 1. Our predictor achieves a very low average RMSE of 1.83 for the TR-EN pair, but a relatively higher RMSE of 9.97 for the PT-EN pair. The favorable performance on the TR-EN pair demonstrates the predictor's ability to extrapolate over dataset size. In contrast, the predictions on the PT-EN pair are significantly less accurate. This is because there are only two other experimental settings scoring as high as 34 BLEU, with data sizes of 3378k (en-es) and 611k (gl-es), leaving the predictor unable to predict high BLEU scores for small datasets during extrapolation. This reveals that while the predictor is able to extrapolate performance on settings similar to those it has seen in the data, NLPERF may be less successful under circumstances unlike its training inputs.
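The data-size perturbation can be sketched as follows (our own code, not the paper's pipeline; the corpus and feature choices are illustrative): sample the parallel corpus at several sizes and recompute the size-dependent features before querying the trained predictor.

```python
# Sketch of the data-size perturbation: resample a parallel corpus at
# several sizes and recompute size-dependent dataset features. The features
# returned here (size, source-side TTR) are a simplified illustration.
import random

def subsample_features(corpus, sizes, seed=0):
    """corpus: list of (src_sentence, tgt_sentence) pairs. For each target
    size, sample that many pairs and return (size, type-token ratio)."""
    rng = random.Random(seed)
    feats = []
    for n in sizes:
        sample = rng.sample(corpus, n)
        tokens = [tok for src, _tgt in sample for tok in src.split()]
        ttr = len(set(tokens)) / len(tokens)
        feats.append((n, ttr))
    return feats
```

Each feature tuple would then be fed to the trained predictor to produce one point of the predicted curve.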

What Datasets Should We Test On?
As shown in Table 1, it is common practice to test models on a subset of all available datasets. The reason for this is practical: it is computationally prohibitive to evaluate on all settings. However, if we pick test sets that are not representative of the data as a whole, we may mistakenly reach unfounded conclusions about how well models perform on other data with distinct properties. For example, models trained on a small-sized dataset may not scale well to a large-sized one, and models that perform well on languages with a particular linguistic characteristic may not do well on languages with other characteristics (Bender and Friedman, 2018).
Here we ask the following question: if we are only able to test on a small number of experimental settings, which ones should we test on to achieve maximally representative results? Answering this question has practical implications: organizers of large shared tasks such as SIGMORPHON (McCarthy et al., 2019) or UD (Zeman et al., 2018a) could create a minimal subset of settings upon which they would ask participants to test in order to obtain representative results; similarly, participants could expedite the iteration of model development by testing on the representative subset only. For researchers and companies deploying systems over multiple languages, a similar approach could lead not only to financial savings, but potentially to a significant reduction in emissions from model training (Strubell et al., 2019).
We present an approximate explorative solution to the above problem. Formally, assume that we have a set $N$ comprising the experimental records (both features and scores) of $n$ datasets for one task, and a number $m$ ($< n$) of datasets that we would like to select as the representative subset. Defining $\operatorname{RMSE}_{A}(B)$ to be the RMSE score obtained by training the predictor on a subset $A$ of experimental records and evaluating it on another subset $B$, we consider the most representative subset $\mathcal{D}^{*}$ to be the one that minimizes the RMSE score when predicting all of the other datasets:

$$\mathcal{D}^{*} = \operatorname*{arg\,min}_{\mathcal{D} \subset N,\, |\mathcal{D}| = m} \operatorname{RMSE}_{\mathcal{D}}(N \setminus \mathcal{D}) . \quad (5)$$

Naturally, enumerating all $\binom{n}{m}$ possible subsets would lead to the optimal solution, but would be prohibitively costly. Instead, we employ a beam-search-like approach to efficiently find an approximation of the best-performing subset of arbitrary size. Concretely, we start our approximate search with an exhaustive enumeration of all subsets of size 2. At each following step $t$, we keep only the best $k$ subsets $\{\mathcal{D}^{(i)}_{t}; i \in 1, \ldots, k\}$ and discard the rest. As shown in Equation 6, we expand each kept subset with one more data point:

$$\mathcal{D}^{(i)}_{t+1} = \mathcal{D}^{(i)}_{t} \cup \{d\}, \quad d \in N \setminus \mathcal{D}^{(i)}_{t} . \quad (6)$$

For tasks that involve multiple models, we include the experimental records of the selected dataset from all models during expansion. Given all expanded subsets, we train a predictor on each and evaluate it on the rest of the datasets, keeping the $k$ subsets $\{\mathcal{D}^{(i)}_{t+1}; i \in 1, \ldots, k\}$ with the minimum RMSE scores for the next step. Furthermore, note that by simply changing the arg min in Equation 5 to an arg max, we can also find the least representative datasets.
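The search procedure can be sketched as follows (our own implementation; `rmse_of` stands in for $\operatorname{RMSE}_{A}(B)$, i.e. training a predictor on subset A and evaluating on B):

```python
# Sketch of the beam search over representative subsets. `rmse_of(A, B)`
# stands in for RMSE_A(B): train on subset A, evaluate on B.
from itertools import combinations

def beam_search_subsets(points, rmse_of, target_size, beam=3):
    """Grow subsets from size 2 to target_size, keeping the `beam` subsets
    whose predictor best fits the remaining points at each step."""
    def score(subset):
        rest = [p for p in points if p not in subset]
        return rmse_of(subset, rest)

    cands = sorted((set(c) for c in combinations(points, 2)), key=score)
    cands = cands[:beam]
    while len(next(iter(cands))) < target_size:
        expanded = []
        for s in cands:
            for p in points:
                if p not in s:
                    expanded.append(s | {p})
        # de-duplicate and keep the best `beam` expanded subsets
        uniq = {frozenset(s) for s in expanded}
        cands = sorted((set(s) for s in uniq), key=score)[:beam]
    return min(cands, key=score)
```

Swapping the sort order (keeping the worst-scoring subsets) yields the least representative datasets, mirroring the arg min/arg max switch in Equation 5.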
We present search results for four tasks as beam search progresses in Figure 2, with the corresponding RMSE scores on all remaining datasets as the y-axis (results on the other tasks can be found in Appendix B). For comparison, we also conduct random searches by expanding the subset with a randomly selected experimental record. In all cases, the most representative sets are aggregations of datasets with diverse characteristics, such as languages and dataset sizes. For example, in the Wiki-MT task, the 5 most representative datasets include languages that fall into a diverse range of language families, such as Romance, Turkic, and Slavic, while the least representative ones include duplicate pairs (opposite directions) mostly
involving English. The phenomenon is more pronounced in the TED-MT task, where not only are the 5 most representative source languages diverse, but so are the dataset sizes. Specifically, Malay-English (msa-eng) is a tiny dataset (5k parallel sentences), while Hebrew-English (heb-eng) is a high-resource case (212k parallel sentences).
Notably, for the BLI task, to test how representative the commonly used datasets are, we select the 5 most frequent language pairs shown in Table 1, namely en-de, es-en, en-es, fr-en, and en-fr, for evaluation. Unsurprisingly, we get an RMSE score as high as 43.44, quite close to that of the least representative set found using beam search. This finding indicates that the standard practice of choosing datasets for evaluation is likely unrepresentative of results over the full dataset spectrum, well in line with the claims of Anastasopoulos and Neubig (2020).
A particularly encouraging observation is that a predictor trained with only the 5 most representative datasets can achieve an RMSE score comparable to that of k-fold validation, which uses all of the datasets for training. This indicates that one would only need to train NLP models on a small set of representative datasets to obtain reasonably plausible predictions for the rest.

Can We Extrapolate Performance for New Models?
In another common scenario, researchers propose new models for an existing task. It is both time-consuming and computationally intensive to run experiments with all settings for a new model. In this section, we explore whether we can use past experimental records from other models, together with a minimal set of experiments with the new model, to give a plausible prediction over the rest of the datasets, potentially reducing the time and resources needed for experimenting with the new model to a large extent. We use the task of UD parsing as our testbed, as it is the task with the most unique models (25, to be exact); MA and BLI task results are in Appendix C. Note that we still use only a single categorical feature for the model type.
To investigate how many experiments are needed to obtain a plausible prediction for a new model, we first split the experimental records equally into a sample set and a test set. Then we randomly sample n (0 ≤ n ≤ 5) experimental records from the sample set and add them to the collection of experimental records of past models. Each time, we re-train a predictor and evaluate it on the test set. The random split is repeated 50 times and the random sampling 50 times, adding up to a total of 2500 experiments. As the prediction baseline for the left-out model, we use the mean value of the results of all other models, shown in Equation 7; because the experimental results of other models reveal significant information about the dataset, this is a relatively strong baseline:

$$\hat{s}^{(j)}_{k} = \frac{1}{|M| - 1} \sum_{m \in M,\, m \ne k} s^{(j)}_{m} , \quad (7)$$

where $M$ denotes the collection of models, $k$ denotes the left-out model, and $s^{(j)}_{m}$ is the score of model $m$ on the $j$-th test dataset.
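This sampling protocol can be sketched as follows (our own code; `train_and_predict` stands in for re-training the predictor on the augmented record pool):

```python
# Sketch of the protocol: split the new model's records into a sample set
# and a test set, reveal n sampled records to the training pool, retrain,
# and measure RMSE on the test half.
import math
import random

def rmse_at_n(new_model_records, past_records, train_and_predict, n, seed=0):
    """Records are (features, score) pairs; past_records come from other
    models. Returns the RMSE on the held-out half of the new model."""
    rng = random.Random(seed)
    recs = new_model_records[:]
    rng.shuffle(recs)
    half = len(recs) // 2
    sample_set, test_set = recs[:half], recs[half:]
    train = past_records + rng.sample(sample_set, n)
    preds = train_and_predict(train, [x for x, _ in test_set])
    sq_errs = [(p - s) ** 2 for p, (_, s) in zip(preds, test_set)]
    return math.sqrt(sum(sq_errs) / len(sq_errs))
```

In the paper's setup this is repeated over 50 random splits and 50 random samples, and the results are averaged.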
We show the prediction performance (in RMSE) over 8 systems (the best and worst 4 systems from the shared task) in Figure 3. Interestingly, the predictor trained with no records of the new model (n = 0) outperforms the mean value baseline for the 4 best systems, while the opposite holds for the 4 worst systems. Since no information is provided about the newly-introduced model, the predictions are based solely on dataset and language features. One reason might explain this phenomenon: the correlation between the features and the scores of the worse-performing systems (e.g. ONLP) differs from that of the better-performing systems, so the predictor is unable to generalize well.
In the following discussion, we use RMSE@n to denote the RMSE of the predictor trained with n data points from a new model. The relatively low RMSE@0 scores indicate that other models' features and scores are informative for predicting the performance of the new model, even without any information about it. Comparing RMSE@0 and RMSE@1, we observe a consistent improvement for almost all systems, indicating that NLPERF trained on even a single extra random example achieves more accurate estimates over the test sets. Adding more data points consistently leads to additional gains. However, predictions on worse-performing systems benefit more from these extra points than predictions on better-performing systems, indicating that their feature-performance correlations might be considerably different. These findings indicate that by extrapolating from past experiments, one can make plausible judgments about newly developed models.

Related Work
As discussed in Domhan et al. (2015), there are two main threads of work on predicting the performance of machine learning algorithms: the first predicts the performance of a method as a function of its training time, while the second predicts a method's performance as a function of the training dataset size. Our work belongs to the second thread, but could easily be extended to encompass training time/procedure.
In the first thread, Kolachina et al. (2012b) attempt to infer learning curves based on training data features and to extrapolate the initial learning curves based on BLEU measurements for statistical machine translation (SMT). Extrapolating the performance of initial learning curves also allows for early termination of a bad run (Domhan et al., 2015).
In the second thread, Birch et al. (2008) adopt linear regression to capture the relationship between data features and SMT performance, finding that the amount of reordering, the morphological complexity of the target language, and the relatedness of the two languages explain the majority of performance variability. More recently, Elsahar and Gallé (2019) predict performance drops under domain shift, while Rosenfeld et al. (2020) explore the functional form of the dependency of the generalization error of neural models on model and data size. We view our work as a generalization of such approaches, appropriate for application to any NLP task.

Conclusion and Future Work
In this work, we investigate whether the experimental setting itself is informative for predicting the evaluation scores of NLP tasks. Our findings promisingly show that, given a sufficient number of past experimental records, our predictor can 1) outperform human experts; 2) make plausible predictions even for new models and languages; 3) extrapolate well on features like dataset size; and 4) provide guidance on how to choose representative datasets for fast iteration. While this is a promising start, there are still several avenues for improvement in future work.
First, the dataset and language settings covered in our study are still limited. The experimental records we use come from relatively homogeneous settings; e.g., all datasets in the Wiki-MT task are processed with SentencePiece using a vocabulary of 5000 subwords, so our predictor may fail for other subword settings. Our model also fails to generalize to cases where feature values are out of the range of the training experimental records. We attempted to apply the Wiki-MT predictor to a low-resource MT dataset, translating from Mapudungun (arn) to Spanish (spa) with the dataset from Duan et al. (2019), but ended up with a poor RMSE score. It turned out that the average sentence length of the arn-spa dataset is much lower than that of the training datasets, and our predictor fails to generalize to this different setting.
Second, using a categorical feature to denote model types constrains the predictor's expressive power for modeling performance. In reality, a slight change in model hyperparameters (Hoos and Leyton-Brown, 2014; Probst et al., 2019), optimization algorithms (Kingma and Ba, 2014), or even random seeds (Madhyastha and Jain, 2019) may give rise to significant variation in performance, which our predictor is not able to capture. While investigating the systematic implications of model structures or hyperparameters is practically infeasible in this study, in the future we may use additional information, such as textual model descriptions, to model NLP systems and training procedures more elaborately.
Lastly, we assume that the distributions of the training and testing data are the same, which does not account for domain shift. On top of this, there might also be domain shift between the datasets of the training and testing experimental records. We believe that modeling domain shift is a promising future direction for improving performance prediction.

Appendix A Questionnaire
An example of the first questionnaire from our user case study is shown below. The second sheet also included the results in 44 more language pairs. We provide an answer key after the second sheet.
Please provide your prediction of the BLEU score based on the language pair and dataset features (the domain of the training and test sets is TED talks). After you finish, please go to sheet v2.

Appendix C New Model
In this section, we show the extrapolation performance for new models on BLI, MA, and the remaining UD systems.