How Far Can We Go with Data Selection? A Case Study on Semantic Sequence Tagging Tasks

Although several works have addressed the role of data selection to improve transfer learning for various NLP tasks, there is no consensus about its real benefits and, more generally, there is a lack of shared practices on how it can be best applied. We propose a systematic approach aimed at evaluating data selection in scenarios of increasing complexity. Specifically, we compare the case in which source and target tasks are the same while source and target domains are different, against the more challenging scenario where both tasks and domains are different. We run a number of experiments on semantic sequence tagging tasks, which are relatively less investigated in data selection, and conclude that data selection is more beneficial in the scenario where the tasks are the same, while for different (although related) tasks from distant domains, a combination of data selection and multi-task learning is ineffective in most cases.


Introduction
Transfer learning is a common approach for training NLP models that scale across different tasks, domains, and languages. One of the challenges in transfer learning is dealing with the data distribution mismatch between the source data ($D_S$) and the target data ($D_T$) (Rosenstein et al., 2005). One solution to alleviate the impact of the mismatch is data selection, a process for selecting relevant training instances from the source data. Data selection (DS) has been applied in the context of domain adaptation to address changes in the data distribution for various NLP tasks, such as sentiment analysis and POS tagging (Ruder and Plank, 2017; Liu et al., 2019; Blitzer et al., 2007; Remus, 2012), machine translation (Axelrod et al., 2011), dependency parsing (Søgaard, 2011), and Named Entity Recognition (NER) (Murthy et al., 2018; Zhao et al., 2018). To our knowledge, all existing previous work applies data selection across different domains, while maintaining the same task.
In this work we aim to investigate the benefit of data selection in a more complex setting, where we have not only different domains ($D_S \neq D_T$), but also different tasks ($T_S \neq T_T$). Intuitively, such a setting may bring advantages in situations where large training data are available for a source task $T_S$, and we want to exploit such data for a different (although related) target task $T_T$, for which much less training data is available. We experiment with the situation where $T_S$ is Named Entity Recognition (NER) on a general domain, where several datasets are available, and $T_T$ is slot tagging (ST) in the context of utterance interpretation for dialogue systems, where much less data is available. Both tasks are rarely investigated in data selection, and there is no consensus about the benefit of data selection for them.
We propose an experimental framework where we can compare data selection settings with an increasing level of complexity. First, we consider data selection where NER is both the source and target task, and apply transfer learning across different domains: we call this setting Same Tasks from Different Domains (STDD), $T_S = T_T$ and $D_S \neq D_T$. In a second, more complex setting, we consider NER as the source task and ST as the target: we call this setting Different Tasks from Different Domains (DTDD), $T_S \neq T_T$ and $D_S \neq D_T$. In this scenario, as the label spaces of the source and the target task are disjoint, we combine the data selection process with multi-task learning (MTL). To our knowledge, this combination has received very little attention in the literature.
We base our work on the data selection framework proposed by Ruder and Plank (2017), and apply it to our experimental settings. Their framework is model-agnostic and has shown significant advantages in sentiment analysis, POS tagging, and parsing. However, it is not obvious to what extent the selection process can actually help on semantic sequence tagging tasks in the STDD and DTDD scenarios. The contributions of the paper are the following: (i) we extend previous work on data selection to a multi-task learning setup to evaluate its effectiveness in the DTDD scenario; (ii) we systematically compare data selection on settings of increasing complexity, and observe that existing selection metrics do not show clear advantages over baselines in most cases. Nevertheless, data selection has more potential in STDD when source and target are more similar, while combining MTL and data selection for DTDD is ineffective in most cases in our experimental settings, in which we have different but related tasks (NER and ST) from relatively distant domains (news and conversational domains).

Data Selection Framework
In general, the goal of data selection is to select an optimal subset of training instances, $X_S^*$, from all the available data $X_S$ in $T_S$, to be used for training the model $M_{T_T}$ for the target task. Given the source data $X_S = \{x_1^S, x_2^S, \ldots, x_n^S\}$, each instance is ranked according to a score $S$ and the top $m$ examples are then used to train $M_{T_T}$.
We apply the data selection approach of Ruder and Plank (2017), based on Bayesian Optimization (BO) (Brochu et al., 2010), to evaluate the effectiveness of data selection in both the STDD and DTDD scenarios. Specifically, for DTDD we combine data selection and multi-task learning. Given $X_S$, the framework performs data selection based on a score $S$ derived from a set of features. The top $m$ examples are then used to train $M_{T_T}$. In the STDD case, $M_{T_T}$ is a single-task sequence tagging model, for which we use a biLSTM-CRF model (Lample et al., 2016). For DTDD, $M_{T_T}$ is a hard parameter sharing MTL model, an architecture that has been applied to many NLP tasks (Plank et al., 2016; Changpinyo et al., 2018; Schulz et al., 2018). The performance on the validation set of the target task is then used by the BO optimizer to update the weights of the scoring features.
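For concreteness, below is a minimal PyTorch sketch of a hard parameter sharing tagger. The class and parameter names are ours, and we use plain per-token softmax heads where the actual models add a CRF output layer on top of the biLSTM:

```python
import torch.nn as nn

class HardSharingTagger(nn.Module):
    """Hard parameter sharing: one encoder shared by all tasks, plus one
    task-specific output head per task. This sketch omits the CRF layer
    used in the paper's models for brevity."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, labels_per_task):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Shared across all tasks: this is the "hard sharing" part.
        self.encoder = nn.LSTM(emb_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # One linear head per task, e.g. [n_ner_labels, n_slot_labels].
        self.heads = nn.ModuleList(
            nn.Linear(2 * hidden_dim, n) for n in labels_per_task)

    def forward(self, token_ids, task_id):
        hidden, _ = self.encoder(self.embedding(token_ids))
        return self.heads[task_id](hidden)  # per-token label logits
```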
Following Ruder and Plank (2017), the selection process is based on a score $S$ computed as a linear combination of weighted features, which include both similarity and diversity features: $S_\theta(x) = \theta \cdot \phi(x)$, where $\theta$ is the vector of feature weights and $\phi(x)$ denotes the feature values of instance $x$. The features are computed between the representations of the $X_S$ instances and $X_T$; we use term distributions as instance representations. We use the same similarity and diversity measures as Ruder and Plank (2017). The weights $\theta$ are learned through BO by taking into account the performance on the validation set when a particular subset of $X_S$ is selected. The score $S$ is computed for each $x$ in $X_S$, and the top $m$ examples are then selected for training the $M_{T_T}$ model. The loss $L$ of $M_{T_T}$ on the validation set is used by BO as feedback to select the next candidate $\theta$.
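To make the selection step concrete, here is a minimal sketch of scoring and top-$m$ selection, assuming the feature matrix holding $\phi(x)$ has already been computed for every source instance (function and variable names are illustrative, not from the original framework):

```python
import numpy as np

def select_top_m(features, theta, m):
    """Rank source instances by S_theta(x) = theta . phi(x), keep the top m.

    features: (n_instances, n_features) array holding phi(x) for each x in X_S,
              i.e. the similarity and diversity feature values.
    theta:    (n_features,) weight vector proposed by the Bayesian optimizer.
    """
    scores = features @ theta          # S_theta(x) for every source instance
    return np.argsort(-scores)[:m]     # indices of the m highest-scoring ones

# One BO iteration then trains M_{T_T} on the selected subset, measures the
# loss on the target validation set, and feeds it back to the optimizer,
# which proposes the next theta to evaluate.
```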

Experiments
We systematically investigate the effectiveness of data selection when applied in both the STDD and DTDD scenarios. We address two semantic sequence labeling tasks: Named Entity Recognition (NER) and slot tagging (ST).

Datasets
For NER we use the OntoNotes 5.0 dataset (Pradhan et al., 2012), which consists of several sections: newswire (NW), broadcast conversation (BC), telephone conversation (TC), broadcast news (BN), web text (WB), and magazine articles (MZ). We use the different OntoNotes sections as different domains in our experiments.
As for ST, we use three datasets that are widely used as benchmarks for spoken language understanding: ATIS (Price, 1990), MIT-R, and MIT-M (Liu et al., 2013). Each dataset contains utterances annotated with domain-specific slot labels, which are typically more fine-grained than NER labels. For example, in the utterance "show me all Delta flights from Milan to New York", the words "Delta", "Milan", and "New York" are tagged as airline name, fromloc, and toloc, respectively. The overall statistics of each dataset are shown in Table 1.
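For illustration, the example utterance in BIO format would look roughly as follows (ATIS-style label names; the exact labels may differ slightly across dataset releases):

```python
# Illustrative BIO annotation for the example utterance (ATIS-style labels;
# exact label names are an assumption, not taken from a specific release).
utterance = [
    ("show", "O"), ("me", "O"), ("all", "O"),
    ("Delta", "B-airline_name"),
    ("flights", "O"), ("from", "O"),
    ("Milan", "B-fromloc.city_name"),
    ("to", "O"),
    ("New", "B-toloc.city_name"), ("York", "I-toloc.city_name"),
]
```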

Data Selection Configurations
We make use of the selection framework described in Section 2, and apply three Bayesian Optimization data selection (BODS) configurations, according to whether we use both similarity and diversity features ($DS_{sim,div}$), similarity features only ($DS_{sim}$), or diversity features only ($DS_{div}$). We compare the three configurations with the following baselines:

• All source, which uses all the data from $T_S$.
• Random, which selects random data from $T_S$.
• $DS_{map,full}$, which relies on a manual mapping from NER labels to ST labels (Appendix A). A sentence from $T_S$ is selected if all the NER occurrences in the sentence have a mapping to a slot label in $T_T$ (see the sketch below).
• $DS_{map,partial}$. A sentence from $T_S$ is selected if at least one of the NER occurrences in the sentence has a mapping to a slot label in $T_T$.
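A minimal sketch of the two mapping-based criteria is given below; the `label_map` argument stands in for the manual mapping in Appendix A (its example entry is hypothetical), and skipping sentences that contain no entities at all is our assumption:

```python
def select_by_mapping(sentences, label_map, require_all=True):
    """Mapping-based selection over source sentences.

    sentences:   list of (tokens, bio_ner_labels) pairs from T_S.
    label_map:   manual NER-to-slot mapping, e.g. {"GPE": "city_name"}
                 (hypothetical entry; the real mapping is in Appendix A).
    require_all: True -> DS_map_full, False -> DS_map_partial.
    """
    selected = []
    for tokens, bio_labels in sentences:
        # Entity types in the sentence, with the B-/I- prefix removed.
        types = {lab.split("-", 1)[1] for lab in bio_labels if lab != "O"}
        if not types:
            continue  # assumption: skip sentences with no entities at all
        mapped = [t in label_map for t in types]
        if (all(mapped) if require_all else any(mapped)):
            selected.append((tokens, bio_labels))
    return selected
```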

Settings
We follow most of the hyperparameters recommended by Reimers and Gurevych (2018). We train the model for $T_S$ and $T_T$ in an alternating fashion, and use early stopping on the dev performance of $T_T$. For model evaluation, we compute the F1-score using the standard CoNLL script. For all experiments, we report the average F1 score over 10 runs with different seeds. We follow Ruder and Plank (2017) for most configurations of the optimizer, and run 50 iterations. For both the STDD and DTDD scenarios, we select the top 50% of examples from $X_S$. For MTL we adapt the implementation from Reimers and Gurevych (2017), extending the Bayesian Optimization data selection framework from Ruder and Plank (2017) to support MTL.
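The alternating schedule with early stopping can be sketched as follows; `train_epoch` and `evaluate` are caller-supplied placeholders, and the patience value is illustrative rather than the exact configuration used in the experiments:

```python
def train_alternating(model, train_epoch, evaluate,
                      source_batches, target_batches, dev_target,
                      max_epochs=50, patience=5):
    """Alternate source-task and target-task updates each epoch, with early
    stopping on the target task's dev F1. `train_epoch` and `evaluate` are
    supplied by the caller (placeholders, not the actual implementation)."""
    best_f1, epochs_without_gain = 0.0, 0
    for _ in range(max_epochs):
        train_epoch(model, source_batches, task_id=0)    # auxiliary task (NER)
        train_epoch(model, target_batches, task_id=1)    # target task (ST)
        dev_f1 = evaluate(model, dev_target, task_id=1)  # CoNLL-style F1 on dev
        if dev_f1 > best_f1:
            best_f1, epochs_without_gain = dev_f1, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break                                    # early stopping
    return best_f1
```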

STDD Scenario: $T_S = T_T$, $D_S \neq D_T$
This scenario follows the same setup as Ruder and Plank (2017), where the source and the target task are the same but come from different domains, except that we apply data selection to a semantic sequence tagging task, namely NER. In this scenario, we use NER both as the source and the target task. The target domain is one of three OntoNotes sections, namely NW (news), TC (telephone conversation), and BC (broadcast conversation), while as source domain ($D_S$) we use all available OntoNotes sections except the one used as the target domain. We only use 10% of the training data for the target domain to simulate a limited-data setting. At the end of the data selection process, we select the top 50% of sentences from $D_S$ using the best feature weights learned with the Bayesian Optimizer.
Table 2(a) compares the performance of the baselines with the selection-based approaches. In general, we do not observe clear advantages of the data selection methods over the baselines, especially the all source data baseline. Using all source data yields the most competitive results in almost all cases. The only case in which DS surpasses the all source baseline is the BC domain, and only by a tiny margin. For the NW and BC domains, some DS methods show clear advantages over the random baseline, but are still worse than using all source data.
We want to see whether the distance between domains characterizes the performance of data selection. For this purpose, we quantify the similarity between each pair $D_S$ and $D_T$ with a measure based on Jensen-Shannon Divergence (JSD) (Lin, 1991), computed between the term distributions of $D_S$ and $D_T$. The average similarity of each target domain with respect to its source domains is 0.80 (TC), 0.86 (NW), and 0.87 (BC). We observe that the higher this similarity, the more beneficial data selection is for the target task. BC, which has the highest average similarity, benefits the most from data selection. On the other hand, TC, which has the lowest average similarity, shows the largest gap between the baseline and the best DS method (−1.7 F1 points).
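For reference, a minimal sketch of computing the term-distribution JSD between two domains is shown below. Note that scipy's `jensenshannon` returns the Jensen-Shannon distance (the square root of the divergence), and how the divergence is turned into the similarity values reported above is not specified here, so any such transformation is an assumption:

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

def term_distribution(sentences, vocab):
    """Relative term frequencies of a domain over a fixed shared vocabulary."""
    counts = Counter(tok for sent in sentences for tok in sent)
    freqs = np.array([counts[w] for w in vocab], dtype=float)
    return freqs / max(freqs.sum(), 1.0)

def domain_jsd(source_sents, target_sents):
    """JSD between the term distributions of two domains (lists of token
    lists). scipy returns the JS *distance*, i.e. the square root of the
    divergence, so we square it to obtain the divergence itself."""
    vocab = sorted({t for s in source_sents for t in s}
                   | {t for s in target_sents for t in s})
    p = term_distribution(source_sents, vocab)
    q = term_distribution(target_sents, vocab)
    return jensenshannon(p, q, base=2) ** 2
```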
Based on our experiments, for the STDD scenario we observe that:

1. In most cases, DS methods are inferior to the all source baseline. Yet, each domain has a different selection metric configuration that performs best. This observation suggests that the hypothesis from Ruder and Plank (2017), i.e., that different tasks or even different domains demand a different notion of selection metric, also applies to semantic sequence tagging tasks such as NER.

2. The gap between the best DS method and the baseline for each $D_T$ can be characterized by its average similarity to $D_S$: a target domain that is more similar to its source domains is in a better position to benefit from data selection.

DTDD Scenario: $T_S \neq T_T$, $D_S \neq D_T$
In this scenario we intend to observe whether data selection adds benefit to MTL. As in the STDD case, data selection is performed on the auxiliary task, where data is assumed to be abundant, and we only use a small portion of data for the target task. We use NER as the auxiliary task and ST as the target task. Prior work from Louvan and Magnini (2019) shows that NER is helpful for ST through MTL, although it is not clear whether adding data selection is beneficial. We follow the setup of Louvan and Magnini (2019), where OntoNotes NW is used as the auxiliary task, and the target task is one of the ST datasets with only 10% of the available training data. Observing the results in Table 2(b), in all cases the baselines, namely all source data and random selection, perform better than MTL with DS methods. The selection methods based on manual label mapping, $DS_{map}$, do not bring an advantage over all source data. Therefore, given two distant $D_S$ and $D_T$, selecting sentences based on the label mapping does not help. Moreover, as random selection also gives good results in most scenarios, this indicates that data selection is not beneficial in our experimental setting combining data selection and MTL.
Our findings and lessons learned for DTDD are the following:

1. We observe that MTL performs better than single-task learning (STL) for low-resource slot tagging, confirming the finding of Louvan and Magnini (2019). However, adding data selection to MTL is ineffective in our DTDD experimental setup. We hypothesize that MTL learns good common feature representations across tasks, inherently helping the model to focus on relevant features even from noisy data in $T_S$. In addition, due to data sparsity in the limited training data, using all the source data works better because the model may learn a better text representation (sentence encoder). Recent related work from Schröder and Biemann (2020), which uses an information-theoretic approach for estimating the usefulness of an auxiliary task for MTL, also found that for semantic sequence tagging tasks such as NER and argument mining it is less clear when a particular dataset is useful as an auxiliary task.

2. Data selection typically produces selected sentences with a concentrated similarity distribution. Therefore, it is probably ineffective when the sentence similarity distribution between $T_S$ and $T_T$ is already concentrated in a very narrow range; this can be checked with simple spread statistics, as sketched below.
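Point 2 can be checked cheaply before running the full selection pipeline; below is a minimal sketch that summarizes how concentrated the per-sentence similarity scores are (the choice of statistics, and any threshold applied to them, are our own suggestion):

```python
import numpy as np

def similarity_spread(per_sentence_similarities):
    """Summarize how concentrated the per-sentence similarity scores are.
    If the interquartile range is tiny, most source sentences look equally
    (dis)similar to the target, and ranking them adds little information."""
    sims = np.asarray(per_sentence_similarities, dtype=float)
    q1, q3 = np.percentile(sims, [25, 75])
    return {"std": float(sims.std()), "iqr": float(q3 - q1)}
```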

Conclusion
In this paper we investigated the benefit of data selection for transfer learning in several scenarios of increasing complexity. We applied an existing model-agnostic, state-of-the-art data selection framework, and carried out experiments on two semantic sequence tagging tasks, NER and slot tagging, and two transfer learning scenarios: STDD (Same Tasks Different Domains) and DTDD (Different Tasks Different Domains). For the STDD scenario, selection methods show potential when the target domain has the highest similarity to the source domains, as measured via Jensen-Shannon Divergence. As for the DTDD scenario, in which we use related tasks (NER and ST) from distant domains (news and conversational domains), data selection does not bring an advantage over using all the source data. A possible cause is that, because of data sparsity in the target task, it is only by injecting more source data that we can improve the model. Finally, MTL does not benefit from data selection, as MTL may already effectively help the model to focus on relevant features, even in the presence of noisy data from distant domains.