Multi-Task Learning for Argumentation Mining in Low-Resource Settings

We investigate whether and where multi-task learning (MTL) can improve performance on NLP problems related to argumentation mining (AM), in particular argument component identification. Our results show that MTL performs particularly well (and better than single-task learning) when little training data is available for the main task, a common scenario in AM. Our findings challenge previous assumptions that conceptualizations across AM datasets are divergent and that MTL is difficult for semantic or higher-level tasks.


Introduction
Computational argumentation mining (AM) deals with the automatic identification of argumentative structures within natural language. This can be beneficial in many applications such as summarizing arguments in texts to improve comprehensibility for end-users, or information retrieval and extraction (Persing and Ng, 2016). A common task is to segment a text into argumentative and non-argumentative components and identify the type of argumentative components. As an illustration, consider the (simplified) example from Eger et al. (2017): "Since [it killed many marine lives]_Premise [tourism has threatened nature]_Claim." Here, the non-argumentative token "Since" is followed by two argumentative components: a premise that supports a claim.
Argumentation is highly subjective and conceptualized in different ways (Peldszus and Stede, 2013; Al-Khatib et al., 2017). On the one hand, this implies that creating reliable ground-truth datasets for AM is costly, as it requires trained annotators. However, even trained annotators have problems identifying and classifying arguments reliably in texts (Habernal and Gurevych, 2017). To tackle AM in a new domain or develop new AM tasks, it may thus not be possible to create large datasets as required by most state-of-the-art machine learning approaches. On the other hand, the different conceptualizations of argumentation resulted in AM corpora with different argument component types, with very little conceptual overlap between some of these corpora (Daxenberger et al., 2017). This distinguishes AM from more established NLP tasks like discourse parsing (Braud et al., 2016) and makes it particularly challenging. Therefore, a natural question is how to handle new AM datasets in a new domain and with sparse data.
Here, we investigate how existing AM datasets from different domains and with different conceptualizations of arguments can be leveraged to tackle these challenges. More precisely, we study whether conceptually diverse AM datasets from different domains can help deal with new AM datasets when data is limited. A promising direction to incorporate existing datasets as "auxiliary knowledge" is by means of multi-task learning (MTL), a paradigm that dates back to the 1990s (Caruana, 1993, 1996), but has only recently gained large attention (Collobert et al., 2011; Søgaard and Goldberg, 2016; Hashimoto et al., 2017). The idea behind MTL is to learn several tasks jointly, similarly to human learning, so that tasks serve as mutual sources of "inductive bias" for one another. MTL has been reported to be particularly beneficial when tasks exhibit "natural hierarchies" (Søgaard and Goldberg, 2016) or when the amount of training data for the main task is sparse (Benton et al., 2017; Augenstein and Søgaard, 2017), where the auxiliary tasks may act as regularizers to prevent overfitting (Ruder et al., 2017). The latter is precisely the scenario most relevant to us.
In this paper, we (1) investigate to which de-
MTL has been applied in many different settings. Bollmann and Søgaard (2016) and Peng and Dredze (2017) use data from different domains as different tasks and thereby improve historical spelling normalization and Chinese word segmentation and NER, respectively. Plank et al. (2016) apply an MTL setup to POS tagging across 22 different languages, where the auxiliary task is to predict token frequency. Eger et al. (2017) explore sub-tasks (such as component identification) of a complex AM tagging problem (including relations between components) as auxiliaries and find that this improves performance. However, they stay within one single domain and dataset, and thus their approach does not address the question of how new AM datasets with sparse data can profit from existing AM resources. Conceptually closest to our work, Braud et al. (2017) leverage data from different languages as well as different domains in order to improve discourse parsing. While MTL was shown to be effective for syntactic tasks under certain conditions (Søgaard and Goldberg, 2016), Alonso and Plank (2017) find that MTL does not improve performance in four out of five semantic (i.e., higher-level) tasks that they study. We are among the first to perform a structured investigation of MTL for higher-level pragmatic tasks, which are thought to be much more challenging than syntactic tasks (Alonso and Plank, 2017), and in particular, explore it for AM in cross-domain settings.

Experiments
Data. We experiment with six datasets for argument component identification, i.e. the token-level segmentation and typing of components. These datasets are all of different sizes, have different average text lengths, and have different argument component types and label distributions, as summarized in Table 1. We only choose datasets containing both argumentative components and non-argumentative text. Claims are available in five of the six datasets, and all datasets have premises (or "justification"), although it is unclear how large the conceptual overlap is across datasets. Further component types are idiosyncratic. The hotel dataset has the largest number of types, namely six. Most datasets also come with further information, e.g. relations between argument components, which are not considered here.
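Approach. Due to the differences in the annotations used in the different datasets, we consider each dataset as a separate AM task. We treat all of them as sequence tagging problems, where predicting BIO tags (argument segmentation) and argument component types (component classification) is framed as a joint task. This is achieved through token-level BIO tagging with the label set {O} ∪ {B, I} × T, where T is a dataset-specific set of argument component types, e.g. T = {claim, premise, ...}. Thus, the overall number of tags in each dataset is twice the number of non-"O" component types plus one (2 · |T| + 1). We use the state-of-the-art framework by Reimers and Gurevych (2017) for both single-task learning (STL) and MTL. It employs a bidirectional LSTM (BILSTM) model with a CRF layer over individual LSTM outputs to account for label dependencies. We use nadam as optimizer. For MTL, the recurrent layers of the deep BILSTM are shared by all tasks, with a separate CRF layer for each task. All tasks terminate at the same level. The main task determines the number of mini-batches used for training, i.e. in every iteration the main task is trained on all its mini-batches and all other (auxiliary) tasks are trained on the same number of (randomly drawn) mini-batches.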
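As a minimal sketch of the joint label set described above, the following Python snippet builds {O} ∪ {B, I} × T for a small set of component types and tags the example sentence from the introduction. The string format "B-claim" and the helper name build_label_set are assumptions made for illustration; they are not taken from the framework used in the experiments.

```python
from itertools import product


def build_label_set(component_types):
    """Return the joint tag set {O} ∪ ({B, I} × T) as a sorted list of strings."""
    tags = {"O"}
    for prefix, ctype in product(("B", "I"), component_types):
        # The "B-claim" style of joining prefix and type is an assumed convention.
        tags.add(f"{prefix}-{ctype}")
    return sorted(tags)


# Illustrative component types T (a dataset-specific choice).
types = ["claim", "premise"]
labels = build_label_set(types)

# Token-level tagging of the simplified example from the introduction.
tokens = ["Since", "it", "killed", "many", "marine", "lives",
          "tourism", "has", "threatened", "nature", "."]
tags = ["O", "B-premise", "I-premise", "I-premise", "I-premise", "I-premise",
        "B-claim", "I-claim", "I-claim", "I-claim", "O"]

print(labels)
print(len(labels) == 2 * len(types) + 1)
```

Segmentation and typing are thus predicted with a single tag per token, so no separate classification step is needed after the components have been segmented.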
To simulate data sparsity, we experiment with different sizes of training data for the main task. We first draw a "sparse" training set of 21K tokens for each of the six AM datasets and a dev set of 9K tokens to simulate a sparse scenario with 30K given tokens. The remaining data of each specific dataset is used as its test set (at least 5K tokens). We then randomly draw subsets of the training data to create three more "sparsity scenarios" with 12K, 6K, and 1K tokens, respectively. Both dev and test set remain the same as in the 21K scenario. It is worth emphasizing how little data is used in the 1K scenario: only 1-10 documents (or roughly 50 sentences). We train a separate STL system for each of the six datasets and each of the four sparsity scenarios. In the MTL setup, the respective sparsity data is used as the main task, while all other (auxiliary) AM datasets, each considered a separate task, are trained on all their available data. To measure the effect of MTL as opposed to a mere increase of training data, we furthermore train for each main task (i.e. each dataset and sparsity scenario) an STL system on the union of the training data of main and auxiliary tasks, and evaluate it on the main task's test data.
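The construction of the sparsity scenarios can be sketched as follows. This is only one possible reading: it assumes that whole documents are drawn at random until a token budget is reached, which the text above does not specify, and it uses a toy corpus of identical hypothetical documents. The function name draw_by_token_budget is likewise hypothetical.

```python
import random


def draw_by_token_budget(documents, budget, rng):
    """Randomly pick whole documents until at least `budget` tokens are collected.

    Returns the indices of the chosen documents and the number of tokens drawn.
    """
    order = list(range(len(documents)))
    rng.shuffle(order)
    chosen, n_tokens = [], 0
    for idx in order:
        if n_tokens >= budget:
            break
        chosen.append(idx)
        n_tokens += len(documents[idx])
    return chosen, n_tokens


rng = random.Random(0)

# Toy corpus: 400 hypothetical documents of 100 tokens each.
corpus = [["tok"] * 100 for _ in range(400)]

# Largest ("21K") training scenario drawn from the full corpus.
train_ids, train_size = draw_by_token_budget(corpus, 21_000, rng)
train_docs = [corpus[i] for i in train_ids]

# Smaller scenarios are drawn as subsets of the 21K training data.
scenarios = {}
for budget in (12_000, 6_000, 1_000):
    ids, size = draw_by_token_budget(train_docs, budget, rng)
    scenarios[budget] = (ids, size)

print(train_size, {b: s for b, (_, s) in scenarios.items()})
```

Because the smaller scenarios index into train_docs, every sparsity scenario is a subset of the 21K training set, while dev and test data stay untouched.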
Hyperparameter optimization. For each sparsity scenario and dataset we train 50 STL/MTL systems using GloVe embeddings (Pennington et al., 2014) and 50 using the embeddings by Komninos and Manandhar (2016). For each run we randomly choose a layout with either one hidden layer of h ∈ {50, 100, 150} units or two layers of 100 units, as well as variational dropout rates between 0.2 and 0.5 for the input layer and for the hidden units.
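The random search over configurations might be sketched like this. The sketch assumes that the four layouts are equally likely and that dropout rates are drawn uniformly from [0.2, 0.5]; neither detail is stated above. The dictionary keys and the embedding identifiers "glove" and "komninos" are hypothetical names used only for illustration.

```python
import random

# One hidden layer of 50, 100, or 150 units, or two layers of 100 units.
LAYOUTS = [[50], [100], [150], [100, 100]]


def sample_configuration(rng):
    """Draw one random hyperparameter configuration (assumed uniform sampling)."""
    return {
        "hidden_layers": rng.choice(LAYOUTS),
        "dropout_input": round(rng.uniform(0.2, 0.5), 2),
        "dropout_hidden": round(rng.uniform(0.2, 0.5), 2),
    }


rng = random.Random(42)

# 50 runs per embedding type, i.e. 100 runs per dataset and sparsity scenario.
runs = [
    dict(sample_configuration(rng), embeddings=emb)
    for emb in ("glove", "komninos")
    for _ in range(50)
]

print(len(runs), runs[0])
```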

Results
Note that we experiment with artificially shrunk datasets, which makes our results incomparable with those reported for the full datasets in other works. Nevertheless, it is to be expected that our STL model is on par with results obtained in recent works also using neural models for argument component identification, since our state-of-the-art BILSTM has the same architecture as the one by Eger et al. (2017).
Overall trends. Table 2 reports the average macro-F1 test scores over the respective ten best (according to the macro-F1 dev scores) hyperparameter configurations. We compare STL on each task, MTL with all remaining datasets as auxiliary tasks, and the union baseline. For three of the six datasets, MTL yields a significant improvement in all sparsity scenarios. Interestingly, these are the datasets with only one or two types of argument components. For the other three datasets, MTL only yields an improvement in the sparser data scenarios. The union baseline generally performs (considerably) worse than STL in all scenarios. This implies that the domains and component types (label spaces) used in the different AM datasets are too diverse to model them as one single task and that the improvement of MTL over STL cannot be attributed to more available data.
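The model selection step behind the reported averages can be sketched as follows: runs are ranked by their dev score and the test scores of the ten best are averaged. The run records and their scores below are invented toy values, not results from our experiments, and the field names dev_f1 and test_f1 are hypothetical.

```python
def mean_test_of_best(runs, k=10):
    """Average the test macro-F1 of the k runs with the highest dev macro-F1."""
    best = sorted(runs, key=lambda r: r["dev_f1"], reverse=True)[:k]
    return sum(r["test_f1"] for r in best) / len(best)


# Toy runs with made-up scores: test is always 0.02 below dev.
toy_runs = [{"dev_f1": i / 100, "test_f1": i / 100 - 0.02} for i in range(100)]

reported = mean_test_of_best(toy_runs, k=10)
print(round(reported, 3))
```

Selecting on dev scores and reporting test scores keeps the test data out of the hyperparameter choice.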
Figure 1 shows the general trends of our results. For each dataset, the figure plots the difference between normalized MTL and normalized STL macro-F1 scores, MTL_norm(k) − STL_norm(k), for k = 1K, 6K, 12K, 21K training tokens for the main task. For each specific dataset, the normalized macro-F1 score is defined as σ_norm(k) = σ(k) / STL(1K), where σ(k) is the original macro-F1 score and STL(1K) denotes the STL score for 1K training data. Using this normalization, all scores are directly comparable and have the interpretation of improvement over the STL scenario with 1K tokens. It is noteworthy that MTL always improves over STL when the main task is very sparse (1K) and gains are sometimes substantial (between 30 and 40% for web, essays, and hotel).
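The quantity plotted in Figure 1 can be computed as in the sketch below. It assumes that the normalization divides each macro-F1 score by the STL score obtained with 1K tokens, which is how we read the description of scores as "improvement over the STL scenario with 1K tokens". All numbers are invented for illustration and do not correspond to any dataset in Table 2.

```python
def normalized_gain(mtl, stl, baseline_key="1K"):
    """Return MTL_norm(k) - STL_norm(k) for every training size k.

    Assumed normalization: score(k) / STL(baseline_key).
    """
    base = stl[baseline_key]
    return {k: mtl[k] / base - stl[k] / base for k in stl}


# Made-up macro-F1 scores for one hypothetical dataset.
stl_scores = {"1K": 0.20, "6K": 0.30, "12K": 0.36, "21K": 0.40}
mtl_scores = {"1K": 0.27, "6K": 0.33, "12K": 0.37, "21K": 0.40}

gains = normalized_gain(mtl_scores, stl_scores)
for size, gain in gains.items():
    print(size, round(gain, 2))
```

Under this assumption, the value at 1K is the relative improvement of MTL over STL, so a difference of 0.35 reads as a 35% gain.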
We observe three different patterns with respect to the main task: (i) For essays, web, and hotel, MTL is considerably better than STL when the main task is sparse, but for 21K tokens we observe either minimal gains or losses from MTL compared to STL. (ii) The var and news datasets are stable, with consistent small gains from MTL over STL for all sizes of the main task. Finally, (iii) wiki displays an unusual pattern in that MTL gains are increasing with the amount of training data. We attribute this to the large label imbalance in wiki, where nearly 70% of the data is 'O'. When training data is very sparse, STL predicts 99% of all tokens as 'O', which results in a high F1 score for this component type but very low F1 scores (below 1%) for the two other component types. The macro-F1 is thus lower than that of MTL, where 'claim' and 'premise' have a higher F1 score. Even though STL improves on the identification of 'premise' and 'claim' in the 21K scenario, the trend remains, since MTL also improves.
Detailed analysis. Upon closer inspection, we find that across all datasets MTL generally improves performance for class labels with low frequency as compared to STL. The more training data becomes available, the better STL gets in predicting such class labels, thus closing the gap to MTL. However, for wiki even 21K does not seem sufficient for STL to learn the two infrequent class labels, predicting 87% as 'O', so MTL still yields more than 10pp higher F1 for these infrequent classes.
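The effect of label imbalance on macro-F1 described for wiki can be illustrated with a small worked example. The sketch assumes token-level F1 over three coarse classes and a toy label distribution of 70% 'O'; a tagger that predicts 'O' everywhere scores well on 'O' but zero on the other two classes, which pulls the macro average down. The counts are invented and the function name per_class_f1 is hypothetical.

```python
def per_class_f1(gold, pred, labels):
    """Compute token-level F1 for each label from parallel gold/predicted tags."""
    scores = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


labels = ["O", "claim", "premise"]

# Toy gold distribution: 70 'O' tokens, 15 'claim' tokens, 15 'premise' tokens.
gold = ["O"] * 70 + ["claim"] * 15 + ["premise"] * 15
# A degenerate tagger that labels every token as 'O'.
pred = ["O"] * 100

scores = per_class_f1(gold, pred, labels)
macro_f1 = sum(scores.values()) / len(labels)
print(scores, round(macro_f1, 3))
```

Here F1 for 'O' is 140/170, about 0.82, while the macro-F1 is only about 0.27, because the two infrequent classes contribute zero.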
Further analysis of our results reveals that the increase in the overall F1 score for MTL over STL is due to both improved component segmentation (BIO labeling) and better type prediction. For example, in the 21K and 6K data settings, the BIO labeling improves by 1-4pp macro-F1 for nearly all datasets and even by up to 17% for wiki. Unsurprisingly, in most cases, MTL also reduces invalid BIO sequences ('O' followed by 'I'). Regarding the F1 scores of argument component types, we observe an improvement of MTL over STL for claims or major claims in all datasets containing these types and for premises in all but one dataset. It is further interesting that for the hotel dataset, MTL confuses premises mainly with the semantically similar implicit premises, whereas STL confuses premises with claims. Moreover, in both hotel and essays, claims are rarely predicted to be major claims, but major claims are predicted to be claims (both STL and MTL).
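Counting the invalid BIO sequences mentioned above ('O' followed by 'I') is straightforward; a minimal sketch is given below. It assumes tags are strings of the form "I-claim", which is an illustrative convention, and it deliberately counts only the 'O' to 'I' transition named in the text, not other irregularities such as a type change inside a component.

```python
def count_o_then_i(tags):
    """Count positions where an 'I-*' tag directly follows an 'O' tag."""
    return sum(
        1
        for prev, cur in zip(tags, tags[1:])
        if prev == "O" and cur.startswith("I-")
    )


# Toy predicted sequences (invented for illustration).
valid = ["O", "B-claim", "I-claim", "O", "B-premise", "I-premise"]
invalid = ["O", "I-claim", "I-claim", "O", "B-premise", "I-premise", "O", "I-premise"]

print(count_o_then_i(valid), count_o_then_i(invalid))
```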
These results indicate that, despite the different domains and label spaces of the six datasets, MTL appears to learn generalized cross-domain representations of argument components, which aid argument component identification in sparse data scenarios and across domains.

Concluding Remarks
We showed that MTL improves performance over STL on AM tasks, particularly when training data is sparse. More precisely, argument component identification on a small AM dataset improves when treating other AM datasets as auxiliary tasks, even if these include different component types and come from diverse domains. Overall, our results challenge the view that MTL is only infrequently effective for semantic or higher-level tasks (Alonso and Plank, 2017). We attribute the success of MTL over STL to a few factors in our setting: (1) Alonso and Plank (2017) used syntactic auxiliary tasks for semantic main tasks, whereas we choose only higher-level auxiliary tasks for higher-level main tasks. (2) The label spaces of all our tasks are relatively small, so that generalized representations can be learned in the LSTMs' hidden layers without suffering from label sparsity. (3) The AM tasks considered here apparently do share common ground, a finding worth mentioning in itself given the contrary evidence in related work (Daxenberger et al., 2017).
Our findings cannot be readily anticipated from previous research, which has reached mixed conclusions regarding the effectiveness of MTL overall and of particular aspects, such as the size of the main task. For example, while Luong et al. (2016) suggest that the success of MTL requires that the auxiliary task does not swamp the main task data, Benton et al. (2017) and Yang et al. (2017) come to the opposite conclusion that MTL is particularly effective when the data of the main task is small, and Bingel and Søgaard (2017) find a low correlation between the size of the main task and MTL success. Our curves in Figure 1 appear to support the view that MTL is effective when the main task training data is sparse.
The scope for future work is vast. For example, it would be interesting to investigate whether standard low-level tasks, such as POS tagging or chunking, are effective for AM. Furthermore, other architectures for multi-task learning that apply soft parameter sharing, such as sluice networks (Ruder et al., 2017), will be investigated.

Figure 1: MTL versus STL: curves ∆(k) = MTL_norm(k) − STL_norm(k) as a function of the size k of the main task.

Table 1: AM datasets: C - claim, P - premise, O - non-argumentative; numbers in parentheses are label distributions in %; 'tokens' is the average in each document.

Table 2: Macro-F1 for AM component identification, comparing MTL, STL (significant differences in bold with p < 0.01, or p < 0.05 if marked with *, using the Mann-Whitney U test), and the union baseline (BL).