Multi-Task Active Learning for Neural Semantic Role Labeling on Low Resource Conversational Corpus

Most Semantic Role Labeling (SRL) approaches are supervised and require a significant amount of annotated data, and the annotation requires linguistic expertise. In this paper, we propose a Multi-Task Active Learning framework for SRL with Entity Recognition (ER) as the auxiliary task, both to alleviate the need for extensive data and to exploit the additional information ER provides. We evaluate our approach on an Indonesian conversational dataset. Our experiments show that multi-task active learning outperforms both single-task active learning and standard multi-task learning. According to our results, active learning is more efficient, using 12% less training data than passive learning in both the single-task and multi-task settings. We also introduce a new dataset for SRL in the Indonesian conversational domain to encourage further research in this area.


Introduction
Semantic Role Labeling (SRL) extracts predicate-argument structures from sentences (Jurafsky and Martin, 2006). It tries to recover information beyond syntax; in particular, information that answers the question of who did what to whom, when, why, and so on (Johansson and Nugues, 2008; Choi et al., 2010).
There have been many proposed SRL techniques, and the high-performing models are mostly supervised (Akbik and Li, 2016; Punyakanok et al., 2004). As supervised methods, these models are trained on relatively large annotated corpora. Building such corpora is expensive as it is laborious, time-consuming, and usually requires expertise in linguistics. For example, the PropBank annotation guideline by Choi et al. (2010) is around 90 pages, so it presents a steep learning curve even for annotators with a linguistic background. This difficulty hampers the creation of annotated data, especially for low-resource languages or new domains. Several approaches have been proposed to reduce the annotation effort. He et al. (2015) introduced a Question Answering-driven approach by casting a predicate as a question and its thematic role as an answer. Wang et al. (2017b) used active learning with semantic embeddings. Wang et al. (2017a) utilized annotation projection with hybrid crowd-sourcing to route hard instances to linguistic experts and easy instances to non-expert crowds.
Active learning is the most common method to reduce annotation effort: a model is used to minimize the amount of data to be annotated while maximizing its performance. In this paper, we propose to combine active learning with multi-task learning for Semantic Role Labeling, using a related linguistic task as an auxiliary task in an end-to-end role labeler. Our motivation for the multi-task method is in the same spirit as Gormley et al. (2014), who employed related syntactic tasks to improve SRL in low-resource languages through multi-task learning. Instead, we use Entity Recognition (ER) as the auxiliary task because ER is semantically related to SRL. For example, given the sentence Andy gives a book to John, in the SRL context Andy and John are labeled as AGENT and PATIENT (or BENEFACTOR) respectively, while in the ER context both are labeled as PERSON. Hence, although the labels differ, we hypothesize that ER carries useful information that can be leveraged to improve overall SRL performance.
Our contribution in this paper consists of two parts. First, we propose multi-task active learning with Semantic Role Labeling as the primary task and Entity Recognition as the auxiliary task. Second, we introduce a new dataset and annotation tags for Semantic Role Labeling built from conversational chat logs between a bot and human users. While much of the previous work studied SRL on large-scale English datasets in the news domain, our research explores SRL in Indonesian conversational language, which is still under-resourced.

Related Work
Active learning (AL) (Settles, 2012) is a method to improve the performance of a learner by iteratively asking for a new set of hypotheses to be labeled by human experts. A well-known method is pool-based AL, which selects hypotheses predicted from a pool of unlabeled data (Lewis and Gale, 1994). The most informative instance among the hypotheses is selected and added to the labeled data. The informativeness of an instance is measured by its uncertainty, which is inversely proportional to the learner's confidence in its prediction for that instance. In other words, the most informative instance is the one the model is least confident about.
There are two well-studied methods for sequence labeling with active learning. The first is maximum entropy: given an input sentence x, the probability of word $x_t$ having tag $j$ is

$$p(y_t = j \mid x_t; \theta) = \frac{\exp(\theta_j \cdot x_t)}{\sum_{k=1}^{K} \exp(\theta_k \cdot x_t)}, \qquad (1)$$

where $\theta$ denotes the model parameters and K is the number of tags. Uncertainty under maximum entropy can be defined using Token Entropy (TE) as described in Settles and Craven (2008) and Marcheggiani and Artières (2014):

$$\phi^{TE}(x_t) = -\sum_{j=1}^{K} p(y_t = j \mid x_t) \log p(y_t = j \mid x_t). \qquad (2)$$
From the token-level entropy (TE) in (2), we use a simple aggregation, summation, to select an instance, so that the instance x selected by Equation (3) is the least confident sample:

$$x^{TE} = \underset{x}{\arg\max} \sum_{t=1}^{T} \phi^{TE}(x_t), \qquad (3)$$

where $\sum_{t=1}^{T}(\cdot)$ greedily aggregates token entropies into a sentence-level entropy.
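The token-entropy query in Equations (2) and (3) can be sketched as follows; the tag distributions in the toy pool are hypothetical placeholders for a model's per-token posteriors.

```python
import math

def token_entropy(probs):
    """Entropy of a single token's tag distribution p(y_t = j | x_t), Eq. (2)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sentence_entropy(tag_probs):
    """Greedy sentence-level aggregation: sum of token entropies, Eq. (3)."""
    return sum(token_entropy(p) for p in tag_probs)

def query_least_confident(pool):
    """Select the index of the sentence with the highest summed token entropy."""
    return max(range(len(pool)), key=lambda i: sentence_entropy(pool[i]))

# Toy pool of two sentences over K = 3 tags: the second sentence's
# near-uniform distributions make it the most uncertain (most informative).
pool = [
    [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],   # confident predictions
    [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]],  # uncertain predictions
]
print(query_least_confident(pool))  # -> 1
```

The summation is the simplest aggregation; normalizing by sentence length is a common variant that avoids a bias toward long sentences.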
Another well-studied sequence labeling method with active learning is Conditional Random Fields (CRFs) (Lafferty et al., 2001), where the probability of a label sequence $y = \{y_1, y_2, \ldots, y_T\}$ given a sequence of observed vectors $x = \{x_1, x_2, \ldots, x_T\}$, with a joint log-potential function $\psi(y_{t-1}, y_t, x_t)$ over unary and transition parameters, is defined as

$$p(y \mid x) = \frac{\exp\left(\sum_{t=1}^{T} \psi(y_{t-1}, y_t, x_t)\right)}{Z(x)}, \qquad (4)$$

where $Z(x)$ is the partition function. Uncertainty in CRFs can be obtained from Viterbi decoding by selecting the instance from the unlabeled pool whose best label sequence has the lowest probability:

$$x^{VE} = \underset{x}{\arg\min}\; p(y^{*} \mid x), \qquad (5)$$

where $p(y^{*} \mid \cdot)$ is the probability assigned to the best sequence $y^{*}$ by the Viterbi inference algorithm (Marcheggiani and Artières, 2014).
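A minimal sketch of the Viterbi-confidence query in Equation (5), assuming emission and transition scores are given as log-probabilities (a simplification: a real CRF normalizes globally over all paths via Z(x)):

```python
import math

def viterbi_best_prob(emis, trans):
    """Return the probability of the single best tag path, given per-token
    log-scores emis[t][j] and transition log-scores trans[i][j].
    Simplified sketch: scores are assumed to be log-probabilities."""
    K = len(emis[0])
    best = list(emis[0])  # best log-score of a path ending in each tag
    for t in range(1, len(emis)):
        best = [max(best[i] + trans[i][j] + emis[t][j] for i in range(K))
                for j in range(K)]
    return math.exp(max(best))

def query_viterbi_least_confident(pool, trans):
    """Eq. (5): choose the sentence whose best Viterbi path is least probable."""
    return min(range(len(pool)), key=lambda i: viterbi_best_prob(pool[i], trans))

# Toy pool of single-token sentences over K = 2 tags.
trans = [[math.log(0.5)] * 2] * 2
pool = [
    [[math.log(0.9), math.log(0.1)]],    # confident
    [[math.log(0.55), math.log(0.45)]],  # least confident -> selected
]
print(query_viterbi_least_confident(pool, trans))  # -> 1
```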
Multi-Task Learning Instead of training one task per model independently, one can use related labels to jointly optimize multiple tasks in a single learning process. This method is commonly known as multi-task learning (MTL) or parallel transfer learning (Caruana, 1997). Our motivation for multi-task learning is to leverage annotations that are "easier" to obtain than semantic roles to regularize the model through related tasks. Previous work on multi-task learning for Semantic Role Labeling by Collobert et al. (2011) did not report any significant improvement on the SRL task. A recent work (Marasovic and Frank, 2017) used SRL as the auxiliary task with opinion role labeling as the main task.
Multi-Task Active Learning Previous work on multi-task active learning (MT-AL) (Reichart et al., 2008) focused on formulating a method that maintains performance across a set of tasks rather than a single task. In a multi-task active learning scenario, optimizing a set of task classifiers can be regarded as a meta-protocol combining each task's query strategy into a single query method. In the one-sided task query setting, the uncertainty strategy of one selected task classifier is used to query unlabeled samples. In the multiple-task setting, the uncertainty of an instance is the aggregate of the classifiers' uncertainties over all tasks.

Figure 1: Model overview. Four-layer Highway LSTM; the SRL task uses a Conditional Random Field (CRF) output layer for sequence labeling.

Proposed Method
In this section, we explain how we incorporate both AL and MTL in our neural network architecture. We use the state-of-the-art SRL model of He et al. (2017) as our base model, as shown in Figure 1.
Our model is a modification of He et al.'s work. Our first adjustment is to use a CRF as the last layer instead of softmax, following the notable improvements reported by Reimers and Gurevych (2017) for both role labeling and entity recognition. In this setup, we use a CRF layer for the primary task (SRL) (Zhou and Xu, 2015) and a softmax layer for the auxiliary task; the auxiliary task acts as a regularizer (Caruana, 1997). Second, we use character embeddings with a Convolutional Neural Network as the character encoder (Ma and Hovy, 2016) to handle the out-of-vocabulary problem caused by misspelled words, slang, and abbreviations common in informal chat. The character encoding is combined with word embeddings and a predicate indicator feature embedding as the input features for a Highway LSTM.
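The CNN character encoder idea (Ma and Hovy, 2016) can be sketched as below. The filter width of 5 mirrors the character 5-grams of the experiment section and the 50 output channels match the 50-dimensional character encoding reported there; the character embedding dimension and vocabulary size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 100-character vocabulary, 30-dim character embeddings,
# width-5 filters, 50 output channels (the last matches the paper's 50-dim
# character encoder; the others are illustrative).
CHAR_VOCAB, CHAR_DIM, WIDTH, CHANNELS = 100, 30, 5, 50
char_emb = rng.normal(size=(CHAR_VOCAB, CHAR_DIM))
filters = rng.normal(size=(CHANNELS, WIDTH * CHAR_DIM))

def encode_word(char_ids):
    """CNN character encoder: embed characters, zero-pad, convolve with
    width-5 filters, then max-pool over character positions."""
    pad = WIDTH // 2
    emb = np.vstack([np.zeros((pad, CHAR_DIM)),
                     char_emb[char_ids],
                     np.zeros((pad, CHAR_DIM))])
    windows = np.stack([emb[i:i + WIDTH].ravel()
                        for i in range(len(char_ids))])
    conv = np.tanh(windows @ filters.T)  # (word_length, CHANNELS)
    return conv.max(axis=0)              # max-pool -> fixed-size vector

vec = encode_word([3, 14, 15, 9, 2])  # character ids for one word
print(vec.shape)  # -> (50,)
```

Because the output is pooled over positions, misspelled or abbreviated words still receive a representation built from their character n-grams rather than a single unknown-word vector.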
In the multi-task learning configuration, we share parameters in the embedding and sequence encoder layers; only the outermost module is task-specific. We optimize the parameters jointly by minimizing the summed loss

$$L(y^{s}, y^{e} \mid x, \theta, \psi) = L(\hat{y}^{s}, y^{s} \mid x, \theta) + L(\hat{y}^{e}, y^{e} \mid x, \psi),$$

where the first term is the SRL loss with parameters $\theta$ and the second is the ER loss with parameters $\psi$.

Multi-Task Active Learning In the multiple-task scenario, we use the rank combination of Reichart et al. (2008), which combines each task's query strategy into an overall rank by summing the per-task uncertainty ranks of each unlabeled instance. Note that in both one-sided and combined-rank multi-task active learning, we return the gold labels of all tasks for the queried instances, to be used in training the multi-task models.
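The combined-rank query can be sketched as follows; the uncertainty scores here are hypothetical placeholders for the per-task entropy or Viterbi confidence described in the related work section.

```python
def ranks(scores):
    """Rank instances by uncertainty: rank 0 = most uncertain."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def rank_combination_query(srl_uncertainty, er_uncertainty):
    """Combined-rank query (Reichart et al., 2008): sum each instance's
    per-task uncertainty ranks and select the lowest combined rank."""
    combined = [a + b for a, b in zip(ranks(srl_uncertainty),
                                      ranks(er_uncertainty))]
    return min(range(len(combined)), key=lambda i: combined[i])

# Instance 2 is the most uncertain for ER and nearly so for SRL, so the
# combined rank selects it over the instances favored by only one task.
srl = [0.9, 0.2, 0.8]
er  = [0.3, 0.5, 0.9]
print(rank_combination_query(srl, er))  # -> 2
```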
As a multi-task active learning baseline, instead of one-sided AL, which queries with a pre-determined task for all iterations, we use random task selection to draw which task serves as the query strategy in the i-th iteration. Random task selection is implemented with random multinomial sampling; the selected task then queries instances using standard uncertainty sampling.

Dataset
This research presents a dataset of conversations between human users and a virtual friend bot. The annotated messages are user inquiries or responses to the bot. Private information in the original data, such as names, emails, and addresses, was anonymized. Three annotators with a linguistic background performed the annotation. We used a set of semantic roles adapted for informal, conversational language; Table 1 shows some examples. The dataset consists of 6057 unique sentences containing predicates.
The semantic roles used are a subset of PropBank roles (Palmer et al., 2005). We also added a new role, GREET. In our collected data, Indonesian speakers tend to call out the name of the person they are talking to. Because such a mention frequently co-occurs with another role, we felt the need to mark this addressed entity with a new role. For example, in the sentence Hi Andy! I brought you a present, the role GREET captures "Andy" as the addressee while "you" is labeled PATIENT, instead of leaving "Andy" unassigned.
In our second task, Entity Recognition (ER), we annotated the same sentences after the SRL annotation. We used common labels such as PERSON, LOCATION, ORGANIZATION, and MISC as our entity tags. Unlike Named Entity Recognition (NER), ER also tags nominal references such as "I" and "you" and referential locations like "di sana" (EN: over there). While this tagging might raise the question of whether the tags overlap with SRL, we argue that entity labels are less ambiguous than role arguments, which depend on the predicate. An example can be seen in Table 1, where both I and you are tagged as PERSON whereas their roles vary. For this task, we used semi-automatic annotation with brat (Stenetorp et al., 2012). The annotations were then checked and corrected by four people and one linguistic expert.
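The dual annotation can be illustrated with the running example from the introduction, where both persons share an ER tag but differ in their semantic roles. The BIO tag sequences below are our own illustration, not excerpts from the annotation guideline.

```python
# Parallel SRL and ER annotation of one sentence (illustrative BIO tags).
tokens = ["Andy", "gives", "a", "book", "to", "John"]
srl    = ["B-AGENT", "O", "B-PATIENT", "I-PATIENT", "O", "B-BENEFACTOR"]
er     = ["B-PERSON", "O", "O", "O", "O", "B-PERSON"]

def person_roles(tokens, srl, er):
    """Tokens tagged PERSON by ER together with their semantic role:
    the kind of cross-task signal a shared encoder can exploit."""
    return [(t, s.split("-", 1)[1]) for t, s, e in zip(tokens, srl, er)
            if e.endswith("PERSON") and s != "O"]

print(person_roles(tokens, srl, er))
# -> [('Andy', 'AGENT'), ('John', 'BENEFACTOR')]
```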

Experiment Scenario
The purpose of the experiments is to understand whether multi-task learning and active learning improve SRL performance compared to the baseline model (SRL with no AL). In this section, we focus on four scenarios: single-task SRL, single-task SRL with AL, MTL, and MTL with AL.

Model Architecture Our model consists of word embeddings, a character 5-gram encoder using a CNN, and predicate embeddings as inputs, with dimensions 50, 50, and 100 respectively. These inputs are concatenated into a 200-dimensional vector, which is then fed into a two-layer Highway LSTM with 300 hidden units.

Initialization
The word embeddings were initialized with unsupervised pre-trained values obtained by training word2vec (Mikolov et al., 2013) on the dataset. Word tokens were lowercased, while characters were not.

Training Configurations We trained for 10 epochs using AdaDelta (Zeiler, 2012) with ρ = 0.95 and ε = 1e−6. We also employed early stopping with patience set to 3. We split our data into 80% training, 10% validation, and 10% test for the fully supervised scenario. For the active learning scenario, we further split the training data into labeled and unlabeled parts, using two kinds of split: 50:50 and 85:15. For the 50:50 split, we queried 100 sentences per epoch. For the 85:15 split, we used a smaller query of 10 sentences per epoch, to keep the total number of queried sentences below the amount of fully supervised training data over 10 epochs. The query sizes were tuned on the validation set.
As for the AL query method, in single-task SRL we used random sampling and uncertainty sampling queries; SRL with 100% of the training data and SRL with random queries serve as baselines. In MTL SRL, we employed random task selection and rank combination. Results are shown in Table 2. Our baseline multi-task model (SRL+ER with no AL) has higher precision than the single-task (SRL) model. Starting from 85% labeled training data, our model requested a total of 87% of the training data over 10 epochs. In this scenario, our proposed multi-task active learning with rank combination outperforms the single-task active learning models. Figure 2 presents the F1 score learning curve for each model.
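The 87% figure follows from the splits and query sizes above; a quick check (the 4845 training-sentence count appears later in the conclusion):

```python
# Query-budget arithmetic for the 85:15 active learning scenario.
train = 4845                 # training sentences (80% of the 6057-sentence corpus)
labeled = int(0.85 * train)  # initial labeled pool
queried = 10 * 10            # 10 sentences per epoch for 10 epochs
used = labeled + queried
print(round(100 * used / train))  # -> 87
```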
Significance test We performed a two-tailed significance test (t-test) using 5-fold cross-validation over the training and test parts of the corpus. The multi-task learning model is better than the single-task learning one (p < 0.05). However, neither the single-task nor the multi-task learning scenario is significantly better than multi-task active learning starting from 85% or 50% of the training data, since the p-values between these model pairs are greater than 0.05. Accepting the null hypothesis therefore indicates that multi-task active learning with 50% or 85% initial data performs comparably to multi-task or single-task learning on the full dataset.

We draw the confusion matrix of the multi-task active learning model with 85% initial training data in Figure 3 to analyze model performance. We observe several common errors. The largest source of mistakes is PATIENT false positives: the model incorrectly labeled 59% of non-roles as PATIENT. Another prominent error is the 21% false negative rate over all gold roles; the model primarily failed to tag 37% of gold BENEFACTOR and 35% of gold TIME spans. Unlike in English SRL, we found that label confusion happens less frequently than other error types. Based on these percentages, we investigated the errors by drawing samples and broke the incorrect predictions down into several error types.
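The paired t-test over matched fold scores can be sketched in a few lines; the per-fold F1 scores below are hypothetical, and 2.776 is the two-tailed critical value of the t-distribution at α = 0.05 with df = 4 (five folds).

```python
import math

def paired_t_statistic(a, b):
    """Paired t statistic over matched per-fold scores."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)

# Hypothetical 5-fold F1 scores for two models; with df = 4, the two-tailed
# critical value at alpha = 0.05 is 2.776, so |t| > 2.776 implies p < 0.05.
multi  = [0.62, 0.64, 0.61, 0.63, 0.65]
single = [0.60, 0.61, 0.59, 0.61, 0.62]
t = paired_t_statistic(multi, single)
print(abs(t) > 2.776)  # -> True
```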

False Negative Spans
False negatives in semantic role labeling are roles in the gold data that have no corresponding span match in the model prediction. False negatives for AGENT encompass 69% of the 45 AGENT gold span errors, while the errors in TIME roles all fall into this type. In Table 4, the left example shows that the model failed to tag "ini komputer" (EN: this is a computer). In the right example, the model did not recognize "get rick nya" as PATIENT. An interesting remark is how the model failed to tag because the predicate is an unknown word in the training vocabulary, despite the use of the character encoder to alleviate the out-of-vocabulary problem. In the left example, the predicate "menjawab" is also an unknown word in the vocabulary but not a mistyped one, while the right example's predicate "di donlot" is an informal spelling of "download". In the 50% training data scenario, we found that the multi-task active learning model achieves lower recall than the single-task active learning model; the multi-task model with 50% initial training data fails to tag 53% of BENEFACTOR labels.

Boundary Error
Overall, we found that boundary errors contribute 22% of the total span exact-match errors. For example, PATIENT boundary errors mostly occurred because predicted role spans do not cover the continuation of the role. As shown in Table 5, in the top example the model failed to recognize "makanan" (EN: food) as the continuation of "info" (EN: info). In the bottom example, the model failed to predict the continuation of the mistyped role "sahabar".

Role Confusion
Role confusion is defined as a match between a gold span and a predicted span that carry different labels. This error type occurs the least, compared to false negatives and boundary errors: only 7% of the total errors. The most common confusion is between gold PATIENT and predicted AGENT. As shown in the top sentence of Table 6, the model incorrectly labeled a PATIENT (Jemma) as an AGENT. The model also incorrectly tagged BENEFACTOR as PATIENT. In the bottom sentence, the word "Aku" (EN: I) is not annotated with any role but is detected as an AGENT by the model.
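The three error types above can be operationalized roughly as follows, treating each role as a half-open (start, end, label) span; this is our own sketch of the categorization, not the paper's evaluation script.

```python
def categorize_errors(gold, pred):
    """Classify each gold span (start, end, label) against predicted spans:
    'match', 'role_confusion' (same span, different label),
    'boundary' (overlapping but mismatched span), or
    'false_negative' (no overlapping prediction at all)."""
    out = {}
    for g in gold:
        gs, ge, gl = g
        same_span = [p for p in pred if (p[0], p[1]) == (gs, ge)]
        overlap = [p for p in pred if p[0] < ge and gs < p[1]]
        if any(p[2] == gl for p in same_span):
            out[g] = "match"
        elif same_span:
            out[g] = "role_confusion"
        elif overlap:
            out[g] = "boundary"
        else:
            out[g] = "false_negative"
    return out

gold = [(0, 2, "AGENT"), (3, 5, "PATIENT"), (6, 8, "TIME")]
pred = [(0, 2, "AGENT"), (3, 5, "AGENT"), (6, 9, "TIME")]
print(categorize_errors(gold, pred))
# -> {(0, 2, 'AGENT'): 'match', (3, 5, 'PATIENT'): 'role_confusion',
#     (6, 8, 'TIME'): 'boundary'}
```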

Conclusion & Future Work
In this paper, we applied a previous state-of-the-art deep semantic role labeling model to a low-resource language in a conversational domain. We propose combining multi-task and active learning methods into a single framework to achieve competitive SRL performance with less training data, and to leverage a semantically related task for SRL.
Our primary motivation is to apply the framework for low resource languages in terms of dataset size and domains.
Our experiments demonstrate that the active learning method performs comparably to the single-task baseline while using about 30% fewer data, querying a total of 3483 of 4845 sentences. This can be pushed marginally further to outperform the baseline using 87% of the training data. Our error analysis reveals obstacles that differ from English SRL, to be addressed in future work. While He et al.'s deep Highway LSTM model allows learning the relation between a predicate and its arguments, not all tasks in multi-task learning have equal complexity requiring deep layers. Søgaard and Goldberg (2016) proposed supervising tasks of different complexity at different layer depths: for example, predicting entity recognition tags at lower layers, or inserting predicate features at higher layers of an LSTM, because entity recognition does not need predicates as features and is a lower-level task than SRL.
Combining multi-task learning with an unsupervised task such as language modeling (Rei, 2017) is also a possible improvement, yielding a semi-supervised variant of multi-task active learning. Analyzing other active learning methods, such as query-by-committee, variance reduction (Settles and Craven, 2008), and information density (Wang et al., 2017b), in multi-task settings is also a promising direction for deep learning architectures.