A Multitask Active Learning Framework for Natural Language Understanding

Natural language understanding (NLU) aims at identifying user intent and extracting semantic slots. This requires sufficient annotated data to achieve acceptable performance in real-world situations. Active learning (AL) has been well studied as a way to reduce the amount of annotated data needed and has been successfully applied to NLU. However, no research has investigated how the relation information between intents and slots can improve the efficiency of AL algorithms. In this paper, we propose a multitask AL framework for NLU. Our framework enables pool-based AL algorithms to make use of the relation information between sub-tasks provided by a joint model, and we propose an efficient computation for the entropy of a joint model. Experimental results show that our framework can achieve competitive performance with less training data than baseline methods on all datasets. We also demonstrate that, when entropy is used as the query strategy, the model with complete relation information can perform better than those with partial information. Additionally, we demonstrate that the active learning algorithms in our framework remain effective when combined with Bidirectional Encoder Representations from Transformers (BERT).


Introduction
Voice assistants on mobile phones are becoming increasingly intelligent. Given an utterance from the user, natural language understanding (NLU) aims at identifying the user's intent and extracting semantic information, which helps a voice assistant convert the utterance into an executable instruction. The task can be separated into two sub-tasks called intent detection (ID) and slot filling (SF) (Tur and De Mori, 2011). Table 1 shows an example of ID and SF for the query "Flights from Ontario to Orlando".

Query    Flights   from   Ontario     to   Orlando
Slots    O         O      B-fromloc   O    B-toloc
Intent   atis_flight

Table 1: An example from user query to semantic frame.

Traditional pipeline models solved the two sub-tasks individually (Haffner et al., 2003; Raymond and Riccardi, 2007; Yao et al., 2014). Considering the possible error propagation caused by the pipeline method, joint models have been proposed to exploit the relation information between ID and SF (Liu and Lane, 2016; Goo et al., 2018; Chen et al., 2019a; Qin et al., 2019).
Insufficient data is another bottleneck in real-world situations. Both ID and SF need a considerable amount of training data to reach satisfactory performance, and acquiring annotated data is non-trivial and expensive. Active learning (AL) offers a promising solution to this bottleneck (Settles, 2009). It allows the model to query certain unlabeled samples to be annotated for training. For multitask frameworks, researchers have proposed methods that select samples by considering task priority or dependency (Reichart et al., 2008; Zhang, 2010; Fang and Tao, 2015). In recent years, researchers have tried to apply AL algorithms to the NLU task. Peshterliev et al. (2019) proposed a CRF-based ensemble AL method for detecting samples from a new domain. Chen et al. (2019b) focused on ID and proposed a method based on both QBC (Seung et al., 1992) and uncertainty sampling (Lewis and Gale, 1994). Dimovski et al. (2018) introduced a submodularity-inspired selection method for SF. However, to the best of our knowledge, most existing research on AL for NLU does not make use of the relation information between ID and SF from a joint model. This motivates us to build a multitask AL framework for NLU that selects samples benefiting ID and SF jointly.
* Equal contributions. This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
Our contributions in this paper are as follows:
• We propose a multitask AL framework for NLU. We employ a joint model in our framework to model the relation information between ID and SF and to provide the necessary marginal distributions for pool-based AL algorithms.
• We implement representative pool-based AL algorithms, including Least Confidence (Settles, 2009), Margin Sampling (Scheffer et al., 2001) and Entropy (Shannon, 1948), under the multitask scenario. We also propose an efficient computation for the entropy of a joint model by dynamic programming (DP).
• Through simulated experiments, we show that our AL framework can achieve competitive performance with less training data than baseline methods on all datasets.
• We demonstrate that these AL algorithms remain effective when the joint model is changed to Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019).

Related Work
Active learning (AL): AL is a method to improve the performance of a model using as few labeled samples as possible (Settles, 2009). A well-known scenario is pool-based AL, which uses different sampling strategies to evaluate informativeness over a large pool of unlabeled data. The main categories are summarized by Settles (2009), who introduced many different sampling methods developed for pool-based AL, along with their applications to various tasks. Expected Gradient Length (EGL) and Expected Error Reduction (EER) (Roy and McCallum, 2001) are pool-based AL frameworks based on decision-theoretic methods. EGL aims to select the utterances that cause the largest change in the training gradient, and EER tries to select the samples that maximally reduce the expected future error. However, these two methods are computationally expensive in the NLU scenario because of the large feature space. Seung et al. (1992) studied another sample selection framework, Query-By-Committee (QBC). It estimates query uncertainty with a committee of models and selects the informative samples on which the committee disagrees most. Since it needs a committee of models that are all trained on the labeled set, it requires considerable training time and extra memory. Our work mainly focuses on evaluating uncertainty sampling, which is the simplest and most commonly used query strategy in NLU.
Natural language understanding (NLU): According to whether ID and SF are separated in the model, NLU models can be categorized into independent modeling methods and joint modeling methods. Independent methods include the hierarchical attention network (Yang et al., 2016) and the adversarial multitask learning framework (Liu et al., 2017), which mainly focus on ID. Others, for SF, include the encoder-labeler deep LSTM (Kurata et al., 2016) and the joint pointer and attention model (Zhao and Feng, 2018).
As for joint models, recent papers mainly construct neural network (NN) models to make use of the relation information between ID and SF. These include the slot-gated model with attention (Goo et al., 2018), joint BERT (Chen et al., 2019a), the bi-directional interrelated model (E et al., 2019) and the stack-propagation method (Qin et al., 2019). Additionally, Jeong and Lee (2008) developed a joint NLU model with a triangular-chain conditional random field (TriCRF), which is a unified probabilistic model combining two related problems. Xu and Sarikaya (2013) proposed a CNN version of the TriCRF model.
Multitask Active Learning (MTAL): MTAL is proposed to select samples that are more informative to annotate for all tasks involved. Reichart et al. (2008) focused on using a rank combination protocol to rank tasks and then choose the best samples for the chosen task, which can keep the performance for a set of tasks close to that of single selection for every task. In the multitask active learning setting, however, this method did not leverage inter-task association for selecting instances.
In recent years, several works have evaluated the possibility of applying active learning methods to the NLU task. Chen et al. (2019b) proposed an AL method for automatically selecting the most informative samples for labeling, which mainly focused on intent identification, and Dimovski et al. (2018) introduced a submodularity-inspired selection AL method and applied this selection criterion to the problem of slot filling. Peshterliev et al. (2019) targeted AL for new domains in NLU and explored the Majority-CRF algorithm to select live utterances, which achieved a statistically significant improvement compared to random sampling and traditional AL methods. Although their method selected samples considering the information from ID and SF at the same time, it did not utilize the inter-task relationship between the two tasks.

Multitask Active Learning for NLU
Our main goal is to enable AL algorithms to select samples for both tasks simultaneously. To achieve this, the model should be able to directly provide the joint probability of intents and slots given the input sequence. This joint probability contains the aforementioned relation information between ID and SF. Any model that can provide this joint probability fits directly into our framework.
For the above reasons, we build our joint model on those of Jeong and Lee (2008) and Huang et al. (2015), as shown in Figure 1. This section briefly introduces the necessary components of our joint model and shows how to apply AL algorithms using these components.
BiLSTM-TriCRF joint model: Jeong and Lee (2008) introduced the Triangular Chain Conditional Random Field (TriCRF) as a joint model for NLU. TriCRF jointly represents the slots and intents in a probabilistic graphical model, which can encode their dependencies and uncertainty.
In TriCRF, the potential of input sequence $x = (x_1, \ldots, x_T)$, slot sequence $y = (y_1, \ldots, y_T)$ and intent $z$ is described as

$$\Phi(x, y, z) = \phi(z, x) \prod_{t=1}^{T} \varphi^1_t(y_t, x)\, \varphi^2_t(y_t, y_{t-1})\, \varphi^3_t(z, y_t) \qquad (1)$$

The conditional distribution $p_\theta(z, y \mid x)$ is defined as

$$p_\theta(z, y \mid x) = \frac{1}{Z(x)}\, \Phi(x, y, z) \qquad (2)$$

where $Z(x)$ is the partition function, which sums $\Phi(x, y, z)$ over all slot sequences and intents, and $\phi(z, x)$ is the potential of input sequence $x$ with intent $z$.

Figure 2: Structure for the recursive computation of the joint entropy $H(y, z \mid x)$, showing the intermediate entropies for slot $y_t$, for the state transition between $y_t$ and $y_{t-1}$, and for intent $z$ and slot $y_t$.

Following Huang et al. (2015), the output of the BiLSTM, $F = (f_1, \ldots, f_T)$, contains the transition from input to slot at each time step $t$. The entry $[A]_{i,j}$ of the transition score matrix is the score of the transition from the $i$-th slot to the $j$-th slot for a pair of consecutive time steps. We use an intent-slot transition matrix whose entry $[B]_{i,j}$ contains the transition from the $i$-th intent to the $j$-th slot. Therefore, $\varphi^1_t(y_t, x)$, $\varphi^2_t(y_t, y_{t-1})$ and $\varphi^3_t(z, y_t)$ are computed from $f_t$, $[A]_{y_{t-1}, y_t}$ and $[B]_{z, y_t}$ respectively, where $\lambda$ is the parameter to be learned. In our model, the feature $I$ is computed as $I = \sum_{i=1}^{T} f_i$ to provide the intent information, and $\phi(z, x)$ is computed from $I$, where $\lambda_1$ is the parameter to be learned. See Jeong and Lee (2008) and Huang et al. (2015) for more details. After getting the output $z$ and $y$ from Equation (2), the loss of the joint model is defined over the parameters $\theta = (\lambda, \lambda_1, A, B, W_{BiLSTM})$.
Query Strategies: Our work mainly relies on the most common family of methods, uncertainty sampling strategies. Under these strategies, the most useful training samples are those with the highest uncertainty in the learner's predictive distribution. We evaluate the following general heuristics for the NLU model. The first one is Least Confidence (LC) (Settles, 2009):

$$x^*_{LC} = \arg\max_{x}\; 1 - p_\theta(y^*, z^* \mid x)$$

where $(y^*, z^*) = \arg\max_{y,z} p_\theta(y, z \mid x)$ is the most likely slot sequence and intent under the joint probability distribution.
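As an illustration only, the joint distribution and the Least Confidence score can be sketched on a toy example by brute-force enumeration. The sketch assumes a log-linear form in which each potential is the exponential of a score (an assumption made here for illustration; see Jeong and Lee (2008) for the exact expressions), and the names `f`, `A`, `B`, `g`, `joint_distribution` and `least_confidence` are hypothetical. Enumeration is only feasible for very short sequences.

```python
import itertools
import math


def joint_log_score(f, A, B, g, y, z):
    """Unnormalized log-potential of slot path y and intent z.

    Assumed log-linear form (illustrative, not the paper's exact definition):
    f[t][y_t] input-to-slot score, A[i][j] slot-to-slot transition score,
    B[z][j] intent-to-slot score, g[z] intent score.
    """
    score = g[z]
    for t, y_t in enumerate(y):
        score += f[t][y_t] + B[z][y_t]
        if t > 0:
            score += A[y[t - 1]][y_t]
    return score


def joint_distribution(f, A, B, g):
    """Return p(y, z | x) for every (slot path, intent) pair by enumeration."""
    n_steps, n_slots, n_intents = len(f), len(A), len(g)
    scores = {}
    for z in range(n_intents):
        for y in itertools.product(range(n_slots), repeat=n_steps):
            scores[(y, z)] = joint_log_score(f, A, B, g, y, z)
    # Normalize with a numerically stable log-sum-exp (the partition function).
    m = max(scores.values())
    log_Z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {key: math.exp(s - log_Z) for key, s in scores.items()}


def least_confidence(dist):
    """LC score: one minus the probability of the most likely (y, z)."""
    return 1.0 - max(dist.values())


# Toy scores: 3 time steps, 2 slot labels, 2 intents (made-up numbers).
f = [[1.0, 0.2], [0.1, 0.9], [0.5, 0.5]]
A = [[0.3, -0.2], [0.0, 0.4]]
B = [[0.2, 0.0], [-0.1, 0.6]]
g = [0.5, 0.1]

dist = joint_distribution(f, A, B, g)
lc = least_confidence(dist)
```

A pool sample with a larger `lc` value would be queried earlier under this strategy, since the model is less confident in its single best joint prediction.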
Scheffer et al. (2001) proposed another method called Margin Sampling (M):

$$x^*_{M} = \arg\min_{x}\; p_\theta(y^*_1, z^*_1 \mid x) - p_\theta(y^*_2, z^*_2 \mid x)$$

Here, $(y^*_1, z^*_1)$ and $(y^*_2, z^*_2)$ are the first and second best slot sequences and intents under the current model, respectively. This method aims to select the instance with the smallest margin between the posteriors of its two most likely slot sequences and intents. Such an instance is one for which the learner has difficulty distinguishing between two possible intents and slot paths. When selecting the two best joint probabilities over slot sequences and intents, we need to consider all intents rather than only the top two joint probabilities under the single most probable intent.
Considering the growing number of intents and slot sequence paths, we adopt Algorithm 1 to accelerate the calculation.

Algorithm 1 Calculate margin sampling for NLU
Input:
    V ← Viterbi Algorithm
    V2 ← Viterbi Top2 Algorithm
    z = z_1, z_2, ..., z_n ← all intents
    Argmax2 ← returns the two largest values and their indices in the input
Output: top1score − top2score
for z = z_1, z_2, ..., z_n do
    calculate the score ints[z_i] of the most likely state sequence for intent z_i using the Viterbi Algorithm
end for
find the two largest values (int1s, int2s) and their intent indices (int1id, int2id) in ints using Argmax2
find the second largest joint probability int1s2 for int1id using the Viterbi Top2 Algorithm
the best joint probability over all intents, top1score, is int1s
compare int1s2 with int2s; the larger one is the second largest joint probability over all intents, top2score
return top1score − top2score
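The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the implementation used in the experiments: it assumes log-linear scores (`f`, `A`, `B`, `g` are hypothetical toy inputs, as is every function name), uses a two-best Viterbi recursion for the "Viterbi Top2 Algorithm", and normalizes with a forward-algorithm partition function so that the margin is expressed in probabilities.

```python
import heapq
import math

NEG_INF = float("-inf")


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def viterbi_top2(unary, A):
    """Log-scores of the best and second-best slot paths for one intent."""
    n = len(A)
    # best[y]: up to two highest log-scores of partial paths ending in slot y
    best = [[unary[0][y]] for y in range(n)]
    for t in range(1, len(unary)):
        best = [
            heapq.nlargest(
                2, [s + A[p][y] + unary[t][y] for p in range(n) for s in best[p]]
            )
            for y in range(n)
        ]
    top = heapq.nlargest(2, [s for y in range(n) for s in best[y]])
    return top + [NEG_INF] * (2 - len(top))  # pad if only one path exists


def log_partition(unary, A):
    """Forward algorithm: log of the sum over all slot paths for one intent."""
    n = len(A)
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [
            log_sum_exp([alpha[p] + A[p][y] for p in range(n)]) + unary[t][y]
            for y in range(n)
        ]
    return log_sum_exp(alpha)


def margin_score(f, A, B, g):
    """top1score - top2score over all (slot path, intent) pairs."""
    per_intent, log_Zs = [], []
    for z in range(len(g)):
        # Assumed log-linear scores: intent-slot score added to each slot score.
        unary = [[f[t][y] + B[z][y] for y in range(len(A))] for t in range(len(f))]
        per_intent.append([g[z] + s for s in viterbi_top2(unary, A)])
        log_Zs.append(g[z] + log_partition(unary, A))
    log_Z = log_sum_exp(log_Zs)
    # Rank intents by the score of their best path (the Argmax2 step).
    order = sorted(range(len(g)), key=lambda z: per_intent[z][0], reverse=True)
    int1s, int1s2 = per_intent[order[0]]
    int2s = per_intent[order[1]][0] if len(order) > 1 else NEG_INF
    top2 = max(int1s2, int2s)  # second best over all intents
    return math.exp(int1s - log_Z) - math.exp(top2 - log_Z)


# Toy scores: 3 time steps, 2 slot labels, 2 intents (made-up numbers).
f = [[1.0, 0.2], [0.1, 0.9], [0.5, 0.5]]
A = [[0.3, -0.2], [0.0, 0.4]]
B = [[0.2, 0.0], [-0.1, 0.6]]
g = [0.5, 0.1]
margin = margin_score(f, A, B, g)
```

The second-best joint hypothesis is either the runner-up path under the best intent or the best path under the runner-up intent, so only one two-best decoding and one comparison are needed after the per-intent Viterbi pass.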
Another, more common uncertainty-based method is entropy (H) (Shannon, 1948):

$$x^*_{H} = \arg\max_{x}\; -\sum_{y, z} p_\theta(y, z \mid x) \log p_\theta(y, z \mid x) \qquad (9)$$

where $(y, z)$ ranges over all possible slot sequences and intents for a sequence. This quantity represents the information needed to "encode" the joint distribution over $(y, z)$, and it can be thought of as a summary of the posterior of our model over its slots and intents. In practice, since the number of possible label sequences increases sharply with the length of the sequence, the number of candidate combinations of intents and labels grows exponentially. Previous work by Kim et al. (2006) employed N-best Sequence Entropy (NSE), that is, the entropy of the N parses with the highest probabilities. In our scenario, the sum is restricted to $N = (y^*_1, y^*_2, \ldots, y^*_n)$, the set of the N-best state sequences under a certain intent. It is an approximation that makes the computation of entropy feasible. Mann and McCallum (2007) and Hernando et al. (2005) introduced an efficient calculation of the true sequence entropy in linear-chain CRFs. We use this algorithm to derive Algorithm 2, which calculates the true joint entropy of Equation (9) for our model:

$$H(y, z \mid x) = H(z \mid x) + H(y \mid z, x)$$

Using this decomposition, we can define a dynamic program over the entropy. The detailed computation and the proof of Algorithm 2 for NLU are given in Appendix A.

Algorithm 2 Calculate joint entropy for NLU
Input:
    z = z_1, z_2, ..., z_n ← all intents
    n ← the length of the sequence
    P_Z^z ← marginal distribution for intent z    // details in Equation (15)
    P_X^{z_i} ← partition function for one intent z_i    // details in Equation (16)
    TAGS = t_1, t_2, ..., t_j, starttag, endtag ← all possible tag values for y
Output: H(y, z | x)
for z = z_1, z_2, ..., z_n do
    calculate the intent entropy intent_entropy using P_Z^{z_i}
return H(y, z | x) = sequence_entropy + intent_entropy
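To make the decomposition concrete, the sketch below computes the exact joint entropy of a toy model without enumerating paths. It is only an illustration under stated assumptions: the scores are assumed log-linear (`f`, `A`, `B`, `g` are hypothetical toy inputs), the per-intent sequence entropy uses a standard forward-style entropy recursion in the spirit of Hernando et al. (2005), and the steps are not claimed to match Appendix A line by line.

```python
import math


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_entropy(unary, A):
    """Exact entropy of slot paths for one intent, plus its log-partition.

    H[y] is the entropy of the path prefix given that the current slot is y.
    """
    n = len(A)
    log_alpha = list(unary[0])
    H = [0.0] * n  # an empty prefix has zero entropy
    for t in range(1, len(unary)):
        new_alpha, new_H = [], []
        for y in range(n):
            terms = [log_alpha[p] + A[p][y] + unary[t][y] for p in range(n)]
            la = log_sum_exp(terms)
            h = 0.0
            for p in range(n):
                log_c = terms[p] - la  # log p(y_{t-1} = p | y_t = y, x)
                h += math.exp(log_c) * (H[p] - log_c)
            new_alpha.append(la)
            new_H.append(h)
        log_alpha, H = new_alpha, new_H
    log_Z = log_sum_exp(log_alpha)
    entropy = 0.0
    for y in range(n):
        log_p = log_alpha[y] - log_Z  # log p(y_T = y | x)
        entropy += math.exp(log_p) * (H[y] - log_p)
    return entropy, log_Z


def joint_entropy(f, A, B, g):
    """H(y, z | x) = H(z | x) + sum_z p(z | x) H(y | z, x)."""
    seq_H, log_Zs = [], []
    for z in range(len(g)):
        # Assumed log-linear scores: intent-slot score added to each slot score.
        unary = [[f[t][y] + B[z][y] for y in range(len(A))] for t in range(len(f))]
        h, log_Z = sequence_entropy(unary, A)
        seq_H.append(h)
        log_Zs.append(g[z] + log_Z)
    total = log_sum_exp(log_Zs)
    p_z = [math.exp(v - total) for v in log_Zs]  # intent marginal p(z | x)
    intent_entropy = -sum(p * math.log(p) for p in p_z if p > 0.0)
    sequence_part = sum(p * h for p, h in zip(p_z, seq_H))
    return intent_entropy + sequence_part


# Toy scores: 3 time steps, 2 slot labels, 2 intents (made-up numbers).
f = [[1.0, 0.2], [0.1, 0.9], [0.5, 0.5]]
A = [[0.3, -0.2], [0.0, 0.4]]
B = [[0.2, 0.0], [-0.1, 0.6]]
g = [0.5, 0.1]
H_joint = joint_entropy(f, A, B, g)
```

The cost per intent is quadratic in the number of slot labels and linear in the sequence length, whereas direct evaluation of the sum over all slot sequences grows exponentially with the length.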

Experiments
Dataset: Schuster et al. (2019) contributed a dataset that contains 43k labelled English utterances across three domains: Facebook-Weather, Facebook-Alarm and Facebook-Reminder. In our experiments, each domain is treated as an independent dataset. The ATIS dataset (Tur et al., 2010) is the most commonly used dataset for the NLU task and consists of spoken queries on flight-related information. Its training, development and test sets contain 4478, 500 and 893 queries respectively. The Snips dataset (Coucke et al., 2018) is collected from the Snips personal voice assistant. Its training, development and test sets consist of 13084, 700 and 700 utterances respectively. The size of a dataset is denoted $D_i$, where the index $i$ is the name of the dataset.

Figure 3: Overall accuracy of our framework and the baseline framework on five datasets.

Table 2: Detailed values for all query strategies on all evaluation datasets. For each dataset (column), the best result is shown in bold and the second best is underlined.
Framework Architecture: The hidden size of BiLSTM-TriCRF is 256 and the batch size is 512. Word embeddings are implemented directly in PyTorch. We use Adam (Kingma and Ba, 2015) as the optimizer, and the learning rate is set to 0.01. Every selected sample consumes one unit of annotation budget. We call the number of samples annotated in each round the query size. For each dataset, the query size is set to $D_i/100$.
Active Learning Baseline: Two schemes are implemented and compared in our experiments: (a) a pipeline model, which does not consider the relation information between ID and SF, and (b) passive random sampling, which randomly selects samples from the unlabeled data pool. We follow Majority-CRF (Peshterliev et al., 2019) to build the pipeline model. Their framework uses Least Confidence as the basic query strategy and uses the QBC method to improve the quality of the query strategy. Additionally, they reported that their method remains usable when the models are changed. We adjust several settings in this pipeline model to make it more suitable for comparison in our experiments. The samples selected by the pipeline framework are then provided to the same NLU joint model used in our framework.
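The annotation budget and query size described above correspond to a standard pool-based loop, which can be sketched as follows. This is a schematic illustration, not our training code: `score_fn` and `train_fn` are hypothetical placeholders for a query strategy and for retraining the joint model, and the seed size and number of rounds are arbitrary assumptions.

```python
import random


def active_learning_loop(pool, seed_size, score_fn, train_fn, rounds, rng):
    """Pool-based AL: each round annotates the `query_size` top-scored samples."""
    query_size = max(1, len(pool) // 100)  # query size D_i / 100
    unlabeled = list(pool)
    rng.shuffle(unlabeled)  # the seed set is chosen by random selection
    labeled, unlabeled = unlabeled[:seed_size], unlabeled[seed_size:]
    model = train_fn(labeled)
    history = [len(labeled)]
    for _ in range(rounds):
        if not unlabeled:
            break
        # Higher score = more uncertain = queried first.
        ranked = sorted(unlabeled, key=lambda x: score_fn(model, x), reverse=True)
        batch, unlabeled = ranked[:query_size], ranked[query_size:]
        labeled.extend(batch)  # each selected sample costs one budget unit
        model = train_fn(labeled)
        history.append(len(labeled))
    return labeled, history


# Toy stand-ins (hypothetical): the "model" is just the labeled-set size and
# the "uncertainty" is an arbitrary deterministic function of the sample id.
labeled, history = active_learning_loop(
    pool=list(range(1000)),
    seed_size=50,
    score_fn=lambda model, x: (x * 37) % 101,
    train_fn=lambda labeled_set: len(labeled_set),
    rounds=3,
    rng=random.Random(0),
)
```

Replacing `score_fn` with a constant and shuffling the pool would give the passive random sampling baseline under the same budget.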
Framework Performance: For each dataset, we use all of the data to test the performance of the different active learning algorithms. All algorithms start training from a randomly selected seed set. For each dataset, the different AL algorithms should eventually reach the same performance, since the data and the training model are the same. We treat the final overall accuracy as a performance baseline and record the smallest number of samples that each AL algorithm needs to reach this baseline. We call this quantity $Q_{needed}$ in our experiments. The percentage of $Q_{needed}$ on each dataset is computed as $Q_{needed}/D_i$. For each AL method, we also compute the percentage change relative to random sampling. We notice that the accuracy curve does not increase monotonically as the number of annotations increases. Therefore, using $Q_{needed}$ may introduce randomness into the results. To mitigate this effect, we also use the area under the curve (AUC) as a metric. Figure 3 shows the results on the five datasets, and Table 2 shows detailed values for $Q_{needed}$, the percentage and the percentage change. Table 3 in Appendix B shows the detailed AUC values for all datasets.
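The two metrics can be illustrated with a short sketch. The numbers below are made up for illustration, the function names are hypothetical, and the trapezoidal rule without normalization is an assumption about how the AUC is computed.

```python
def q_needed(sizes, accuracies, target):
    """Smallest number of annotated samples at which accuracy reaches target."""
    for n, acc in zip(sizes, accuracies):
        if acc >= target:
            return n
    return None  # target never reached


def percentage_change(q_method, q_random):
    """Change of Q_needed relative to random sampling, in percent."""
    return 100.0 * (q_method - q_random) / q_random


def area_under_curve(sizes, accuracies):
    """Trapezoidal area under the accuracy curve (unnormalized; an assumption)."""
    return sum(
        (sizes[i + 1] - sizes[i]) * (accuracies[i] + accuracies[i + 1]) / 2.0
        for i in range(len(sizes) - 1)
    )


# Made-up accuracy curves for one dataset; the target is the final accuracy.
sizes = [100, 200, 300, 400, 500]
random_acc = [0.60, 0.70, 0.78, 0.80, 0.85]
al_acc = [0.60, 0.78, 0.85, 0.84, 0.85]  # note the non-monotonic dip
target = 0.85

q_random = q_needed(sizes, random_acc, target)
q_al = q_needed(sizes, al_acc, target)
change = percentage_change(q_al, q_random)
auc_random = area_under_curve(sizes, random_acc)
auc_al = area_under_curve(sizes, al_acc)
```

In this toy case the AL curve first reaches the target after 300 samples and then dips, which is why a single threshold crossing can be noisy and why the area under the whole curve is a useful complementary metric.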
In the experimental results, $Q_{needed}$ and AUC show similar tendencies on the five datasets. All AL algorithms, including Majority-CRF, perform better than random sampling on most datasets. Majority-CRF reduces $Q_{needed}$ by about 9% on average relative to random sampling, while our methods achieve reductions of more than 40%. This indicates that our framework, with the relation information between ID and SF, does have a positive impact on these pool-based AL algorithms. We observe that Majority-CRF does not work better than random sampling on the Facebook-Alarm dataset in terms of $Q_{needed}$. A possible reason is that Majority-CRF was originally designed for acquiring data from a new domain, so it is more likely to be influenced by the distribution of the dataset. The results also show that the exact computation of entropy brings better performance than the approximation, although it costs more computation time.
Influence of Multitask: This experiment investigates what difference it makes when MTAL considers only a single task rather than both tasks. We use Entropy as the basic query strategy. The AL strategies here consider only slot information or only intent information in the computation, and provide the selected samples to the same joint model used in our framework. We call them Intent Entropy and Slot Entropy in our experiments. The detailed computation is shown in Appendix A. Figure 4 shows the results on the three Facebook datasets.
In the figure, the three entropy computations show little difference at the early stage of training. As the number of queried samples grows, Intent Entropy stabilizes earlier than Entropy and Slot Entropy. After a certain point, Entropy and Slot Entropy consistently achieve higher overall accuracy than Intent Entropy. We notice that Entropy and Slot Entropy show almost the same performance on these datasets. According to Equation (25), Entropy is approximated as the sum of Intent Entropy and Slot Entropy. As the sequence length grows, Slot Entropy also grows while Intent Entropy stays almost the same. Therefore, Slot Entropy becomes the decisive part of Entropy and shows similar performance to Entropy. With the help of Intent Entropy, Entropy can outperform Slot Entropy on three of the datasets according to the results. These results indicate that the completeness of information is crucial for the complete NLU task.
Influence of Different Base Model: To demonstrate that our framework remains effective when the base model is changed, we introduce BERT (Devlin et al., 2019) into our framework and conduct experiments on all datasets. The output of the BiLSTM is replaced by the output of BERT, and the feature $I$ is replaced by the special embedding ([CLS]) to provide intent information. The other components are kept the same as in the previous settings. The results are shown in Figure 5 in Appendix B.
In the figure, all AL methods still reach $Q_{needed}$ earlier than random sampling. Although BERT provides richer semantic information and gives the model higher overall accuracy than the previous model, it is still difficult for random sampling to keep improving after the early stage. With the help of uncertainty sampling, the model can quickly escape this plateau and reach $Q_{needed}$. This indicates that our framework remains effective when the base model is changed.

Conclusion
In this paper, we propose a multitask active learning framework for NLU that focuses on making use of the relation information between ID and SF. We implement representative pool-based query strategies, including Least Confidence, Margin Sampling and Entropy, in our framework. We also provide an efficient computation for the entropy of a joint model. Experimental results show that the above query strategies within our framework can achieve competitive performance with less training data than baseline methods on all datasets. The results also demonstrate that making use of the relation information between tasks may achieve better performance than considering only intents. Additionally, our framework is still useful when the model is changed. These results suggest that the framework has the potential to be applied in industrial use.

Table 3: Detailed AUC values for all query strategies on all evaluation datasets. For each dataset (column), the best result is shown in bold and the second best is underlined.