Multi-Task Learning for Logically Dependent Tasks from the Perspective of Causal Inference



Introduction
Multi-task learning (MTL) has received increasing interest for its potential to transfer knowledge among related tasks (Caruana, 1997; Ruder et al., 2017). Recently, hierarchical MTL models (Hashimoto et al., 2017; Sanh et al., 2018) were proposed for tasks with dependencies and achieved better performance than democratic ones. In these models, the encoders of different tasks are stacked, and the proposition is that complex tasks at top layers require deep processing to capture semantically richer features, while simple tasks at bottom layers require only shallow processing. However, many hierarchical MTL models only consider the dependencies of feature representations and ignore the label dependencies. A direct comparison is shown in Figure 1. Let X denote a given input, and Y_m and Y_n denote the outputs of two tasks, respectively. The first two MTL schemes (Liu et al., 2017; Zheng et al., 2018; Sanh et al., 2018) substantially model the likelihood by P(Y_m, Y_n|X) = P(Y_m|X)P(Y_n|X), while the third MTL scheme (Bekoulis et al., 2018; Luan et al., 2019) models it by P(Y_m, Y_n|X) = P(Y_m|X)P(Y_n|X, Y_m). From the causal perspective, the first two schemes assume that Y_m and Y_n are conditionally independent, while the third scheme assumes that Y_m has a causal effect on Y_n. In this paper, we suggest that this causal effect is important for logically dependent tasks, and we propose a mechanism referred to as label transfer (LT), which lets a task utilize the labels of all its lower-level tasks.

Figure 1: Three types of MTL schemes for two tasks t_m and t_n: (a) Democratic MTL shares an encoder S and owns task-specific encoders T_i (i = m or n); (b) Hierarchical MTL stacks those task-specific encoders; (c) Logically dependent MTL further considers the label dependencies by re-utilizing low-level tasks' labels.
When utilizing discrete labels, another issue remains. For example, the strategy in (Bekoulis et al., 2018) used gold labels of low-level tasks during training and predicted ones during testing. Apparently, there was a train-test discrepancy, which leads to cascading errors between tasks. It is similar to the exposure bias problem in the field of text generation (Ranzato et al., 2016). Recently, two approaches have been investigated to overcome this problem: Gumbel Sampling (GS) (Kusner and Hernández-Lobato, 2016; Nie et al., 2019) and Reinforcement Learning (RL) (Yu et al., 2017). In this paper, we regard logically dependent MTL as a task-level label generation problem, and we incorporate GS because it feeds the optimizer with low-variance gradients, improving the stability and speed of training over RL. Specifically, our model samples a label from the predicted probability distribution for each task and feeds it to its higher-level tasks. Back-propagated gradients will then penalize wrong predictions if the causal effect exists. From the perspective of causal inference, the sampling is a counterfactual reasoning process that can estimate the causal effect between different tasks' labels. And we hope a model that properly incorporates causality will be more robust and transferable, as argued in (Schölkopf, 2019).
To verify the effectiveness of our model, we conduct experiments on two English and one Chinese MTL datasets. The results show that our model achieves state-of-the-art (SOTA) results on 6 out of 7 subtasks and improves the consistency of predictions. We also present the estimated causal effect for several cases, which is consistent with humans' prior knowledge. In conclusion, the contributions of this paper can be summarized as follows:
• We view MTL from the causal perspective and suggest a mediation assumption instead of the confounding assumption in conventional MTL models.
• We propose a novel MTL model with two key mechanisms: label transfer and Gumbel sampling, which better utilize task dependencies and alleviate cascading errors.
• The experiments are carried out on both English and Chinese datasets and demonstrate our model's effectiveness and the better consistency of predicted results across subtasks.

Related Work
In natural language processing (NLP), many studies focus on modeling task dependencies to improve MTL. A line of work proposed hierarchical MTL architectures by stacking the encoders of different tasks, with simple tasks at lower layers and complex tasks at top layers (Søgaard and Goldberg, 2016; Hashimoto et al., 2017; Sanh et al., 2018). Zhong et al. (2018) proposed a topological MTL architecture based on the topological hierarchy of tasks. Another line of work tried to re-encode the predictions of low-level tasks. Giannis et al. (2018) re-encoded the low-level tasks' predicted labels with the highest probability during testing, and encoded the gold labels during training. Luan et al. (2019) re-encoded the soft predictions during testing and also encoded the gold ones during training. Yang et al. (2019) proposed a bidirectional architecture that produces initial probability distributions for different tasks and then refines those distributions by conditioning on each other during both training and testing. Our work is also related to studies on text generation. The democratic and hierarchical MTL schemes in Figure 1 are similar to non-autoregressive language models like BERT (Devlin et al., 2019), which can generate syntactically incorrect sentences (Ghazvininejad et al., 2019). The logically dependent MTL scheme is similar to the autoregressive language model but retains a train-test discrepancy. GS and RL have been investigated to deal with this discrepancy (Kusner and Hernández-Lobato, 2016; Yu et al., 2017; Nie et al., 2019).

Task Definition
To show our method's high versatility, we investigate three different MTL scenarios: joint entity and relation extraction (JERE), aspect-based sentiment analysis (ABSA), and legal judgment prediction (LJP).

Joint Entity and Relation Extraction
JERE includes entity mention extraction (EMD) and relation extraction (RE) (Li and Ji, 2014).

Entity Mention Extraction. EMD can be formulated as a sequence labeling problem with a BILOU scheme (Sanh et al., 2018). Given a sequence of tokens X = {x_1, x_2, ..., x_n}, EMD assigns a categorical label to each token, i.e., Y^(e) = {y_i^(e)}, 1≤i≤n, y_i^(e) ∈ C^(e), where C^(e) denotes a predefined set of categories.

Relation Extraction. RE can be formulated as a multi-head selection problem (Bekoulis et al., 2018). For the given sentence X, RE outputs a three-dimensional matrix Y^(r) = {y_{i,j,c}^(r)}, i≤n, j≤n, c ∈ C^(r), where y_{i,j,c}^(r) is a binary value representing the existence of the cth relation between the ith and the jth tokens, and C^(r) denotes the set of relation categories. Consistent with (Bekoulis et al., 2018), we only consider relations between the last tokens of entity mentions; redundant relations are therefore not classified.
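As a small illustration of the BILOU scheme used for EMD, the sketch below converts gold entity spans into per-token tags. The function name and the (start, end, type) span format with inclusive end indices are our own illustrative assumptions, not from the paper.

```python
# A minimal sketch of BILOU tagging for EMD, assuming gold entity spans are
# given as (start, end, type) with an inclusive end index.
def to_bilou(n_tokens, spans):
    """Assign one BILOU tag per token: B-egin, I-nside, L-ast, O-utside, U-nit."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        if start == end:                      # single-token entity
            tags[start] = f"U-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"
            tags[end] = f"L-{etype}"
    return tags

# "Asif Muhammad Hanif arrived in London"
print(to_bilou(5, [(0, 2, "PER"), (4, 4, "GPE")]))
# → ['B-PER', 'I-PER', 'L-PER', 'O', 'U-GPE']
```

EMD then reduces to predicting one tag from C^(e) per token, exactly as in the sequence-labeling formulation above.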

Aspect Based Sentiment Analysis
The challenge of "SemEval-2014 Task 4" divides ABSA into four subtasks (Pontiki et al., 2014), and we consider the two subtasks which are logically dependent.

Aspect Term Extraction. ATE can also be formulated as a sequence labeling problem with a BILOU scheme (Li et al., 2019) or a BIO scheme (Luo et al., 2019b). Similar to EMD, ATE assigns a categorical label from a predefined set C^(t) to each token.

Aspect Category Detection. ACD detects the aspect categories for the given sentence X, which can be formulated as a multi-label classification problem. Let C^(d) denote a predefined set of categories; we need to predict a label set Y^(d) = {y_c^(d)}, c ∈ C^(d), where y_c^(d) is a binary value representing the existence of the cth category.

Legal Judgment Prediction
LJP aims to predict the judgment results of legal cases, such as relevant law articles and charges. In this paper, we consider three subtasks for Chinese LJP: Relevant Article Prediction (RAP), Charge Prediction (CP), and Prison Term Prediction (PTP). Following previous work (Zhong et al., 2018; Yang et al., 2019), we only consider those cases with a single relevant article and a single charge, and we divide the prison terms into non-overlapping intervals. Each subtask can then be formulated as a single-label classification problem. Specifically, for a given case X, LJP assigns labels y^(a) ∈ C^(a), y^(c) ∈ C^(c), y^(p) ∈ C^(p), where C^(a), C^(c), and C^(p) are the sets of categories of RAP, CP, and PTP, respectively.
The common point of the three MTL scenarios is that the subtasks in each are logically dependent, and we have prior knowledge that the logical orders are EMD→RE, ATE→ACD, and RAP→CP→PTP, respectively. The first scenario investigates knowledge transfer between two token-level tasks, the second from a token-level task to a sentence-level task, and the third among three sentence-level tasks.

Methodology
In this section, we first analyze MTL from the causal perspective in Subsection 4.1 and then introduce our models in the following subsections.

Figure 2: Two kinds of typical causal assumptions. Let H be the feature representation for X, and Y_m and Y_n be the outputs for two tasks t_m and t_n, respectively. The confounding assumption (the left sub-figure) considers Y_m and Y_n to be conditionally independent, while the mediation assumption (the right sub-figure) considers the logical dependency from Y_m to Y_n.

Basic Causal Assumptions
Let X, Y be two variables representing a sequence of text and the corresponding label, and let H be the feature representation of X. The causal graph is therefore X→H→Y. Previous work suggests that MTL may help extract common useful features (Liu et al., 2017), which mainly enhances the process X→H. When considering H→Y, there are two possible causal assumptions for MTL: the confounding and the mediation assumptions shown in Figure 2, where Y_m and Y_n are the outputs of tasks t_m and t_n, respectively.
The confounding assumption is that Y_m and Y_n are conditionally independent and only determined by H. However, for logically dependent tasks, we suggest a mediation assumption that Y_m has a causal effect on Y_n. Specifically, the assumption includes two causal paths between Y_m and Y_n. One links Y_m to Y_n through the mediator H (the solid line), known as the indirect effect. The other links Y_m to Y_n directly (the dashed line), known as the direct effect.
It is worth noting that someone may argue that Y_m and Y_n can have mutual causal effects on each other, but causal graphs are acyclic in most cases. Moreover, recent work has demonstrated that the hierarchical order matters (Sanh et al., 2018).

Full Causal Graphs
We denote our model as causal multi-task learning (CMTL) and show the full causal graphs in Figure 3 when considering more than two tasks. We also compare with three other typical MTL models: fully-shared multi-task learning (FSMTL), shared-private multi-task learning (SPMTL), and hierarchical multi-task learning (HMTL). FSMTL shares the feature representation H_s for all tasks, and SPMTL learns a task-specific representation H_k based on H_s for task t_k. HMTL stacks the encoders of different tasks. Our model CMTL is derived from HMTL, but the main difference is that CMTL incorporates the inter-task causality through two paths. It first creates an intermediate variable TH^{k−1} conveying the label information of all tasks preceding t_k. The model then involves the indirect causal effect through the path TH^{k−1}→H^k→Y^k, and the direct causal effect through the path TH^{k−1}→Y^k.

Model Details
The architecture of our model is shown in Figure 4. Generally, the indirect causal effect is implemented by the solid lines connecting "Label Transfer" and "Encoders", and the direct causal effect is implemented by the dashed lines connecting "Label Transfer" and "Predictors".

Figure 4: The architecture of CMTL. The indirect and direct causal paths are implemented by the solid lines connecting "Label Transfer" and "Encoders", and the dashed lines connecting "Label Transfer" and "Predictors", respectively.

Token Embedding. Firstly, given an input sentence X with length n, a token embedding layer maps each token into a fixed-dimensional vector. When combining with BERT (Devlin et al., 2019), we keep it fixed during training to save memory. BERT converts X into a sequence of WordPiece tokens with a length greater than n; to make it suitable for token-level tasks, we select the first WordPiece token embedding for each original token. Furthermore, we use a weighted combination of different layers, similar to ELMo (Peters et al., 2018), to utilize both deep and shallow embeddings. The final token embeddings are denoted by E = {e_i}, 1≤i≤n.

Label Embedding. As shown in Figure 4, label embedding layers are used to encode the labels of tasks. We denote the gold label of task t_k as Y^k = {y_i^k}, y_i^k ∈ C^k, where C^k is the set of categories. Let 1≤i≤n when t_k is a token-level task, and let i=0 when t_k is a sentence-level task (Y^k contains only one element). Our model then encodes Y^k into label embeddings LE^k = {le_i^k}. Specifically, our model converts y_i^k to a one-hot vector ȳ_i^k and computes the label embedding by:

le_i^k = W^k ȳ_i^k    (1)

where W^k is the parameter matrix for task t_k with shape d_l × |C^k| and d_l is the dimension of label embeddings. In this way, the labels of different tasks are mapped into the same latent space.

Label Transfer.
After label embedding, for each task, we want our model to utilize the label information of all its preceding tasks instead of only the last one. This process naturally suits recurrent neural networks (RNNs), in which the kth element depends on its preceding k−1 elements. In our model, an LSTM (Hochreiter and Schmidhuber, 1997) is adopted for LT, which maintains transferred hidden states TH^k = {th_i^k} for task t_k. The computation of th_i^k can be expressed as:

th_i^k = LSTM(le_i^k, th_i^{k−1})    (2)

Similarly, we have 1≤i≤n when t_k is a token-level task and i=0 when t_k is a sentence-level task, which means that Equation 2 can be used for either token-level or sentence-level transfer.

Encoders. The transferred labels are then fed to the encoders. As shown in Figure 4, the inputs of Encoder^(k) include three parts: the token embeddings, the transferred labels, and the outputs of the (k−1)th encoder. The outputs H^k can be represented by:

H^k = Encoder^(k)(E, TH^{k−1}, H^{k−1})    (3)

Generally, the choice of encoder falls into three categories: RNN, CNN, and Transformer (Vaswani et al., 2017). We mainly implement RNN and CNN in our model because the memory complexity of Transformer is O(n^2) (Kitaev et al., 2020), which is much higher for long text.
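As a minimal illustration of the LT recurrence over the task index, the sketch below replaces the LSTM cell with a simple element-wise tanh cell; the cell, the scalar weights, and the dimension are toy assumptions, not the paper's implementation.

```python
import math

# Toy sketch of label transfer across tasks: a recurrent update over the task
# index k, so th_k summarizes the label embeddings of all preceding tasks.
# The real model uses an LSTM cell; a tanh cell keeps the code short.
def lt_step(le_k, th_prev, w=0.5, u=0.5):
    """One LT step: th_k = tanh(w * le_k + u * th_{k-1}), element-wise."""
    return [math.tanh(w * x + u * h) for x, h in zip(le_k, th_prev)]

def label_transfer(label_embeddings, d=3):
    """Run the recurrence over all tasks' label embeddings (lowest task first)."""
    th = [0.0] * d
    states = []
    for le_k in label_embeddings:
        th = lt_step(le_k, th)
        states.append(th)
    return states

# Two tasks, 3-dimensional toy label embeddings
states = label_transfer([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
# states[1] now carries information from both tasks' labels
```

The same loop works for sentence-level transfer by using a single position (i=0) instead of one state per token.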
For the MTL scenarios JERE and ABSA, the encoders are based on bidirectional LSTMs (BiLSTM). Equation 3 becomes:

H^k = BiLSTM(E ⊕ TH^{k−1} ⊕ H^{k−1})    (4)

where ⊕ denotes the concatenation operation along the last dimension, H^k is the feature representation of X for task t_k with shape n × d_h, and d_h is the size of the hidden states.
For the MTL scenario LJP, the subtasks are all sentence-level classification tasks, and we empirically find that involving CNN performs better than simply adopting a BiLSTM (see Section 5.4). We employ CNN encoders (Kim, 2014) followed by max pooling to generate initial sentence-level representations h_init^k, and a shared LSTM layer to generate the final representation h_0^k for task t_k:

h_init^k = pool(CNN(E))    (5)
h_0^k = LSTM(h_init^k ⊕ th_0^{k−1}, h_0^{k−1})    (6)

where th_0^{k−1} denotes the sentence-level transferred label embedding.

Predictors. The predictors then process H^k (or h_0^k) and TH^{k−1} (or th_0^{k−1}) as follows:

ŷ_i^k = Predictor^(k)(h_i^k ⊕ th_i^{k−1})    (7)

where 1≤i≤n for JERE and ABSA, and i=0 for LJP, and ŷ_i^k is the predicted probability distribution over the categories in C^k. The predictors are simply based on fully connected networks, and the sequence labeling tasks do not involve a conditional random field (CRF) (Lafferty et al., 2001). More details can be found in the Appendix.
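To make the predictor concrete, here is a pure-Python sketch of a single fully connected layer with softmax applied to the concatenation of an encoder feature vector and a transferred-label state; the weights and dimensions are made-up toy values, not trained parameters.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Toy predictor: concatenate the encoder feature h with the transferred-label
# state th, then apply one fully connected layer followed by softmax.
def predict(h, th, W, b):
    x = h + th  # list concatenation = vector concat along the last dimension
    logits = [sum(wr[j] * x[j] for j in range(len(x))) + b[r]
              for r, wr in enumerate(W)]
    return softmax(logits)

# 2 feature dims + 2 label-state dims -> 3 categories
W = [[0.1, 0.2, 0.3, 0.4],
     [0.0, 0.1, 0.0, 0.1],
     [0.5, 0.0, 0.1, 0.0]]
b = [0.0, 0.0, 0.0]
p = predict([1.0, 0.5], [0.2, 0.1], W, b)
assert abs(sum(p) - 1.0) < 1e-9
```

The concatenation of h and th is what carries the direct causal path from lower-level labels into the prediction.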

Gumbel Sampling
During the training stage, for task t_k, we can pre-train the network using the gold labels Y^j = {y_i^j}, j<k, of all the lower-level tasks. However, this does not tackle the train-test discrepancy, because the model uses predicted labels of the lower-level tasks during testing. To deal with the problem, we use GS to sample a label from the predicted probability distribution ŷ_i^j. Specifically, we assume that Predictor^(j) involves a logit vector o_i^j followed by a softmax function to produce the probability distribution ŷ_i^j:

ŷ_i^j = softmax(o_i^j)    (8)

Gumbel-softmax uses a re-parameterization trick to approximate multinomial sampling by:

ỹ_i^j = softmax((o_i^j + g) / τ)    (9)

where g is sampled from Gumbel(0, 1) and τ is the temperature. When τ→0, ỹ_i^j approximates the one-hot vector of a value sampled from ŷ_i^j. During training, our model replaces the gold labels with ỹ_i^j. A lower-level task then has a certain probability of sampling a counterfactual value and gets feedback from higher-level tasks if the causal effect actually exists.
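The Gumbel-softmax sampling step above can be sketched in a few lines; this is a plain illustration of the trick (the straight-through variant and any framework-specific details are omitted).

```python
import math, random

def gumbel_softmax(logits, tau=0.05):
    """Draw one Gumbel-softmax sample: softmax((logits + g) / tau),
    with g_i ~ Gumbel(0, 1) via the inverse-CDF trick -log(-log(u))."""
    g = [-math.log(-math.log(random.random() + 1e-20) + 1e-20)
         for _ in logits]
    z = [(l + gi) / tau for l, gi in zip(logits, g)]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

random.seed(0)
y = gumbel_softmax([2.0, 1.0, 0.1], tau=0.05)
# With a small temperature (the paper uses tau = 0.05) the sample is close to
# a one-hot vector, yet remains differentiable with respect to the logits.
assert abs(sum(y) - 1.0) < 1e-9
```

In an autodiff framework, the gradient flows through the softmax so that higher-level tasks can penalize counterfactual low-level samples.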

Interpretation From a Causal Perspective
After training, we attempt to interpret our model from a causal perspective to inspect what matters for predictions. The key idea is causal effect estimation from observed data (Veitch et al., 2019), which is based on Pearl's theory with the intervening operation (Pearl, 2010). Considering two tasks t_m and t_n, we estimate the causal effect of a label y_i^m of task t_m on a label y_j^n of task t_n. We first intervene on y_i^m to get a random counterfactual value ȳ_i^m ≠ y_i^m. Under the mediation assumption, the average causal effect is estimated by:

ACE = E[f(y_j^n | y_i^m, H^n(y_i^m)) − f(y_j^n | ȳ_i^m, H^n(ȳ_i^m))]    (10)

where E(·) stands for the expectation operation over the observed data, and f(y_j^n | y_i^m, H^n(y_i^m)) stands for the predicted probability of y_j^n given y_i^m and the corresponding features H^n(y_i^m) for task t_n. Besides estimating the causal effect of labels, we also inspect the influence of elements in X such as n-grams. We can intervene on the original sequence to get another text sequence X_{−x_g} with an n-gram x_g masked out. Since n-grams may be quite sparse, only the individual causal effect is estimated:

ICE = f_n(X) − f_n(X_{−x_g})    (11)

where f_n(·) represents the prediction for task t_n given a text sequence.
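The estimation procedure can be sketched as follows. This is a heavily simplified toy: `toy_predict` stands in for the trained model's conditional probability f(·|·), the lookup table and data are fabricated, and the recomputation of features H^n under the intervention is folded into the predictor.

```python
import random

# Sketch of average-causal-effect estimation in the spirit of the paper:
# compare a model's predicted probability under the observed low-level label
# against the probability under a randomly intervened counterfactual label.
def toy_predict(high_label, low_label):
    """Toy stand-in for f: P(high_label | low_label)."""
    table = {("a", "B"): 0.9, ("a", "C"): 0.1,
             ("x", "B"): 0.3, ("x", "C"): 0.7}
    return table[(low_label, high_label)]

def estimate_ace(observed_low_labels, low_label_set, target="B"):
    diffs = []
    for low in observed_low_labels:
        counterfactual = random.choice([l for l in low_label_set if l != low])
        diffs.append(toy_predict(target, low)
                     - toy_predict(target, counterfactual))
    return sum(diffs) / len(diffs)

random.seed(1)
ace = estimate_ace(["a", "a", "a"], low_label_set=["a", "x"])
# ace ≈ 0.6: observing "a" raises P("B") from 0.3 to 0.9
```

A near-zero estimate would indicate that the high-level prediction ignores the low-level label, i.e., no direct causal effect.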

Datasets
To verify the effectiveness of our model, we experiment on three datasets corresponding to the three MTL scenarios. For JERE, we use the ACE05 corpus (Doddington et al., 2004), which covers 7 types of entities and 6 types of relations; we use the same data splits as previous work (Katiyar and Cardie, 2017; Sanh et al., 2018). For ABSA, we use the restaurant-domain reviews of SemEval-2014 task 4 (Pontiki et al., 2014). ATE is a simple BIO tagging task, and ACD is a multi-label classification task with 5 categories. Furthermore, we randomly hold out 15% of the training data as the development set. For LJP, we use the CAIL (Chinese AI and Law Challenge) 2018 dataset. Following (Zhong et al., 2018; Yang et al., 2019), we only consider cases with a single law article and a single charge. Meanwhile, infrequent law articles and charges (fewer than 100 occurrences in the train set) are not included. And we divide the prison terms into non-overlapping intervals, consistent with (Zhong et al., 2018). The numbers of categories for RAP, CP, and PTP are 94, 116, and 11, respectively. The statistics of the filtered datasets can be found in Table 1.

Baselines
The compared models include two single-task models, which use BiLSTM and CNN (Kim, 2014) as encoders respectively, and three conventional multi-task models: FSMTL (Liu et al., 2017; Zheng et al., 2018), SPMTL (Liu et al., 2017; Zheng et al., 2018), and HMTL (Sanh et al., 2018). Besides these models, we also compare with several SOTA models for each MTL scenario, which will be cited in the following subsections.

Evaluation and Settings
The evaluation metrics are micro Precision (P), Recall (R), and F1 scores for each subtask in JERE and ABSA. For LJP, the evaluation metrics are accuracy (Acc.) and macro P, macro R, and macro F1 scores, consistent with (Zhong et al., 2018; Yang et al., 2019). For all models, we use the AllenNLP framework to build the neural networks. The hidden size of the BiLSTM and the label embeddings is 300. We also investigate each model with BERT-large-uncased (Devlin et al., 2019) or 300-dimensional GloVe (Pennington et al., 2014) embeddings for JERE and ABSA. For LJP, since it is a Chinese dataset (http://cail.cipsc.org.cn/index.html), we use THULAC (Maosong et al., 2016) for word segmentation and randomly initialize the word embeddings. For CNN-based models, we set the number of filters to 512 and the sliding window lengths to 2, 3, 4, and 5 (each window contains 128 filters). The temperature τ of GS is set to 0.05. The batch size is 32, and the learning rate is 5 × 10^−4. The maximum numbers of training epochs for JERE, ABSA, and LJP are 80, 20, and 10, respectively, and each model stops training when the F1 score on the development set has not improved in the past 10 epochs (the patience is set to 10). Moreover, a special setting for LJP is that we pre-train our model for 5 epochs with gold labels and train the next 5 epochs with GS. For the other two MTL scenarios, which have fewer label categories, pre-training is not involved. We report the averaged metrics after repeating the training process 5 times.

Table 2: Experiment results of different models for JERE and ABSA. The results marked with (*) mean that the models use an additional task, Coreference Resolution (CR). Note that previous SOTA models are task-specific: the SOTA models for JERE (or ABSA) are not ready for ABSA (or JERE). The rows with "Gold Labels" mean using gold labels of low-level tasks.

Table 3: Experiment results of different models for LJP. The rows with "Gold Labels" mean using gold labels of low-level tasks.
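The patience-based early-stopping rule described above can be sketched as follows; the function name and the list-based score history are illustrative only.

```python
# Minimal sketch of the early-stopping rule: stop when the development-set F1
# has not improved for `patience` consecutive epochs.
def should_stop(dev_f1_history, patience=10):
    if len(dev_f1_history) <= patience:
        return False
    best = max(dev_f1_history)
    best_epoch = dev_f1_history.index(best)  # first epoch reaching the best F1
    return len(dev_f1_history) - 1 - best_epoch >= patience

assert not should_stop([0.1, 0.2, 0.3], patience=10)   # still improving
assert should_stop([0.5] + [0.4] * 10, patience=10)    # 10 epochs, no gain
```

AllenNLP's trainer provides this behavior through its `patience` setting; the sketch just makes the rule explicit.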

Main Results
We first present the experiment results on the two English datasets for JERE and ABSA in Table 2. Regarding JERE, our model achieves a SOTA result on RE, beating the model proposed by Luan et al. (2019), which uses an additional task (coreference resolution, CR), by 2.60 F1 points. Among the models without CR, our model achieves the best results on both EMD and RE, improving the F1 scores by 1.93 and 2.97 points, respectively. Regarding ABSA, our model achieves new SOTA results on ATE and ACD, increasing the F1 scores by 2.62 and 0.39 points, respectively. Table 2 also shows the results of the conventional MTL models FSMTL, SPMTL, and HMTL; our model consistently outperforms them whether or not BERT embeddings are used. We then present the experiment results on the Chinese dataset for LJP in Table 3. Note that "HMTL-CNN" is not obtained by merely replacing the BiLSTM encoders of "HMTL" with CNNs, because we empirically find this does not perform well. Instead, we denote the ablated version of our model as "HMTL-CNN", which is consistent with the scenarios JERE and ABSA. It is also worth noting that the previous SOTA models for LJP are re-implemented by us because we obtain slightly different data splits after preprocessing the dataset. Moreover, they did not utilize the development set and only tested their models after training a certain number of epochs; as a result, some models may greatly suffer from overfitting at the final epoch. In our experiments, as shown in Table 3, the previous SOTA models perform best on PTP, and our model further improves the F1 score by 1.72 points over the best of them. Compared with the other baseline models, our model performs best on all subtasks.
We also show the results of a toy experiment where our model uses gold labels of low-level tasks in Tables 2 and 3. In all three MTL scenarios, using gold labels of low-level tasks leads to performance gains on high-level tasks. The results confirm the existence of label dependencies between tasks. This means that if humans rectify the predictions of low-level tasks, our model can utilize them to improve the predictions of high-level tasks. Conventional MTL models cannot utilize this information because they assume the labels to be conditionally independent.

Table 4: Ablation analysis. The mechanisms GS and LT are eliminated from CMTL one after another, resulting in the models "HMTL+LT" and "HMTL", respectively. Furthermore, only keeping the indirect or the direct causal path of LT results in the models "CMTL (Indirect)" and "CMTL (Direct)", respectively.

Ablation Study
To analyze which mechanisms are driving the improvements, we present the results of an ablation study in Table 4. We first eliminate GS and LT from CMTL one after another, which results in the models "HMTL+LT" and "HMTL". As shown, GS and LT are both influential, especially for high-level tasks. For example, eliminating GS leads to a drop of 2.24 F1 points on RE, and eliminating both mechanisms leads to a significant drop of 4.47 points. We then keep only the indirect or the direct causal path of CMTL, which results in the models "CMTL (Indirect)" and "CMTL (Direct)", respectively. Both ablated models perform slightly worse than CMTL, and the indirect causal path is more important than the direct one for most subtasks.

Case Study
Influence of Label Transfer. Generally, LT enables a high-level task to utilize all its lower-level tasks' predictions and therefore improves the consistency of the predicted results. To directly see the influence, we give some cases in Figure 5. For example, in Case 1, both HMTL and CMTL successfully recognize the entities, including "Asif Muhammad Hanif" and "London". Nevertheless, HMTL does not correctly predict their relationship "GEN-AFF" (citizens and the place they come from), while CMTL predicts it correctly. Another example is Case 2, which shows the translated document of a Chinese legal case. As shown, the relevant law is Article 238, which describes the crime of illegal detention, but the charge predicted by HMTL is kidnapping, which is a more serious crime. Such inconsistent judgments are unacceptable to judges or the public in practice.

Estimated Causal Effect. To show that the estimated causal effect (computed by Equation 10) in our model is consistent with humans' prior knowledge, we present some results in Table 5. As shown, Article 238 has a causal effect of 0.30 points on Illegal Detention and no effect on other charges, which is consistent with legal knowledge (view Article 238 in Figure 5). Article 350 has a causal effect of 0.90 points on Causing Traffic Accident, which is quite high. The reason may be that Article 350 has only 163 samples in the train set, while Article 238 has 1,427 samples; the confidence of infrequent labels can be greatly improved by knowing the low-level gold labels. The third row shows a one-to-many causal effect: Article 133 has a causal effect on both Causing Traffic Accident and Dangerous Driving. But the effect is small, and the prediction should mainly count on the features extracted from the input text.

Influence of Gumbel Sampling. GS enables a low-level task to get useful feedback from its higher-level tasks. To see the influence, we show another case (Case 3) for the task RAP (the lowest-level task in LJP) in Figure 6. As shown, the gold label of the case is Article 133, which is about dangerous driving. Without GS, the model HMTL+LT predicts an incorrect result, Article 233, which is about negligently causing one's death. The two law articles both describe one's death, but Article 133 has priority in the event of traffic accidents. Figure 6 also shows the estimated causal effect of each influential 4-gram (computed by Equation 11). CMTL captures the priority of Article 133 by understanding that the translated n-gram "mainly responsible for the accident" is more important (with a causal effect of 0.80 on Article 133) than the n-gram "died after being rescued by the hospital" (with a causal effect of −0.02).

Conclusion
In this paper, we investigated the MTL problem with logically dependent tasks. We first analyzed MTL models from the perspective of causal inference and then proposed a model, CMTL, to properly utilize task dependencies. The model achieves SOTA results on 6 out of 7 subtasks and improves the consistency of the predicted results of different subtasks. In the future, we are interested in social science topics, such as modeling the causal effect between mental health and suicide decisions reflected through social media, which may help predict and prevent such final decisions.

A Details of Predictors and Loss Function
The subtasks in the three MTL scenarios can be categorized into four types: sequence labeling, inter-token multi-label classification, multi-label text classification, and single-label text classification.
Since the primary goal of our work is to investigate the task dependency, the architectures of predictors are based on simple fully-connected neural networks (FCNNs).
• If task t_k is EMD or ATE, which belong to sequence labeling, given feature representations H^k = {h_i^k}, 1≤i≤n, the prediction layer makes a token-level prediction as follows:

ŷ_i^k = softmax(W_p^k h_i^k + b_p^k)

where W_p^k and b_p^k are the trainable parameters of the FCNN. The loss is computed by cross-entropy:

L_k = −Σ_{i=1}^{n} Σ_{c∈C^k} y_{i,c}^k log ŷ_{i,c}^k

where y_{i,c}^k denotes the ground-truth value of the cth category for the ith token, and C^k is the set of categories.
• If task t_k is RE, which belongs to inter-token multi-label classification, the prediction process can be described by:

ŷ_{i,j}^k = σ(V_p^k f(U_p^k h_i^k + W_p^k h_j^k + b_p^k))

where ŷ_{i,j}^k denotes the predicted probability distribution of the relations between the ith and jth tokens, and V_p^k, U_p^k, W_p^k, and b_p^k are the trainable parameters of the FCNN. The symbol σ(·) stands for the sigmoid function, and f(·) stands for the element-wise activation function (ReLU in this paper). The loss L_re is computed according to cross-entropy:

L_re = −Σ_{i,j} Σ_{c∈C^k} [y_{i,j,c}^k log ŷ_{i,j,c}^k + (1 − y_{i,j,c}^k) log(1 − ŷ_{i,j,c}^k)]

where y_{i,j,c}^k denotes the ground-truth value of the cth category for the relation between the ith and jth tokens. Note that we only consider relations between the last tokens of entity mentions; redundant relations are therefore not classified.
• If task t_k is ACD, which belongs to multi-label text classification, the prediction process is as follows:

h_0^k = pool(H^k)

where h_0^k represents the sentence-level feature representation obtained by the max-pooling function pool(·) over the tokens. The predicted probability distribution is:

ŷ_0^k = σ(W_p^k h_0^k + b_p^k)

Then the loss is computed by cross-entropy as follows:

L_k = −Σ_{c∈C^k} [y_c^k log ŷ_c^k + (1 − y_c^k) log(1 − ŷ_c^k)]

• If task t_k is RAP, CP, or PTP, which belong to single-label text classification, the prediction process is as follows:

ŷ_0^k = softmax(W_p^k h_0^k + b_p^k)

where the difference from multi-label text classification is the use of softmax(·) instead of σ(·). The loss is also computed by cross-entropy:

L_k = −Σ_{c∈C^k} y_c^k log ŷ_c^k

When considering multi-task learning, we denote the set of tasks by T = {t_1, t_2, ..., t_k, ..., t_K} and sum up the losses of the tasks by:

L = Σ_{k=1}^{K} λ_k L_k

where λ_k is the hyper-parameter for each task t_k. We empirically set λ_k = 1.0 in this paper.
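The combined objective above can be sketched as follows: per-task cross-entropy losses are summed with weights λ_k (all 1.0 in the paper). The toy distributions below are fabricated for illustration.

```python
import math

# Sketch of the combined multi-task objective: a weighted sum of per-task
# cross-entropy losses, with lambda_k = 1.0 by default.
def cross_entropy(gold_onehot, probs):
    """Single-label cross-entropy: -log of the probability of the gold class."""
    return -sum(g * math.log(p) for g, p in zip(gold_onehot, probs) if g > 0)

def multi_task_loss(task_losses, lambdas=None):
    lambdas = lambdas or [1.0] * len(task_losses)
    return sum(l * w for l, w in zip(task_losses, lambdas))

l1 = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])  # e.g. a RAP prediction
l2 = cross_entropy([1, 0], [0.9, 0.1])          # e.g. a CP prediction
total = multi_task_loss([l1, l2])
# total = -log(0.7) - log(0.9)
```

Uniform λ_k keeps all subtasks on equal footing; tuning the weights is a common alternative but was not needed here.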

B Validation Performance During Training
We also show the validation performance of several models on the highest-level subtask of each of the three MTL scenarios during training in Figure 7. The F1 scores and losses on the development set of three models are presented: CMTL and two ablated models, HMTL+LT and HMTL. Each model was trained on a Tesla P100 GPU with a maximum memory of 16GB.
For RE, HMTL ran out of patience at about epoch 35, as its F1 score had not improved in the past 10 epochs, while HMTL+LT and CMTL kept training for nearly 80 epochs. The best F1 score of CMTL was higher than that of HMTL+LT by 1.86 points, and its loss curve was more stable. Similarly, for ACD and PTP, the best F1 scores of CMTL were consistently higher than those of the other two models, and the loss of CMTL was relatively stable. These results demonstrate that our model can better utilize the task dependencies and is more robust than the two ablated models.
Moreover, an interesting result was that the validation loss of HMTL+LT grew faster than that of HMTL. The reason may be that the predicted labels of low-level tasks in HMTL+LT excessively influenced the decisions of high-level tasks, leading to cascading errors. If an incorrectly predicted label of a low-level task is fed in, the high-level task will make a wrong prediction with high confidence, making the loss of HMTL+LT larger than that of HMTL. By adding Gumbel sampling, our model achieved the smallest loss on the development set, which indicates that Gumbel sampling properly considered the causal effect and alleviated the cascading errors.