Multitask Learning for Emotionally Analyzing Sexual Abuse Disclosures

The #MeToo movement on social media platforms initiated discussions over several facets of sexual harassment in our society. Prior work by the NLP community for automated identification of the narratives related to sexual abuse disclosures barely explored this social phenomenon as an independent task. However, emotional attributes associated with textual conversations related to the #MeToo social movement are complexly intertwined with such narratives. We formulate the task of identifying narratives related to the sexual abuse disclosures in online posts as a joint modeling task that leverages their emotional attributes through multitask learning. Our results demonstrate that positive knowledge transfer via context-specific shared representations of a flexible cross-stitched parameter sharing model helps establish the inherent benefit of jointly modeling tasks related to sexual abuse disclosures with emotion classification from the text in homogeneous and heterogeneous settings. We show how for more domain-specific tasks related to sexual abuse disclosures such as sarcasm identification and dialogue act (refutation, justification, allegation) classification, homogeneous multitask learning is helpful, whereas for more general tasks such as stance and hate speech detection, heterogeneous multitask learning with emotion classification works better.


Introduction
The #MeToo movement was started as an initiative to empower women against long-standing issues related to sexual abuse at workplaces, public spaces, and private organizations (McKenna and Chughtai, 2020). The usage of a dedicated hashtag #MeToo on media platforms signified a social support system for women from different sections of society. The movement initiated discussions on many socially stigmatized issues that were missing from the virtual space (Clark-Parsons, 2019). Such conversations invited various reactions on the web, involving support to the cause of the movement and even outright bullying. While many users took part in the vilification of the survivors, the movement also saw opposition by factions of the society that felt threatened by the impact of social media in raising awareness about the scale of everyday sexual harassment faced by women in workplaces and institutions (Tambe, 2018). In many instances, the public disclosures of survivor-narrated incidents involved widespread use of hate-language and online trolling, both against the victims and alleged oppressors (Franks, 2019). The #MeToo movement also led to people coming out with allegations, refutations, and justifications about traumatic experiences as they transitioned to active participants in the mainstream conversation. A closer look at the online posts about the #MeToo movement revealed that sarcasm was often used as a thin veil in such discussions to humorously mask disapproval, wit, and personal attacks (Sandhu et al., 2019).
The complex narratives present in the conversations on stigmatized issues like sexual abuse create an opportunity for researchers to study how people express their opinions on a sensitive topic in an informal social setting. They also offer social media regulators a chance to foster social inclusion and community integration, and to improve individuals' perception of being supported by others. This paper aims at categorizing the posts related to the #MeToo movement on the basis of stance (support or opposition), hate speech, sarcasm, and dialogue acts (allegation, refutation, or justification of sexual misconduct). We focus our analysis on a publicly available dataset that was created against the backdrop of mass instances of sexual harassment disclosures and includes nuanced labels to identify accompanying linguistic behaviors.
Existing literature has emphasized that the text's emotional attributes have a high correlation with dialogue narratives describing instances of sexual harassment (Lane and Hedin, 2020). Prior works (Anzovino et al., 2018; Sharifirad et al., 2018) have mostly focused on label-specific detection of linguistic narratives related to sexual harassment disclosures in isolation by exploiting lexical features (Chowdhury et al., 2019; Karlekar and Bansal, 2018). However, subtle intricacies present in the discussion of sexual abuse disclosures often reflect the speaker's affective and psychological state, which are overlooked by feature-engineered models. For instance, part (a) of Figure 1 shows a tweet expressing support towards the #MeToo movement but in a tone that might be difficult for naive neural learning models to capture without context. Part (b) of Figure 1 presents a tweet in which the author has an initial positive outlook, which later reverses to disgust for the subject. The lack of context about the event and the contrasting qualifications describing the oppressor make the correct classification of the sexual harassment disclosure label extremely challenging for traditional classifiers without the additional supervision of emotion labels.
Moreover, apart from their inherent complexity, conversations related to the #MeToo movement also pose a challenge of emotional ambiguity. This work is the first attempt at joint modeling of narratives related to sexual abuse disclosures and emotion classification to learn the patterns of their interaction via parameter sharing techniques offered by Multitask Learning (MTL). The affective features, which result from a joint learning setup through shared parameters, will encompass the text's emotional content that is likely to be predictive of narratives corresponding to sexual abuse disclosures. More specifically, we formulate an MTL framework for multi-label classification of narratives related to sexual abuse disclosures (stance, hate speech, sarcasm, dialogue acts) and emotional classification in the context of the #MeToo movement. MTL (Caruana, 1997) allows two or more related tasks to be learned jointly. This facilitates the transfer of inductive bias and better generalization across related tasks on account of shared representations of linguistic features.

Contributions
We experiment with MTL architectures employing a flexible cross-stitched parameter sharing method that benefits from both hard parameter sharing and soft parameter sharing through a gated mechanism using a weighted summation (Section 4). Hard parameter sharing allows for sharing lower-level word representations, and soft parameter sharing permits the sharing of task-specific networks. We explore two flavors of multitask learning: (i) Homogeneous MTL: intra-domain MTL between related tasks of sexual abuse disclosure narratives, and (ii) Heterogeneous MTL: cross-domain MTL between pairs of tasks in emotion classification and narratives of sexual abuse disclosure (Section 5.2). Our results demonstrate that both Homogeneous and Heterogeneous MTL setups outperform the Single Task Learning (STL) technique across various tasks (Section 6). Further, we conduct a qualitative analysis of several samples to analyze the benefit of joint training of related tasks (Section 6.4), keeping in mind the ethical concerns of communities affected by this research (Section 7).

Related Work
Sexual Harassment Disclosures on Social Media: Several works have focused on identifying sexual violence (Leatherman, 2011), harassment, and sexism (Wekerle et al., 2018; Manikonda et al., 2018b) in social media posts by analyzing factors such as linguistic themes, social engagement, and lexical attributes. Jha and Mamidi (2017) experimented with algorithms such as SVM and BiLSTM along with fastText to categorize the hostility of sexist posts. Parikh et al. (2019) proposed a multi-label CNN-based neural architecture along with word and sentence level embeddings for identifying variants of sexism present in online social platforms. Chowdhury et al. (2019) emphasized the use of linguistic themes, contextual meta-data, and semantic cues for evaluating human behaviors related to sexual abuse disclosures. All of these works have dealt with modeling sexual disclosure narratives as single-task learning problems and were restricted to label-specific detection (Marwa et al., 2018).

Multitask Learning: Frameworks for learning representations across two different sources within the same domain follow multitask learning (Caruana, 1997). The ability to utilize knowledge from various sources compensates for missing data and complements existing meta-data (Tan et al., 2013; Ding et al., 2014), thus allowing for effective sharing of task-invariant features (Caruana, 1997; Zhang and Wang, 2016; Zhang et al., 2018). MTL has been utilized for name error recognition (Cheng et al., 2015), tagging-chunking (Collobert et al., 2011), machine translation (Luong et al., 2015) and relation extraction (Gupta et al., 2016). Liu et al. (2017)

Problem Description
We aim to analyze different perspectives of the complex narratives pertaining to the #MeToo movement on social media platforms. Specifically, given a tweet text, we formulate for it a multi-label multiclass classification problem with definitions taken from previous works (ElSherief et al., 2018):
• Stance Detection: Determining the opinion of the author of a tweet regarding a particular target of interest (Augenstein et al., 2016). Stance detection is categorized into three classes: Support, for when the author favors the #MeToo movement or its cause; Opposition, representing an opposing stance or indifference towards the movement; or Neither, when the text does not have a clear viewpoint (Mohammad and Turney, 2013).
• Sarcasm Detection: Given a tweet $t_i$, we aim to map it to either Sarcastic or Not Sarcastic based on the presence of an implicit sarcastic tone in the post (Bamman and Smith, 2015).
• Dialogue Act Classification: Dialogue acts are a function of a speaker's utterance during a conversation (for example, question, answer, or suggestion) and are classified into three classes, namely Allegation (when the author intends to accuse an individual or group of sexual misconduct) (Hutchings, 2012), Justification (tweets where the author is justifying their actions), and Refutation (for when the author refutes any accusation with or without evidence).

Modeling Settings
To validate MTL's performance across different domains, we also experiment with emotion detection as the auxiliary task. We aim to predict one or more of the several emotions representing the affective state of the authors (anger, disgust, anticipation, fear, joy, love, optimism, pessimism, sadness, surprise and trust). We conceptualize three diverse problem settings and compare them to analyze MTL within and across domains. These are (i) Single Task Learning: Independent optimization of the four mentioned tasks associated with sexual abuse disclosure narrative classification, (ii) Homogeneous Multitask Learning: Simultaneous optimization of a pair selected from the four tasks associated with the sexual abuse disclosure posts, and (iii) Heterogeneous Multitask Learning: Classification of narratives associated with sexual abuse disclosure as the primary task and emotion detection as the auxiliary task.

Text Encoding
Building on the success of transformer-based models in NLP, we chose BERTweet (Dat Quoc Nguyen and Nguyen, 2020), a pre-trained language model trained on 850 million English tweets. BERTweet has been trained with the same training procedure as RoBERTa (Liu et al., 2019) and has the same model configuration as the BERT base architecture (Devlin et al., 2019). The key component in transformer-based models is the token-level self-attention (Vaswani et al., 2017) that enables them to generate dynamic contextualized embeddings as opposed to the static embeddings of GloVe (Pennington et al., 2014). Let $(w_1, w_2, \ldots, w_n)$ represent the sequence of tokens from a given tweet $t$.
These tokens are pre-processed and passed through BERTweet. We consider embeddings from the last layer of BERTweet and obtain an embedding $e_i$ for a given tweet $t_i$. The embedding for each tweet is of dimension $m \times k$, where $k$ represents the dimension size of the BERT-based model and $m$ represents the maximum length for the tweets.
These representations from Equation 1 are passed through a stacked BiLSTM encoder. Dropout is then applied to these encoded representations $h^{(t)}$ (Equation 4 represents the general formulation for both the tasks). These are then passed to a BiLSTM decoder, followed by a dropout layer and then a linear output layer to get the output $o^{(p)}$ ($p$ representing the primary task) or $o^{(a)}$ ($a$ representing the auxiliary task).

Single Task Learning
We treat the tasks of categorizing narratives related to sexual abuse disclosure (Stance, Hate Speech, Sarcasm, and Dialogue Acts) independently. Each STL model is given an input representation $e$ (Equation 1). Within the proposed tasks for classifying sexual abuse disclosure narratives for the tweets related to the #MeToo movement (Section 3), we use sigmoid activation for Sarcasm detection (whose classification outputs are binary) and softmax activation for all other tasks in the final output layer.

Model Optimization
To account for the imbalance present among the labels, we use class-balanced focal loss as the optimization loss function (Cui et al., 2019), as formulated in Equation 5. Given a sample of class $i$ containing $n_i$ samples in total, it adds a weighting factor of $\frac{1-\beta}{1-\beta^{n_i}}$ with parameter $\beta \in [0, 1)$; in Equation 5, $n_y$ is the number of samples in the ground-truth class $y$. The proposed class-balanced term is model agnostic. $p$ represents the predicted class probabilities and $\mathcal{L}$ represents the choice of the loss function (binary cross-entropy for Sarcasm and categorical cross-entropy for the others).

$$\mathrm{CB}(p, y) = \frac{1-\beta}{1-\beta^{n_y}} \, \mathcal{L}(p, y) \qquad (5)$$
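As a minimal sketch of the class-balanced weighting described above, the per-class factor and the reweighted loss could be computed as follows. It assumes the effective-number form of Cui et al. (2019); the function names and the default value of β are illustrative and are not taken from the paper's implementation.

```python
def class_balanced_weights(samples_per_class, beta=0.999):
    """Return (1 - beta) / (1 - beta ** n) for every class count n.

    beta=0.999 is an illustrative default, not a value reported in the paper.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    if any(n <= 0 for n in samples_per_class):
        raise ValueError("every class needs at least one sample")
    return [(1.0 - beta) / (1.0 - beta ** n) for n in samples_per_class]


def class_balanced_loss(base_loss, true_class, samples_per_class, beta=0.999):
    """Scale an already computed loss value L(p, y) by the weight of class y."""
    weights = class_balanced_weights(samples_per_class, beta)
    return weights[true_class] * base_loss
```

With β = 0 every class receives weight 1 (no re-weighting); as β approaches 1, rare classes receive increasingly larger weights relative to frequent ones.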
As for the multilabel emotion classification task, the unnormalized output (assuming one or more of 11 different emotions) is subjected to a sigmoid activation, and the network is optimized using binary cross-entropy (BCE) as:

$$\mathrm{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p(y_i) + (1 - y_i) \log\left(1 - p(y_i)\right) \right] \qquad (6)$$

where $N$ is the number of training samples, and $y$ and $p(y)$ denote the true and predicted labels, respectively.
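The multilabel objective above can be illustrated with a small, dependency-free sketch. It applies a sigmoid to each unnormalized output and averages the binary cross-entropy over samples and labels; averaging over the label axis as well is an assumption made here for simplicity, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def multilabel_bce(y_true, logits, eps=1e-12):
    """Mean binary cross-entropy over all samples and labels.

    y_true: list of samples, each a list of 0/1 labels (one per emotion).
    logits: unnormalized scores with the same shape as y_true.
    """
    total, count = 0.0, 0
    for labels, scores in zip(y_true, logits):
        for y, z in zip(labels, scores):
            # Clip the probability so that log() never receives 0.
            p = min(max(sigmoid(z), eps), 1.0 - eps)
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return -total / count
```

Because each emotion is an independent binary decision, a tweet can be assigned several emotions at once, which a softmax output would not allow.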

Multitask Learning
For our MTL approach, we use two optimization objectives: one for the primary task, which can be any of the proposed tasks for classifying tweets related to the #MeToo movement (Section 3), and the other for the auxiliary task, which can be either a task related to classifying sexual abuse disclosure for the #MeToo movement (Homogeneous MTL) or the emotion classification task (Heterogeneous MTL). The two objectives are weighted by a parameter γ, which controls the importance placed on the auxiliary task (1 − γ for the primary task). Multitask learning frameworks are generally built using either of these two approaches: hard parameter sharing or soft parameter sharing. In a hard parameter sharing model (Caruana, 1997), both the primary and auxiliary tasks have a shared encoder followed by separate task-specific network branches, and the shared encoder is updated by both the tasks alternately. On the other hand, in the soft parameter sharing approach, tasks have different encoders with independent parameters, and the distance between their parameters is regularized using a regularization constraint (Duong et al., 2015; Yang and Hospedales, 2016) to encourage the parameters to be similar.
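One plausible reading of the γ weighting, given that the two tasks are optimized in alternating steps, is to scale the loss of each step by its task weight. The sketch below shows only that scaling; the function name and the string labels are hypothetical, and the paper's implementation may combine the objectives differently.

```python
def weighted_task_loss(loss, task, gamma):
    """Scale a task loss: gamma for the auxiliary task, 1 - gamma for the primary."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if task == "primary":
        return (1.0 - gamma) * loss
    if task == "auxiliary":
        return gamma * loss
    raise ValueError("task must be 'primary' or 'auxiliary'")
```

A small γ keeps the shared encoder focused on the primary task, while the auxiliary task still contributes a regularizing signal.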
Flexible Cross-Stitched Parameter Sharing Architecture: We design our model so that the task-agnostic textual feature representations benefit from hard sharing while the regularization of the task-specific features can be learned according to task pair settings. We call our approach flexible cross-stitched parameter sharing, presented in Figure 2. Specifically, we train two separate models (one for each task) in tandem while also having a shared encoder that is updated by both of them and weighted joint learning of primary task decoder parameters that are tuned specifically for the task. This allows both the models to have their own set of parameters while also encouraging knowledge transfer via the shared encoder weights.
For each training pass of the primary task, the input representation $e^{(p)}$ is passed through (a) a stacked BiLSTM encoder and (b) a stacked shared BiLSTM encoder. This results in two contextualized word representations ($h^{(p)}$ and $h^{(s)}$), where superscript $(p)$ denotes the representations resulting from the encoder in the primary task model and superscript $(s)$ denotes the ones from the shared encoder. We calculate the weighted summation of these two representations, $\tilde{h}^{(p)} = \alpha^{(p)} h^{(p)} + \alpha^{(s)} h^{(s)}$, using two learnable parameters, $\alpha^{(p)}$ and $\alpha^{(s)}$ (where $\alpha^{(p)} + \alpha^{(s)} = 1$), as formulated in Equation 7, to regulate the information resulting from the two encoders (Figure 2).
Such an approach to aggregating the information flow from two encoders has facilitated success in prior multitask learning settings as well (Rajamanickam et al., 2020; Dankers et al., 2019). As for our auxiliary task, we pass the embeddings $e^{(a)}$ through only the shared encoder ($h^{(a)} = h^{(s)}$), followed by a dropout layer. We use this architecture for the Heterogeneous MTL experiments. For the Homogeneous MTL ones, we employ a hard parameter sharing model because it performed statistically better in this scenario. This technique consists of a single stacked encoder that is shared and updated by both tasks related to identifying narratives related to sexual abuse disclosures within the #MeToo movement, followed by task-specific branches. The shared representations from the encoder are passed through the dropout layer.
These output representations (in the case of both Homogeneous and Heterogeneous experiments) are passed through the respective BiLSTM decoders and dropout layers to get the final representations $m^{(p)}$ and $m^{(a)}$ for the two tasks, respectively. The auxiliary network branch is optimized using either Equation 5 (Class Balanced Focal Loss) or Equation 6 (Binary Cross Entropy), depending upon whether the auxiliary task is associated with identifying sexual abuse disclosure narratives or emotions. The representations $m^{(p)}$ and $m^{(a)}$ are passed through a linear output layer to get the unnormalized outputs $o^{(p)}$ and $o^{(a)}$, respectively. The sigmoid activation function is used for Sarcasm detection and the emotion classification task, and softmax activation for the others.
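The gated combination of the task-specific and shared encoder outputs used for the primary task can be sketched as below. Parameterizing the two gate weights through a softmax over two free logits is an assumption made here so that they always sum to one; the source only states that the weights are learnable and sum to one. All names are illustrative.

```python
import math


def gate_weights(logit_p, logit_s):
    """Map two free parameters to (alpha_p, alpha_s) with alpha_p + alpha_s = 1."""
    m = max(logit_p, logit_s)  # subtract the max for numerical stability
    e_p, e_s = math.exp(logit_p - m), math.exp(logit_s - m)
    z = e_p + e_s
    return e_p / z, e_s / z


def cross_stitch(h_primary, h_shared, alpha_p, alpha_s):
    """Weighted sum of two encoder outputs of identical shape.

    h_primary, h_shared: list of time steps, each a list of hidden features.
    """
    return [
        [alpha_p * a + alpha_s * b for a, b in zip(row_p, row_s)]
        for row_p, row_s in zip(h_primary, h_shared)
    ]
```

When the weight on the shared encoder approaches 1, the primary branch behaves like hard parameter sharing; when it approaches 0, the primary task effectively falls back to its own encoder.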

Data
[Figure 2: Flexible cross-stitched parameter sharing architecture. The inputs are BERTweet word-level embeddings for the primary and auxiliary task, respectively. Solid arrows indicate passes of the primary task and dotted arrows indicate passes of the auxiliary task. Two controllable parameters $\alpha^{(p)}$ and $\alpha^{(s)}$ control the information flow from the task-specific and shared encoder, respectively, for the primary task.]

An MTL framework traditionally improves generalization by leveraging the domain-specific information due to the relatedness of the tasks present in the training signals (Caruana, 1997); hence we use two publicly available datasets mined from Twitter:
Sexual Abuse Disclosures - #MeTooMA: This dataset⁴ has 9,973 tweets and covers different mutually non-exclusive linguistic annotations related to the #MeToo movement. The distribution and statistics about various labels are present in Table 1 and Section 3. We present an instance associated with each of the proposed tasks in Table 1. For our experiments, we focus only on tweets that are annotated as relevant to the #MeToo movement.

Emotions -SemEval18
This dataset⁵ has been taken from SemEval-2018 Task 1 (Mohammad et al., 2018) and covers emotion-specific labels representing the mental state of the authors of the tweets. It consists of 10,986 tweets distributed across 11 emotion labels (anger, disgust, anticipation, fear, joy, love, optimism, pessimism, sadness, surprise and trust), each being a binary label to indicate the presence of a particular emotion.

Task Specific Setting
Single Task Learning: STL experiments optimize each of the tasks associated with identifying narratives related to sexual abuse disclosures within the #MeToo movement (Section 3) and emotion detection, independently. We experiment with two distinct embedding spaces: GloVe-Twitter and BERTweet. Based on the superior performance of BERTweet with respect to GloVe-Twitter, we preferred it for further experimentation and studies.

Homogeneous Multitask Learning: For this setup, we test the simultaneous optimization of two different tasks, both related to sexual harassment disclosure narratives, with one of them being primary and the other coupled as the auxiliary. The results were obtained for a total of 12 pairs.

Heterogeneous Multitask Learning: In these sets of experiments, we evaluate the positive transfer of representations across datasets by considering the identification of narratives associated with sexual abuse disclosure as the primary task and emotion detection as the auxiliary task.

⁴ The publicly available dataset can be found at https://doi.org/10.7910/DVN/JN4EYU.
⁵ https://competitions.codalab.org/competitions/17751
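The count of 12 Homogeneous pairs follows from choosing an ordered (primary, auxiliary) pair from the four disclosure tasks, which gives 4 × 3 = 12. A short sketch that enumerates both experimental grids is shown below; the task names are taken from the text, while the function names and the dictionary layout are illustrative.

```python
from itertools import permutations

DISCLOSURE_TASKS = ("Stance", "Hate Speech", "Sarcasm", "Dialogue Acts")


def homogeneous_pairs(tasks=DISCLOSURE_TASKS):
    """All ordered (primary, auxiliary) pairs of distinct disclosure tasks."""
    return [{"primary": p, "auxiliary": a} for p, a in permutations(tasks, 2)]


def heterogeneous_pairs(tasks=DISCLOSURE_TASKS, auxiliary="Emotion"):
    """Each disclosure task as primary, with emotion detection as auxiliary."""
    return [{"primary": p, "auxiliary": auxiliary} for p in tasks]
```

Order matters because the primary and auxiliary roles are not symmetric: Stance with Sarcasm as auxiliary is a different experiment from Sarcasm with Stance as auxiliary.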

Experimental Setup
Preprocessing: We pre-process tweet text by (i) normalizing user mentions and URLs, and (ii) translating emoticons into text (Hutto and Gilbert, 2014). For tokenization, we use the Tweet Tokenizer from NLTK.

Hyperparameters: For our model⁷, hyperparameters were tuned on the validation set to find the best configurations. We use a pre-trained BERTweet model to extract 768-dimensional token-level embeddings.
For each task associated with identifying narratives pertaining to the #MeToo movement in the MTL setup, the value of γ is taken to be the one at which the model performance improved the most for both tasks. For instance, we find the optimal value of γ for hate speech (as the auxiliary task) to be 0.4 in all Homogeneous task cases and that for emotion detection to be 0.2 for the Heterogeneous tasks. For the MTL experiments, $\alpha^{(p)}$ and $\alpha^{(s)}$ are learnable and tuned on the validation loss. The encoders consist of two stacked BiLSTMs with hidden size = 128. The BiLSTM classifier has hidden size = 256, and the number of units in the penultimate dense layer is 128. Dropout is set to 0.3. For all our experiments, we use the Adam optimizer (Kingma and Ba, 2014) and initialize model weights using Xavier initialization (Glorot and Bengio, 2010). We set the batch size to 128 and the learning rate to 1e-3.
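The preprocessing described above (normalizing user mentions and URLs, and translating emoticons into text) could look roughly like the following sketch. The placeholder tokens `@USER` and `HTTPURL`, the tiny emoticon table, and the whitespace split standing in for the NLTK tokenizer are all assumptions made for illustration; the source does not specify them.

```python
import re

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Hypothetical, deliberately tiny emoticon-to-text table.
EMOTICON_TEXT = {":)": "smile", ":(": "sad", ":D": "laugh"}


def normalize_tweet(text):
    """Replace URLs and mentions with placeholders and spell out emoticons."""
    text = URL_RE.sub("HTTPURL", text)
    text = MENTION_RE.sub("@USER", text)
    # Whitespace split is a stand-in for a tweet-aware tokenizer.
    tokens = [EMOTICON_TEXT.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)
```

Mapping every mention and link to a single placeholder keeps the vocabulary small and prevents the model from memorizing specific accounts or links.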
Training: All models were trained until convergence for both primary and auxiliary tasks. For our MTL experiments, the training process involves alternating between primary and auxiliary task steps, with each task having its own loss function. All experiments are run using stratified 5-fold cross-validation. We report the average macro F1 scores across the 5 folds to account for imbalance, as previously used in multi-label settings (Zhang and Zhou, 2013).

⁷ We used Keras with a TensorFlow backend for implementing the models.

Results and Discussion

Single Task Learning
The aim of this paper is not limited to achieving state-of-the-art performance in terms of evaluation metrics but rather to conduct a thorough study to compare and contrast different methodologies for the benefit of the research community. As per our hypothesis and preliminary results on STL experiments on the #MeTooMA dataset, models trained using BERTweet embeddings perform far better than those using GloVe-Twitter. This is largely because BERTweet is specifically pre-trained on English tweets and is better suited to handle Twitter-specific data, which typically has a short length, informal grammar, and irregular vocabulary (e.g., abbreviations and typographical errors) (Kireyev et al., 2009).

Single Task Learning vis-a-vis Homogeneous Multitask Learning
Learning the affective states in the #MeTooMA dataset is challenging due to the inherently subjective nature of the tweets coupled with limitations on the data's size. Multitask learning achieves significant performance gains in terms of macro F1 score, as shown in Table 2 for all task pairs. The diagonal results represented in green denote the baseline STL results, whereas the ones highlighted in shades of blue represent results for pair-wise Homogeneous MTL, with rows identifying the primary task and columns denoting the auxiliary task. The higher performance of Homogeneous MTL can be inferred to be indicative of better generalization when pairs of tasks are jointly modeled. Interestingly, these tasks show their best performance with selective counterparts in the Homogeneous MTL setup. Stance detection is strongly coupled with Sarcasm labeling, and the same is seen to be true for Hate Speech classification and Stance identification. This selective out-performance of specific pairs of tasks can be attributed to the high correlation between the tasks themselves (Frenda, 2018). For instance, offensive text is often strongly coupled with sarcasm, as wit is a common linguistic denominator for understanding the intended meaning of phrases related to anger (Badlani et al., 2019). We further detail this through examples in Section 6.4.

Heterogeneous Multitask Learning
Results in Table 3 demonstrate that the Heterogeneous MTL setup achieves higher performance than Homogeneous MTL under similar settings in two out of four task pairs, Stance and Hate Speech detection, by margins of +0.21 and +0.19 respectively. For the other two tasks, the performance of Heterogeneous MTL is very close to, if not better than, that of Homogeneous MTL. These findings are in line with the claim supporting the generalizability across tasks in the #MeTooMA dataset, which is highly correlated to emotion recognition. This is indicative of positive knowledge transfer between the two domains. Such joint optimization boosts the overall performance of both primary and auxiliary tasks through parameter sharing to learn common representations that may be mutually beneficial to both related tasks.

Qualitative Analysis
To illustrate our proposed approach, we perform a qualitative study by handpicking examples from the dataset. We analyze the token-level attention assigned to individual terms by BERTweet, where color intensity corresponds to the attention score. These results are shown in Table 4. We infer that Homogeneous and Heterogeneous multitask learning show superior performance in every instance compared to STL. Learning effective features across the joint formulation of pair-wise tasks in Homogeneous MTL is evident from T4, where BERT's self-attention allots a higher weight to words such as ideology, stigma, and forward, in line with the actual label, Support.
Similarly for T5, highlighted terms such as trap and bait are indicative of the opposing nature of the tweet, which is hence identified as belonging to Refutation. On the other hand, due to positive knowledge transfer from the emotion recognition task, Heterogeneous MTL obtains better performance in several cases. Words such as grave, mistake and swindling in T2 connote a negative emotion, and the tweet is accordingly identified as belonging to the Oppose category. Similarly, terms such as hope and pain were given higher token-level attention in T1, emphasizing a positive emotion, and can thus be correlated with belonging to the Support category. An interesting observation is the presence of named entities in T5 and T6, resulting in an incorrect prediction via Heterogeneous MTL. Therefore, a limitation of single task learning and Heterogeneous MTL is their inability to mitigate the effect of named entities or specific events in the text, which can influence the knowledge transfer and create negative shared representations.

Ethical Concerns and Discussion
Analyzing social media data of individuals discussing sexual harassment disclosures and exploitation in public spheres necessitates safeguarding the ethics and privacy of individuals (Tusinski Berg, 2019). We address these concerns below.
Generalization: We acknowledge that the limitations of the experiments might get amplified due to the highly subjective nature of this challenging problem. Therefore, it would not be fair to conduct a population-centric analysis based on inferences from this work.
Confidentiality: Individual consent was not sought from social media users as the data was publicly available. Disclosure of sexual harassment information on public forums may have been met with public backlash and apathy. Therefore, the social reputation of the accuser and the accused would be in peril (McDonald, 2019). Hence, the authors took care not to make any automated interventions, as any attempts to contact individuals could be seen as personally intrusive and might also reveal their social information (Fiesler and Proferes, 2018).
Bias & Discrimination: Social support discussions on social media platforms gave victims the liberty to describe their instances of sexual exploitation and abuse (Manikonda et al., 2018a). The authors are aware of the potential inevitable sampling biases that may be present in the data. Importance has to be placed on mitigating the bias against certain minority groups, which might get amplified due to the sensitive nature of social discussions (Hellwig and Sinno, 2017).

Conclusion
In this work, we have proposed a flexible cross-stitched multitask learning framework for the detection of narratives linked with sexual abuse disclosure on social media. Our methodology takes advantage of the affective features from emotions and related tasks to encourage knowledge transfer and attain auxiliary knowledge. Qualitative and quantitative results demonstrate that Stance detection and Sarcasm identification benefit each other under joint optimization, indicating their relatedness and mutual dependence. Similarly, we observe that tasks like Hate Speech classification and Stance labeling benefit from each other and from emotion detection, thus reinforcing the benefit of joint linguistic learning between the related tasks. In the future, we aim to explore how this joint learning paradigm can be effectively leveraged for improving performance on downstream tasks such as emotion analysis and identifying suicidal tendencies among abuse survivors. This work also has utility for problems such as the identification of patterns of reported sexual harassment narratives, hate speech detection, the spread of rumors and fake news, and entity extraction for digital vigilantism (Yuce et al., 2014; Hosterman et al., 2018).