Multi-label Categorization of Accounts of Sexism using a Neural Framework

Sexism, an injustice that subjects women and girls to enormous suffering, manifests in blatant as well as subtle ways. In the wake of growing documentation of experiences of sexism on the web, the automatic categorization of accounts of sexism has the potential to assist social scientists and policy makers in utilizing such data to study and counter sexism better. The existing work on sexism classification, which is different from sexism detection, has certain limitations in terms of the categories of sexism used and/or whether they can co-occur. To the best of our knowledge, this is the first work on the multi-label classification of sexism of any kind(s), and we contribute the largest dataset for sexism categorization. We develop a neural solution for this multi-label classification that can combine sentence representations obtained using models such as BERT with distributional and linguistic word embeddings using a flexible, hierarchical architecture involving recurrent components and optional convolutional ones. Further, we leverage unlabeled accounts of sexism to infuse domain-specific elements into our framework. The best proposed method outperforms several deep learning as well as traditional machine learning baselines by an appreciable margin.


Introduction
Sexism, discrimination on the basis of one's sex, prevails in our society in numerous forms, causing immense suffering to women and girls. Online forums have enabled victims of sexism to share their experiences freely and widely by facilitating anonymity and connecting far-away people. A meaningful categorization of these accounts of sexism can play a part in analyzing sexism with a view to developing sensitization programs, systemic safeguards, and other mechanisms against this injustice. Given the substantial volume of posts related to sexism on digital media, automated sexism categorization can aid social scientists and policy makers in combating sexism by conducting such analyses efficiently.

* The author is also an applied researcher at Microsoft.
While sexism is detected as a category of hate in some of the work on hate speech classification (Badjatiya et al., 2017; Waseem and Hovy, 2016), that work does not perform sexism classification. Except for the work on categorizing sexual harassment by Karlekar and Bansal (2018), the prior work on classifying sexism assumes the categories to be mutually exclusive (Anzovino et al., 2018; Jha and Mamidi, 2017). Moreover, the existing category sets number between 2 and 5. In this paper, we focus on the new problem of the multi-label categorization of an account of sexism reporting any type(s) of sexism. We create a dataset comprising 13023 accounts of sexism, including first-person accounts from survivors, each tagged with at least one of 23 categories of sexism. The categories were defined keeping in mind the discourse and campaigns on gender-related issues along with potential policy implications, under the guidance of a social scientist. Ten annotators, most of whom have formally studied topics related to gender and/or sexuality, were recruited to label textual accounts of sexism. The accounts are drawn from the Everyday Sexism Project website 1 , where voluntary contributors from all over the world document experiences of sexism suffered or witnessed by them. For classification experiments, the categories found in fewer than 400 accounts in our dataset are appropriately merged with others, resulting in 14 categories.
The rationale for formulating this classification as multi-label is that many experiences inherently involve multiple types of sexism. For instance, "I overheard a co-worker saying that I should be in more team events and photos because I am pleasing to the eye! Disgusting." describes an experience of sexism wherein the victim was subjected to three types of sexism, namely hypersexualization, sexual harassment, and hostile work environment.
We develop a novel neural architecture for the multi-label classification of accounts of sexism that enables flexibly combining sentence representations created using models such as BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) with distributional word embeddings like ELMo (Embeddings from Language Models) (Peters et al., 2018) and Global Vectors (GloVe) (Pennington et al., 2014) and a linguistic feature representation through hierarchical convolutional and/or recurrent operations. Leveraging general-purpose models such as BERT for encoding sentences likely makes our model better equipped to capture semantic aspects effectively, since they are trained on substantially larger textual data than the domain-specific labeled data that we have. Moreover, we adapt a BERT model for the domain of instances of sexism using unlabeled data. Embeddings from sentence encoders are complemented by sentence representations built from word embeddings as a function of trainable neural network parameters. We explore multiple ways to deal with the multi-label aspect. The adopted method produces label-wise probabilities directly and simultaneously using shared weights and a joint loss function. Our experimentation finds multiple instances of the proposed framework outperforming several diverse baselines on established multi-label classification metrics.
Our key contributions are summarized below.
• We propose a neural framework for the multi-label classification of accounts of sexism that can combine sentence representations built from word embeddings of different kinds through learnable model parameters with those created using pre-trained models. It yields results superior to those of many deep learning and traditional machine learning baselines.
• To the best of our knowledge, this is the first work on classifying an account recounting any type(s) of sexism without the assumption of the mutual exclusivity of classes.
• We provide a dataset consisting of 13023 accounts of sexism by survivors and observers annotated with one or more of 23 carefully formulated categories of sexism.

Related Work
Substantial work has been directed to hate speech detection in recent years. Since some of it involves the detection of sexism (Badjatiya et al., 2017; Waseem and Hovy, 2016), we review it along with the work on sexism classification.

Hate Speech Detection
Warner and Hirschberg (2012) present one of the early approaches to hate speech detection. Karlekar and Bansal (2018) focus on accounts of sexual harassment, exploring CNN and/or RNN architectures for their 3-class classification.
As far as we know, our paper presents the first attempt to categorize accounts involving any type(s) of sexism in a multi-label manner. Moreover, we provide a larger dataset and a significantly more extensive or finer-grained categorization scheme than these papers.

Dataset Construction
Creating our multi-label sexism account categorization dataset entailed two parts: textual data collection and data annotation. To collect data, we crawled the Everyday Sexism Project website, which receives numerous accounts of sexism from survivors themselves as well as observers. After removing entries with fewer than 7 words, around 20000 entries were shortlisted for annotation; we prioritized shorter ones and tried to approximate the tag distribution on the website. Though shorter entries were preferred keeping in mind the potential future work of transfer learning to Twitter content, our neural framework is devised in a size-agnostic way.
Under the direction of a social scientist, 23 categories of sexism were formulated taking into account gender-related discourse and campaigns (Dutta and Sircar, 2013;Eccles et al., 1990;Mead, 1963;Menon, 2012) as well as possible impact on public policy. Table 1 provides succinct descriptions for the categories.
We followed a three-phase annotation process to ensure that the categorization of each account of sexism in the final dataset involved the labeling of it by at least two of our 10 annotators, most of whom had studied topics related to gender and/or sexuality formally. The annotators were given detailed guidelines, which evolved during the course of their work. Each annotator was given training, which included a pilot round involving evaluation and feedback. Phase 1 involved identifying one or more textual portions containing distinct accounts of sexism from an entry obtained from Everyday Sexism Project and subsequently tagging each portion with at least one of the 23 categories of sexism, producing over 23000 labeled accounts. In phases 2 and 3, we sought redundancy of annotations for improved quality, as permitted by the availability of annotators adequately knowledgeable about sexism. Over 21000 accounts were categorized again in phase 2 such that the annotators for phases 1 and 2 were different. The inter-annotator agreement across phases 1 and 2, measured by the average of the Cohen's Kappa (Cohen, 1960;Artstein and Poesio, 2008) scores for the per-category pairs of binary label vectors, is 0.584. Each account for which the label sets annotated across phases 1 and 2 were identical was included in the dataset along with the associated label set. In phase 3, some of the accounts for which there was a mismatch between the phase 1 and phase 2 annotations were selected. For each account, the annotators were presented with only the mismatched categories and asked to select or reject each. Duplicates and records for which the Everyday Sexism Project entry numbers match but the accounts do not fully match were removed at multiple stages. 
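The inter-annotator agreement reported above is the mean, over categories, of Cohen's kappa computed on pairs of binary label vectors. As a minimal illustration, kappa for one category can be computed as below (the function name and plain-Python style are ours, not the authors' implementation):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two binary label vectors (one per annotator).

    a[k] and b[k] are 0/1 labels assigned by the two annotators to the
    k-th account for a single category.
    """
    n = len(a)
    # Observed agreement: fraction of accounts labeled identically.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label rates.
    p_a1 = sum(a) / n
    p_b1 = sum(b) / n
    pe = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if pe == 1.0:  # degenerate case: both annotators are constant
        return 1.0
    return (po - pe) / (1 - pe)
```

Averaging such scores over the 23 categories yields the reported agreement figure.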
In order to improve the annotation reliability further, some records for which the annotations differed across phases 1 and 2 were discarded based on the annotators involved and sensitivities of the categories, resulting in a multi-label sexism categorization dataset of 13023 accounts. For our automated sexism classification experiments, we merge the categories found in fewer than 400 records with others as follows, resulting in 14 categories. 'Menstruation-related discrimination' and 'Motherhood-related discrimination' are merged into 'Motherhood and menstruation related discrimination'; 'Mansplaining', 'Gaslighting', 'Religion-based sexism', 'Physical violence (excluding sexual violence)', and 'Other' are merged into 'Other'; 'Pay gap' and 'Hostile work environment (excluding pay gap)' are merged into 'Hostile work environment'; 'Tone policing', 'Moral policing (excluding tone policing)', and 'Victim blaming' are merged into 'Moral policing and victim blaming'; 'Rape' and 'Sexual assault (excluding rape)' are merged into 'Sexual assault'. Our dataset, however, retains all 23 categories. Fig. 1 shows the frequency distribution of the number of labels per account in the dataset, demonstrating the multi-label nature of instances of sexism.
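The merging described above amounts to a simple many-to-one mapping over label sets. The sketch below (function and variable names are ours) illustrates it using the merges listed in the text:

```python
# Many-to-one map from fine-grained categories to the merged scheme.
# Categories not listed here are kept unchanged.
MERGE_MAP = {
    "Menstruation-related discrimination": "Motherhood and menstruation related discrimination",
    "Motherhood-related discrimination": "Motherhood and menstruation related discrimination",
    "Mansplaining": "Other",
    "Gaslighting": "Other",
    "Religion-based sexism": "Other",
    "Physical violence (excluding sexual violence)": "Other",
    "Pay gap": "Hostile work environment",
    "Hostile work environment (excluding pay gap)": "Hostile work environment",
    "Tone policing": "Moral policing and victim blaming",
    "Moral policing (excluding tone policing)": "Moral policing and victim blaming",
    "Victim blaming": "Moral policing and victim blaming",
    "Rape": "Sexual assault",
    "Sexual assault (excluding rape)": "Sexual assault",
}

def merge_labels(labels):
    """Map a set of fine-grained labels to the coarser 14-category scheme."""
    return {MERGE_MAP.get(lbl, lbl) for lbl in labels}
```

Because the mapping operates on sets, two fine-grained labels that merge into the same coarse category collapse into a single label, preserving the multi-label structure.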
Caution: 1) The category frequencies in our dataset (used for merging categories) do not represent the real-world frequencies of those categories of sexism, as they are affected by several factors, including the bias of our sampling scheme toward smaller posts and the small size of our dataset compared to the immense degree of prevalence of sexism in the world. 2) Labeling categories of sexism can be complex in many cases. Hence, despite our best efforts, our labeled data may contain inaccuracies or discrepancies. We also recognize that our categorization scheme could be improved.

Table 1: Categories of sexism and their descriptions.

- Role stereotyping: Socially constructed false generalizations about certain roles being more appropriate for women; also applies to such misconceptions about men
- Attribute stereotyping: Mistaken linkage of women with some physical, psychological, or behavioral qualities or likes/dislikes; also applies to such false notions about men
- Body shaming: Objectionable comments or behaviour concerning appearance, including the promotion of certain body types or standards
- Hyper-sexualization (excluding body shaming): Unwarranted focus on physical aspects or sexual acts
- Internalized sexism: The perpetration of sexism by women via comments or other actions
- Pay gap: Unequal salaries for men and women for the same work profile
- Hostile work environment (excluding pay gap): Sexism encountered by an employee at the workplace; also applies when a sexist misdeed committed outside the workplace by a co-worker makes working uncomfortable for the victim
- Denial or trivialization of sexist misconduct: Denial or downplaying of sexist wrongdoings
- Moral policing (excluding tone policing): The promotion of discriminatory codes of conduct for women in the guise of morality; also applies to statements that feed into such codes and narratives
- Victim blaming: The act of holding the victim responsible (fully or partially) for sexual harassment, violence, or other sexism perpetrated against her
- Slut shaming: Inappropriate comments made about women 1) deviating from conservative expectations relating to sex or 2) dressing in a certain way when it gets linked to sexual availability
- Motherhood-related discrimination: Shaming, prejudices, or other discrimination or misconduct related to the notion of motherhood; also applies to the violation of reproductive rights
- Menstruation-related discrimination: Shaming, prejudices, or other discrimination or wrongdoings related to periods
- Religion-based sexism: Sexist discrimination or prejudices stemming from religious scriptures or constructs
- Physical violence (excluding sexual violence): Domestic abuse, murder, kidnapping, confinement, or other physical acts of violence linked to sexism
- Mansplaining: A woman being condescendingly talked down to by a man; also applies when a man gives a woman unsolicited advice or an explanation related to something she knows well, which she disapproves of
- Gaslighting: Sexist manipulation of the victim through psychological means into doubting her own sanity
- Other: Any type of sexism not covered by the above categories

Ethical Data Use and Release
We are committed to following ethical practices, which includes protecting the privacy and anonymity of the victims. We only use accounts of sexism and tags from entries on the Everyday Sexism Project website (ESP). The entry titles, which could contain sensitive information related to the names or locations of the victims (or contributors), are not saved or used at all.
Our dataset can be requested for academic purposes only by providing some prerequisites as recommended by an ethics committee and agreeing to certain terms through our website 2 . The requesters who fulfill these conditions will be emailed 1) the data comprising only numerical placeholders and labels, 2) a script that fetches only accounts of sexism from ESP to obtain the account for each placeholder, and 3) the annotation guidelines used. We have devised this method to ensure that if an entry gets removed from ESP by a victim (or contributor), any and all parts of it in our dataset will also be removed.

Sexism Categorization Approach
Given an account of sexism (post), our objective is to predict a list of up to 14 applicable categories of sexism, making this a multi-label, multi-class classification task. In this section, we detail our proposed framework, which enables combining sentence representations derived from word embeddings using trainable model parameters with those obtained using general-purpose models. Our architecture is depicted in Fig. 2. We also discuss how we utilize unlabeled data and our choice of loss functions.

Sexism Categorization Architecture
Figure 2: Proposed sexism categorization architecture

Let each post contain a maximum of |S| sentences with a maximum of |W| words per sentence. Every word (or sentence) can be represented using multiple word (or sentence) embedding methods. Let f (or g) be the number of word (or sentence) embedding methods chosen. Let d^w_i (or d^s_j) be the embedding dimension for the i-th (or j-th) word (or sentence) embedding scheme. Each post is represented using two kinds of tensors: (a) f tensors ∈ ℝ^{|S|×|W|×d^w_i} created using different word embeddings, and (b) g tensors ∈ ℝ^{|S|×d^s_j} constructed using different sentence encoders.
First, subsets of the f tensors based on word embeddings are concatenated in a configurable manner (configurable word-level concat in Fig. 2), producing p tensors ∈ ℝ^{|S|×|W|×D^w_i}, where D^w_i is the dimension resulting from the i-th concatenation. Next, we construct vector representations for the sentences word-embedded in each of the p tensors using CNN-based and/or LSTM-based operations as configured. The CNN-based operations begin with convolutional filters being applied along the word dimension (Kim, 2014) to generate many bigram, trigram, and 4-gram based features. This is followed by max-over-time pooling, which picks the largest value for each filter and produces a sentence-representing tensor ∈ ℝ^{|S|×c}, where c is the total number of convolutional filters used. The LSTM-based components comprise a biLSTM followed by an attention mechanism (Yang et al., 2016) through which the LSTM outputs across time steps are aggregated into a vector representation for each sentence, resulting in a tensor ∈ ℝ^{|S|×h}, where h is the biLSTM output length. At this stage, we have three types of sentence-representing tensors if both CNN-based and RNN-based operations are chosen to be applied on all word embedding tensors: (a) p tensors ∈ ℝ^{|S|×c} from the CNN-based processing, (b) p tensors ∈ ℝ^{|S|×h} from the LSTM-based processing, and (c) g tensors ∈ ℝ^{|S|×d^s_j} obtained using general-purpose sentence encoders.
From these sentence-representing tensors, subsets are concatenated to produce q tensors ∈ ℝ^{|S|×D^s_j} (configurable sentence-level concat in Fig. 2), where D^s_j is the dimension stemming from the j-th concatenation. The sequence of sentence vectors in each of these q tensors is then passed through a biLSTM followed by attention-based aggregation, collectively producing q representations for a post. These vectors are then concatenated to produce the overall post representation. The final step involves a fully connected layer with a sigmoid or softmax non-linearity depending on the loss function used, generating the output probabilities.
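To make the attention-based aggregation concrete, below is a minimal, dependency-free sketch of the additive attention of Yang et al. (2016) used to pool a sequence of hidden states into one vector. The toy parameters stand in for learned weights, and the function name is ours, not the authors' code:

```python
import math

def attention_pool(states, w, b, u):
    """Pool a sequence of hidden states into one vector via additive attention.

    For each state h_t: u_t = tanh(W h_t + b); alpha_t = softmax(u_t . u);
    output = sum_t alpha_t * h_t. W is given row-wise; all parameters are
    stand-ins for weights that would normally be learned.
    """
    # u_t = tanh(W h_t + b) for every time step
    proj = [
        [math.tanh(sum(wij * hj for wij, hj in zip(row, h)) + bi)
         for row, bi in zip(w, b)]
        for h in states
    ]
    # Scalar relevance score per time step: u_t . u
    scores = [sum(pi * ui for pi, ui in zip(p, u)) for p in proj]
    # Numerically stable softmax over the scores
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    # Attention-weighted sum of the original states
    dim = len(states[0])
    return [sum(a * h[k] for a, h in zip(alphas, states)) for k in range(dim)]
```

With zero projection weights, all attention scores are equal, so the pooled vector reduces to the mean of the states, which is a convenient sanity check.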

Word and Sentence Representations
We model a post using both word embeddings and sentence embeddings. We experiment with three distributional word vectors, namely ELMo (Peters et al., 2018), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017), and a linguistic feature vector. Our linguistic feature representation comprises a variety of features, namely features from the biased language detection work (assertive verb, implicative verb, hedges, factive verb, report verb, entailment, strong subjective, weak subjective, positive word, and negative word) (Recasens et al., 2013), PERMA (Positive Emotion, Engagement, Relationships, Meaning, and Accomplishments) features for both polarities (Schwartz et al., 2016), associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive) from the NRC emotion lexicon (Mohammad and Turney, 2013), and affect (valence, arousal, and dominance) scores (Mohammad, 2018). Missing values are filled with zero for binary features and with the mean for non-binary ones.
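As a toy illustration of how lexicon-based linguistic features can be assembled into a vector, the sketch below uses tiny stand-in lexicons; the actual features come from the published resources cited above, and all names here are hypothetical:

```python
# Illustrative stand-in lexicons; the real system draws on the cited
# resources (Recasens et al. 2013; PERMA; NRC emotion/affect lexicons).
HEDGES = {"maybe", "perhaps", "possibly"}
NEG_WORDS = {"disgusting", "awful", "horrible"}

def linguistic_features(post):
    """Return a small feature vector of per-word lexicon-match ratios."""
    words = post.lower().split()
    n = max(len(words), 1)  # avoid division by zero on empty posts
    return [
        sum(w in HEDGES for w in words) / n,     # hedge ratio
        sum(w in NEG_WORDS for w in words) / n,  # negative-word ratio
    ]
```

In the full system, one such scalar is produced per lexicon or score type, and missing values are imputed with zero (binary features) or the mean (non-binary ones).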
We explore the following for creating sentence embeddings: BERT (Devlin et al., 2018), Universal Sentence Encoder (USE) (Cer et al., 2018), and InferSent (Conneau et al., 2017). Our choice of utilizing these models is warranted by the fact that the corpora that they are trained on are considerably bigger than the textual data that we have for supervised learning and hence likely contain greater semantic diversity.

Utilizing Unlabeled Data
Models such as BERT are not trained to generate representations tuned to a specific domain. We use over 90000 entries crawled from Everyday Sexism Project's website to tailor a pre-trained BERT model for obtaining more effective representations for our model. After removing the unlabeled entries corresponding to the posts in the test and validation data, we use the rest to tune the BERT parameters using its masked language modeling and next sentence prediction tasks. We henceforth refer to this refined model as tBERT (tuned BERT).

Loss Function Choice
Since the popular cross-entropy loss is inapt for our multi-label classification task in its standard form, we explore two alternatives. Binary (multi-hot) target vectors are used for both.

Extended Binary Cross Entropy Loss
We adopt an Extended version of the Binary Cross Entropy loss (EBCE), formulated as a weighted mean of label-wise binary cross entropy values in order to neutralize class imbalance:

L_EBCE = -(1/(nL)) Σ_{i=1}^{n} Σ_{j=1}^{L} [ w_{j1} y_{ij} log(p^σ_{ij}) + w_{j0} (1 − y_{ij}) log(1 − p^σ_{ij}) ]    (1)

Here, n and L denote the number of samples (posts) and the number of classes, respectively. y_{ij} is 1 if label l_j applies to post x_i and 0 otherwise. p^σ_{ij} is the estimated probability of label l_j being applicable to post x_i, computed using a sigmoid activation atop the fully connected layer with L output units. The weights w_{jv} for correcting class imbalance are computed from the label frequencies in the training data.
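A minimal sketch of the EBCE computation is given below. The per-label weights are passed in as inputs, since the paper derives them from label frequencies and the exact weighting formula is not reproduced here; all names are ours:

```python
import math

def ebce_loss(y, p, w1, w0):
    """Weighted mean of label-wise binary cross entropy terms.

    y: n x L matrix of 0/1 targets; p: n x L matrix of sigmoid outputs.
    w1[j] weights the positive term and w0[j] the negative term of label j
    (assumed to be derived from training-set label frequencies).
    """
    n, num_labels = len(y), len(y[0])
    total = 0.0
    for i in range(n):
        for j in range(num_labels):
            if y[i][j] == 1:
                total += -w1[j] * math.log(p[i][j])
            else:
                total += -w0[j] * math.log(1.0 - p[i][j])
    return total / (n * num_labels)
```

With all weights set to 1, this reduces to the ordinary mean binary cross entropy, so the weighting only rescales the per-label contributions.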

Normalized Cross Entropy Loss
We also experiment with a Normalized variant of the Cross Entropy loss (NCE), tailored for a multi-label problem configuration while also mitigating class imbalance.
L_NCE = -(1/n) Σ_{i=1}^{n} (1/|y_i^+|) Σ_{l_j ∈ y_i^+} w^c_j log(p̂_{ij})

Here, y_i^+ is the set of labels applicable to post x_i. p̂_{ij} denotes the estimated probability of label l_j being applicable to post x_i, computed through a softmax function. The class-imbalance-negating weights w^c_j are generated from the training-data label frequencies.
Unlike in single-label multi-class classification, wherein arg max can be applied to the probability vector generated by softmax to make the prediction, we could apply a threshold on the probability vector p̂_i to predict (potentially) multiple classes for post x_i. Instead of using a fixed, manually tuned threshold-related parameter, we devise an automated method for estimating a per-sample cut-off position. For each sample, we sort the probability scores in descending order, compute the differences between successive (sorted) score pairs, find the index m corresponding to the maximum value in the list of differences, and select the classes corresponding to indices [1..m]. Note that when sigmoid (along with the EBCE loss) is used instead of softmax, the prediction is made by rounding the probability vector, since it comprises the class-wise binary prediction probabilities.
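The per-sample cut-off estimation just described can be sketched as follows (the function name is ours; at least two classes are assumed, which holds in our 14-class setting):

```python
def predict_labels(probs):
    """Select classes above the largest gap in the sorted probability scores.

    Sort the scores descendingly, compute differences between successive
    sorted scores, locate the largest difference, and keep every class
    ranked above that gap.
    """
    order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    sorted_p = [probs[j] for j in order]
    diffs = [sorted_p[k] - sorted_p[k + 1] for k in range(len(sorted_p) - 1)]
    m = max(range(len(diffs)), key=lambda k: diffs[k]) + 1  # cut-off position
    return sorted(order[:m])
```

For example, with scores [0.7, 0.05, 0.6, 0.1], the largest gap falls between 0.6 and 0.1, so the classes scoring 0.7 and 0.6 are predicted.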

Discussion on Single-label Transformations
Traditional approaches to multi-label classification include transforming the problem to one or more single-label classification problems. The Label Powerset (LP) method (Boutell et al., 2004) treats each distinct combination of classes existing in the training set as a separate class. The standard cross-entropy loss can then be used along with softmax. This transformative method may impose a greater computational cost than the direct approach using the EBCE loss since the cardinality of the transformed label set may be relatively high. Moreover, LP does not generalize to label combinations not covered in the training set. Another approach based on problem transformation is binary relevance (BR) (Boutell et al., 2004). An independent binary classifier is trained to predict the applicability of each label in this method. This entails training a total of L classifiers, making BR computationally very expensive. Additionally, its performance is affected by the fact that it disregards correlations existing between labels.
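The LP transformation can be sketched as below (hypothetical function name): each distinct label combination seen in training is assigned its own class id for use with a standard single-label classifier.

```python
def label_powerset_transform(label_sets):
    """Map each training instance's label set to a powerset class id.

    Returns the per-instance class ids and the combination-to-id mapping.
    Unseen combinations at test time have no id, illustrating why LP
    cannot generalize to label combinations absent from training.
    """
    combos = {}
    ids = []
    for labels in label_sets:
        key = tuple(sorted(labels))  # canonical form of the combination
        if key not in combos:
            combos[key] = len(combos)
        ids.append(combos[key])
    return ids, combos
```

The size of `combos` is the cardinality of the transformed label set, which drives the extra computational cost mentioned above.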
Experiments

We evaluate the proposed framework against several baselines and provide qualitative and quantitative analyses. Our code is available on GitHub 3 . Our implementation utilizes parts of the code from (Agrawal and Awekar, 2018; Pattisapu et al., 2017; Liao, 2017) and the libraries Keras and Scikit-learn (Pedregosa et al., 2011). We reserve 15% of the data each for testing and validation.

Evaluation Metrics
Owing to the multi-label nature of this classification, standard metrics used in single-label multi-class classification are unsuitable. We adopt established example-based (instance-based) metrics, namely F1 (F_I) and accuracy (Acc_I), and label-based metrics, namely macro F1 (F_macro) and micro F1 (F_micro), used in multi-label classification (Zhang and Zhou, 2014).
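For concreteness, the example-based metrics can be sketched as below, following the common set-based definitions surveyed by Zhang and Zhou (2014); the function names are ours:

```python
def example_f1(y_true, y_pred):
    """Example-based F1 (F_I): per-instance F1 of predicted vs. true
    label sets, averaged over instances."""
    scores = []
    for t, p in zip(y_true, y_pred):
        if not t and not p:
            scores.append(1.0)  # both empty: perfect agreement
        else:
            scores.append(2 * len(t & p) / (len(t) + len(p)))
    return sum(scores) / len(scores)

def example_accuracy(y_true, y_pred):
    """Example-based accuracy (Acc_I): Jaccard overlap of the predicted
    and true label sets, averaged over instances."""
    scores = []
    for t, p in zip(y_true, y_pred):
        union = len(t | p)
        scores.append(len(t & p) / union if union else 1.0)
    return sum(scores) / len(scores)
```

The label-based F_macro and F_micro are the usual per-label F1 scores averaged unweighted over labels and computed from pooled counts, respectively.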

Baselines
Random: Labels are selected randomly as per their normalized frequencies in the training data for each test sample.

Traditional Machine Learning (ML): We experiment with Support Vector Machine (SVM), Random Forests (RF), and Logistic Regression (LR). The features explored include TF-IDF on character n-grams (1-5 characters), TF-IDF on word unigrams and bigrams, the mean of the ELMo vectors for the words in a post, and a composite set of features similar to (Anzovino et al., 2018) comprising n-gram-based, POS-based, and doc2vec (Le and Mikolov, 2014) features, the post length, and the adjective count.

LSTM-based Architectures

biLSTM: The word embeddings for all words in a post are fed to a bidirectional LSTM.
biLSTM-Attention: Same as biLSTM but with the attention mechanism by Yang et al. (2016).
Hierarchical-biLSTM-Attention: For the words in each sentence, the word embeddings are passed through biLSTM with attention to create a sentence embedding. These sentence embeddings are in turn fed to another instance of biLSTM with attention. This broadly follows the architecture proposed for document classification by Yang et al. (2016) with GRUs replaced with LSTMs.

CNN and CNN-LSTM based Architectures
CNN-Kim: Similar to (Kim, 2014), this involves applying convolutional filters followed by max-over-time pooling to the word vectors for a post.
C-biLSTM: In this variant of the C-LSTM architecture (Zhou et al., 2015) somewhat related to an approach used by Karlekar and Bansal (2018) for multi-label sexual harassment classification, after applying convolution on the word vectors for a post, the feature maps are stacked along the filter dimension to create a sequence of window vectors, which are then fed to biLSTM.
CNN-biLSTM-Attention: For each sentence, convolutional and max-over-time pooling layers are applied on the embeddings of its words. The resultant sentence representations are put through a biLSTM with the attention mechanism. This approach is similar to (Wang et al., 2016) with the attention scheme from (Yang et al., 2016) added.
The architectures of the deep learning baselines have a fully connected layer with the sigmoid or softmax non-linearity (depending on the loss function used) at the end.

Table 2 shows results produced using the traditional ML methods (SVM, RF, and LR) across four different feature sets (word n-grams, character n-grams, averaged ELMo vectors, and composite features). We use Label Powerset for these methods, since the direct (non-transformative) formulation cannot be used with them. Among these combinations, logistic regression with averaged ELMo embeddings as features performs the best. Table 3 contains results for the random and deep learning baselines and different variants of the proposed framework. For each method, the average over three runs is reported for each metric. We find ELMo to be better than GloVe and fastText for word embeddings across multiple baselines and hence show only ELMo-based results for the baselines. We report all results with the EBCE loss; the NCE loss produced inferior results across multiple methods. For our framework, s() denotes sentence-level concatenation; wl() denotes word-level concatenation and LSTM-based processing; wc() denotes word-level concatenation and CNN-based processing. We note that the results are reported for only some of the many instances that can arise from our configurable architecture. Our framework provides the ability to explore different configurations, such as those with multiple s() operations, depending on the problem at hand.

Results
We observe the following: (1) The random baseline performs poorly, confirming the complexity of the problem. (2) biLSTM-Attention and Hierarchical-biLSTM-Attention are the two best baselines. (3) Several variants of the proposed framework outperform all baselines. Based on F I and F macro , our best method is s(wl(ELMo), wl(GloVe), tBERT), though adding linguistic features (Ling) to it slightly improves some metrics.
(4) BERT tuned on unlabeled instances of sexism (tBERT) works better than the vanilla BERT counterpart and other sentence encoders. (5) Combining tBERT sentence representations with those generated from ELMo word vectors using biLSTM with attention works better than using either individually. (6) Along with tBERT, concatenating ELMo and GloVe at the sentence level (s(wl(ELMo), wl(GloVe), tBERT)) is better than concatenating them at the word level (s(wl(ELMo, GloVe), tBERT)) while processing word vectors using biLSTM with attention. (7) The LSTM-based processing of word embeddings produces better results than its CNN-based counterpart.

The pre-processing steps that we perform for all deep learning methods include removing certain non-alphanumeric characters and extra spaces, lower-casing, and zero-padding input tensors as appropriate. While breaking a post into sentences, each sentence containing more than 35 words is split into multiple sentences, ensuring a maximum sentence length of 35 words.
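The sentence-length capping described above can be sketched as follows (the helper name is ours; a sentence longer than the cap is chunked into consecutive pieces of at most the cap length):

```python
def split_long_sentences(sentences, max_len=35):
    """Enforce a per-sentence word cap by chunking overlong sentences.

    Each sentence with more than max_len words is split into consecutive
    pieces of at most max_len words; shorter sentences pass through.
    """
    out = []
    for sent in sentences:
        words = sent.split()
        for start in range(0, len(words), max_len):
            out.append(" ".join(words[start:start + max_len]))
    return out
```

This keeps the |W| dimension of the input tensors bounded while discarding no words.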
Using experiments on a validation set, which was merged into the training set during the test runs, we choose, for each method, the values of three hyper-parameters: the LSTM dimension, the attention dimension, and the number of CNN filters for each of the kernel sizes 2, 3, and 4. The values used for instances of our framework and the deep learning baselines are provided in Table 5.
We employ a dropout of 0.25 after each input and before the final, fully connected layer. The learning rate was set to 0.001 and the number of epochs to 10. We use a batch size of 64. These fixed parameters were kept unchanged across methods.
The hyper-parameter values for the traditional ML methods are as follows. For SVM, soft margin (C) is set to 1.0. For RF (Random Forest), the number of estimators is 100. For extracting character and word n-grams, the maximum number of features used, word n-gram range, and character n-gram range are 10000, (1,2), and (1,5) respectively. For SVM and LR (Logistic Regression), we apply class imbalance correction.
For tapping unlabeled data, we pre-train the 'BERT-Base, Uncased' model 4 for 100000 steps with a batch size of 25. For vanilla BERT, we use the bigger 'BERT-Large, Uncased' model, which we could not use for pre-training because of computational constraints. For generating GloVe word embeddings, we use the 840B-token, cased model.

Table 4 lists accounts of sexism from the test set for which our best method made the right predictions but the best baseline did not, along with the labels. It also highlights the top two words per sentence based on the word-level attention weights for wl(ELMo) and wl(GloVe) combined through element-wise max operations.

Table 4: Test accounts (with the top two attention words per sentence in parentheses) and their labels.
- [account text omitted] — Role stereotyping, Hostile work environment, Sexual harassment (excluding assault)
- I didn't appreciate it when my own father walked into the house one day while I was doing laundry and told me that "it's nice to see you finally doing women's work." (womens, work) — Role stereotyping, Moral policing and victim blaming
- being told I should take cat calls as compliments by my father (compliments, cat) — Denial or trivialisation of sexist misconduct, Sexual harassment (excluding assault)
- Referred to as 'not a girl' because I have short hair and don't wear noticeable makeup. (makeup, hair) — Attribute stereotyping, Body shaming
- At our school girls are forbidden to wear tight trowsers or remove their blazers-in fear of distracting the boys. (tight, trowsers) — Moral policing and victim blaming, Hyper-sexualization (excluding body shaming)

For the first account of sexism, our model produces words like "intern", "assistant" and "boss", associated with role stereotyping, among the top two words across sentences. Likewise, "honey", related to sexual harassment, and "boss", "coworker", and "services", related to hostile work environments, also surface. Moreover, the top two sentences based on the sentence-level attention weights of our model are the last two, which evidence all category labels. For other posts too, the model produces category-relevant top two words per sentence; "womens" and "work" relate to role stereotyping and moral policing; "tight" and "trowsers" relate to hyper-sexualization and moral policing; "makeup" relates to attribute stereotyping; "hair" from "short hair" relates to body shaming; "cat" from "cat calls" relates to sexual harassment; "compliments" from "take cat calls as compliments" relates to denial or trivialization of sexist misconduct.

Table 6 shows results for one run of our best method across different numbers of labels per post (1 to 5). Entries for values 6 and 7, which have fewer than 10 associated test samples, are omitted.
The best results are observed for values 2 to 4, suggesting that our approach performs better on multi-label samples.