A Metric Learning Approach to Misogyny Categorization

The task of automatic misogyny identification and categorization has received less attention than other natural language tasks, even though it is crucial for identifying hate speech in social Internet interactions. In this work, we address this sentence classification task from a representation learning perspective, using both a bidirectional LSTM and BERT optimized with the following metric learning loss functions: contrastive loss, triplet loss, center loss, congenerous cosine loss and additive angular margin loss. We set a new state of the art for the task with our fine-tuned BERT, whose sentence embeddings can be compared with a simple cosine distance, and we release all our code as open source for easy reproducibility. Moreover, we find that almost every loss function performs equally well in this setting, matching the regular cross entropy loss.


Introduction
Whether it is at the word or at the sentence level, learning robust representations allows neural networks to consolidate knowledge that can later be transferred to other tasks and domains. Many approaches have dealt with this problem in different ways, for instance with CBOW or skip-gram from word2vec (Mikolov et al., 2013) for context-independent word embeddings, or more recently with BERT's (Devlin et al., 2019) sentence embeddings and contextual word embeddings.
In order to learn sentence representations, a neural encoder enc needs to learn a mapping from an initial representation x_i to a target vector space. In a metric learning approach, the distance between each pair of sentence embeddings (enc(x_i), enc(x_j)) should be low if y_i = y_j (intra-class compactness) and high if y_i ≠ y_j (inter-class separability). To achieve this objective, the angle θ_ij separating a pair of embeddings (as depicted in Figure 1) can be used to redefine the model's loss function.
In the domain of face recognition, many loss functions (Schroff et al., 2015; Wen et al., 2016; Liu et al., 2017; Wang et al., 2018; Deng et al., 2019) have been proposed to learn better face representations, motivated by high intra-class variability due to lighting, position or background. Other studies have experimented with these methods in different domains with similar characteristics, like speaker verification (Bredin, 2017; Chung et al., 2018; Yadav and Rai, 2018), and even as an enhancement of BERT's sentence representations (Reimers and Gurevych, 2019) for semantic textual similarity. A recent study (Srivastava et al., 2019) has also focused on comparing these methods on face verification, showing that angular margin losses achieve superior performance.
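The relation between the angle θ_ij and the cosine distance can be made concrete with a minimal sketch. It assumes embeddings are plain lists of floats; the function names are illustrative and not taken from the released code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angle(u, v):
    """Angle theta_ij in radians; the cosine is clamped to [-1, 1]
    to guard against floating-point drift before calling acos."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(cos)

def cosine_distance(u, v):
    """1 - cos(theta_ij): 0 for aligned vectors, 2 for opposite ones."""
    return 1.0 - cosine_similarity(u, v)
```

Under this distance only the direction of an embedding matters, not its norm, which is why it pairs naturally with the angular losses discussed later.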
On the other hand, the automatic misogyny identification (AMI) evaluation campaign (Fersini et al., 2018a) was proposed to address misogyny in tweets. The tasks included identification (i.e. misogynous or not), categorization over five different misogyny types, and target identification (an individual or a group). However, no participant has proposed a metric learning model. The best system (Ahluwalia et al., 2018) uses a bidirectional LSTM with word embeddings of size 100 for the identification task, and ensemble methods with feature engineering for category and target classification. They achieve a macro F1 score of 36.1 on the misogyny categorization part of sub-task B, which is the one we address as well. A different architecture (Caselli et al., 2018) uses a multi-layer character bidirectional LSTM for categorization, obtaining a macro F1 score of 14.1.
In this paper, we focus on five metric learning losses for the task of misogyny categorization, using the AMI (Fersini et al., 2018a) dataset. Our hypothesis was that metric learning might reduce the natural intra-class variability within misogyny categories, making representations robust to writing styles, irony, insults, etc. The loss functions we experiment with are contrastive loss (Hadsell et al., 2006), triplet loss (Schroff et al., 2015), center loss (Wen et al., 2016), congenerous cosine loss (Liu et al., 2017) and additive angular margin loss (Deng et al., 2019), as well as cross entropy loss. We optimize these loss functions with two different architectures: a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) and BERT (Devlin et al., 2019), and we evaluate their performance using a simple K-nearest neighbors (KNN) classifier to better measure representation quality.
Our main contributions consist of new state-of-the-art performance for the misogyny categorization task, as well as empirical evidence that these methods do not perform better than cross entropy loss on closed-set sentence classification. Moreover, our code is released as open source for easy reproducibility.

Loss Functions
In this section, we present the loss functions chosen for our study, which can be separated into contrast-based and classification-based, according to how they are computed.

Contrast-based losses
The contrastive loss (Hadsell et al., 2006) uses pairs annotated as similar/dissimilar (also called positive/negative). It brings representations from similar examples closer together, while explicitly separating dissimilar ones. It is computed from P+, the number of similar pairs, P−, the number of dissimilar pairs, D_i = 1 − cos θ_i, the distance between the embeddings of the ith pair, and a margin m.
The triplet loss (Schroff et al., 2015) is calculated over triplets composed of a reference example known as the anchor, a positive and a negative, the latter two defined with respect to the anchor. Following the idea introduced by Gelly and Gauvain (2017), we define this loss using the sigmoid function. It depends on T, the number of triplets, α, a scaling hyper-parameter, θ^p_i, the angle separating the anchor and the positive embeddings, and θ^n_i, the angle separating the anchor and the negative ones.
Taking Figure 1 as an example, contrast-based losses encourage the cosine distance between embeddings i and j to be larger if y_i ≠ y_j, and smaller if y_i = y_j. This is achieved a single pair at a time with contrastive loss, while triplet loss does it jointly using both the positive and the negative inside the triplet.
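To make the two contrast-based objectives concrete, here is a minimal sketch built only from the quantities named above. The exact functional forms are assumptions: the contrastive term follows the common squared-hinge formulation of Hadsell et al. (2006), and the triplet term applies a sigmoid to the scaled difference of cosines. The normalization and constant factors of the formulation actually used may differ.

```python
import math

def contrastive_loss(distances, is_similar, margin):
    """Assumed form: mean of D_i^2 over similar pairs plus mean of
    max(0, m - D_i)^2 over dissimilar pairs, with D_i = 1 - cos(theta_i)."""
    pos = [d for d, s in zip(distances, is_similar) if s]
    neg = [d for d, s in zip(distances, is_similar) if not s]
    pos_term = sum(d ** 2 for d in pos) / len(pos) if pos else 0.0
    neg_term = (
        sum(max(0.0, margin - d) ** 2 for d in neg) / len(neg) if neg else 0.0
    )
    return pos_term + neg_term

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def triplet_sigmoid_loss(cos_pos, cos_neg, alpha):
    """Assumed form: mean over triplets of
    sigmoid(alpha * (cos(theta_n_i) - cos(theta_p_i)))."""
    terms = [sigmoid(alpha * (cn - cp)) for cp, cn in zip(cos_pos, cos_neg)]
    return sum(terms) / len(terms)
```

In this sketch a dissimilar pair stops contributing once its distance exceeds the margin, while a triplet contributes less and less as the anchor becomes closer to the positive than to the negative.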

Classification-based losses
These loss functions derive from the cross entropy loss, either by modifying how the classification layer output is calculated or by adding a penalization term. The cross entropy loss is computed over N, the number of training examples, from σ_i, the output of the classification layer for the ith example, and y_i, its class. The congenerous cosine (CoCo) loss (Liu et al., 2017) interprets the weights w_k of the classification layer as class centroids, learning to maximize the cosine similarity between a representation and its centroid. The classification layer output σ_i is redefined in terms of θ_{i,w_k}, the angle separating the ith representation and w_k, and a scaling hyper-parameter α.
The additive angular margin (AAM) loss (Deng et al., 2019) goes one step further, adding a margin in angular space to penalize the distance between a representation and its centroid. Here m is a margin, applied through the indicator δ_ik, which equals 1 if k = y_i and 0 otherwise. Finally, the center loss (Wen et al., 2016) penalizes the cross entropy loss with the distance to jointly learned centroids c_k external to the classification layer, where λ is a hyper-parameter controlling the effect of the penalization.
To see the effect of classification-based losses more intuitively, consider the embeddings and centers in Figure 1. If y_i = k, then both congenerous cosine loss and center loss will penalize the loss value with the distance from embedding i to w_k (or c_k in the case of center loss), hence bringing all vectors from class k close to the centroid k. The additive angular margin loss follows the same principle, but penalizes further by artificially augmenting the distance of embedding i to w_k with the angular margin.
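The following sketch illustrates how the classification layer output could be computed for a single example under each classification-based loss. The logit forms α·cos θ (CoCo) and α·cos(θ + δ_ik·m) (AAM) follow the standard formulations of the cited papers; the center penalty based on cosine distance is an assumption made here for consistency with the rest of the setup, and all function names are hypothetical.

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def coco_logits(x, centroids, alpha):
    """CoCo: each logit is the scaled cosine to a class centroid w_k."""
    return [alpha * cos_sim(x, w) for w in centroids]

def aam_logits(x, centroids, y, alpha, margin):
    """AAM: like CoCo, but the angle to the target centroid (k == y)
    is enlarged by the margin before taking the cosine."""
    logits = []
    for k, w in enumerate(centroids):
        theta = math.acos(max(-1.0, min(1.0, cos_sim(x, w))))
        if k == y:
            theta += margin
        logits.append(alpha * math.cos(theta))
    return logits

def cross_entropy(logits, y):
    """Negative log-softmax of the target class (log-sum-exp for stability)."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[y]

def center_penalty(x, center, lam):
    """Assumed penalty: lambda times the cosine distance to the class center."""
    return lam * (1.0 - cos_sim(x, center))
```

For the same embedding, the AAM logits yield a higher cross entropy than the CoCo logits, which is the "artificial augmentation" of the distance described above: the encoder has to place the embedding closer to its centroid to reach the same loss value.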

Task
The term misogyny is defined as hatred towards women. Hate speech of this nature is unfortunately common in social Internet interactions, and current language models are generally unable to accurately detect and classify it. The AMI task and corpus were proposed in the context of the IberEval 2018 (Fersini et al., 2018b) and Evalita 2018 (Fersini et al., 2018a) evaluation campaigns, allowing researchers to train models focused specifically on misogyny. The corpus consists of an ensemble of tweets with three different types of annotations: misogyny (binary), misogyny category and target (active or passive).
We use the same dataset as in Fersini et al. (2018a) and we focus exclusively on misogyny categorization, using an additional class for non misogynous tweets. Our results are thus compared to the categorization part of sub-task B. An explanation of misogyny categories according to the definitions given in Fersini et al. (2018a) can be found in Table 2.

Table 1: Class distribution of the Train, Dev and Test sets.

Class                Train    Dev     Test
derailing               74     18       11
discredit              811    203      141
dominance              118     30      124
sexual harassment      282     70       44
stereotype             143     36      140
non misogynous       1,772    443      540
total                3,200    800    1,000

As the corpus does not provide a development set, one was constructed from the training set following the same class distribution. The final Train set is composed of 3,200 tweets, and the Dev and Test sets of 800 and 1,000 tweets respectively. Class distribution is described in detail in Table 1. The task is evaluated using the macro F1 score.
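As an illustration of the two steps just described, the sketch below shows one way to carve out a development set that preserves the class distribution, and how the macro F1 score averages per-class F1 values. The 20% fraction is inferred from the set sizes (800 out of 4,000 tweets); the seed, the function names and the exact sampling procedure are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, dev_fraction=0.2, seed=0):
    """Return (train_indices, dev_indices), sampling the same fraction
    of every class so that both sets share the class distribution."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, dev = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_dev = round(len(indices) * dev_fraction)
        dev.extend(indices[:n_dev])
        train.extend(indices[n_dev:])
    return sorted(train), sorted(dev)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class weighs the same in the average, a model that only predicts the majority class (non misogynous) scores poorly under macro F1, which makes the metric appropriate for a distribution as skewed as the one in Table 1.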

Experimental protocol
As different losses rely on different hyperparameters, we perform a hyper-parameter search including learning rates, margins m, scalings α, and λ. The values we have experimented with are shown in Table 3. Each configuration is trained on Train for 60 epochs and validated using a KNN classifier on Dev. As we deal with a rather small dataset, the best configuration for each loss and each architecture is then trained and validated from scratch 10 times to reduce the effect of randomness. Reported results are the mean macro F1 score and standard deviation on Test over these 10 runs.
In all experiments we use the cosine distance to compare embeddings, as congenerous cosine loss and additive angular margin loss can only be optimized in this way. Additionally, a linear classification layer is jointly trained with the sentence encoder when optimizing classification-based loss functions.

Architecture
Table 2: Misogyny categories, with descriptions according to Fersini et al. (2018a) and an example tweet for each.

derailing
    Description: "to justify women abuse, rejecting male responsibility"
    Example: "if rape is real why aren't more people reporting it? just another feminist lie"
discredit
    Description: "slurring over women with no other larger intention"
    Example: "this b*** is a s***"
dominance
    Description: "to assert the superiority of men over women to highlight gender inequality"
    Example: "#didyouknow the male brain is 3.4 times larger than the female brain? #maledominance"
sexual harassment
    Description: "sexual advances, harassment of a sexual nature, etc."
    Example: "come on box I show you my c*** darling"
stereotype
    Description: "a widely held but fixed and oversimplified image or idea of a woman"
    Example: "these people are hysterical. it's like a commercial for why men should never marry [...]"

We experiment with two different encoder architectures. The first one is a one-layer bidirectional LSTM (Hochreiter and Schmidhuber, 1997) with output size 768 (to match BERT) and word embeddings of size 300 obtained from a word2vec CBOW model (Mikolov et al., 2013) trained on 2-billion-word Wikipedia dumps. The second one is BERT (Devlin et al., 2019). To obtain a sentence embedding from an encoder, we perform a max pooling over the hidden states of the last layer, leaving us with sentence embeddings of size 768 on both models.
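The pooling step can be sketched as follows, assuming the last-layer hidden states of one sentence are given as a list of equally sized token vectors. This is an illustrative helper rather than the released implementation, which would operate on batched tensors.

```python
def max_pool(hidden_states):
    """Element-wise maximum over the token axis.

    hidden_states: list of T token vectors of dimension d.
    Returns a single sentence vector of dimension d.
    """
    return [max(values) for values in zip(*hidden_states)]
```

For example, `max_pool([[1.0, -2.0, 0.0], [0.0, 5.0, -1.0]])` keeps the largest value of each dimension across the two tokens. In a batched setting, padded positions would typically need to be masked out before taking the maximum so that they cannot influence the sentence embedding.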

Implementation details
All sentences are pre-tokenized using the TweetTokenizer from the NLTK toolkit (Bird et al., 2009) in order to correctly deal with Twitter-specific tokens like hashtags, mentions, and even emojis. During this process we remove handles and URLs. When training BERT, we do a second pass of tokenization with BERT's pretrained tokenizer. We use a batch size of 32 sentences and RMSprop as optimizer, reducing the learning rate by half every 5 epochs of no improvement. The best configurations found during hyper-parameter search for each architecture and loss function are shown in Table 4.
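The learning rate schedule described above can be sketched in a few lines. The class name is hypothetical, and the monitored quantity is assumed to be a validation score where higher is better (for instance the macro F1 on Dev); the text does not specify which metric is tracked.

```python
class HalveOnPlateau:
    """Multiply the learning rate by `factor` after `patience`
    consecutive epochs without improvement of the monitored score."""

    def __init__(self, lr, patience=5, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = None
        self.stale_epochs = 0

    def step(self, score):
        """Record one epoch's validation score and return the new rate."""
        if self.best is None or score > self.best:
            self.best = score
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr
```

Calling `step` once per epoch with the validation score returns the rate to use for the next epoch. Deep learning libraries ship comparable plateau-based schedulers, although their patience semantics may differ slightly from this sketch.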
Our code is released as open source, available at github.com/juanmc2005/MetricAMI.

Evaluation
We evaluate each model with the macro F1 score of a KNN classifier with K = 10 fit with all sentence embeddings from Train. However, given the high class imbalance, the a priori probability of a random embedding being closer to a non-misogynous embedding is higher than for a discredit one (see Table 1). To circumvent this issue, we penalize the vote for class k by the number of examples from k in Train. We believe this simple classifier to be a better measure for representation quality, as it relates to the separability and compactness properties that we expect from a metric learning model.
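A minimal sketch of such a classifier is given below. It assumes that penalizing the vote for class k means weighting each neighbor's vote by the inverse of the number of Train examples of its class; the exact penalization and the tie-breaking rule are assumptions, and the function names are illustrative.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def knn_predict(query, train_embeddings, train_labels, k=10):
    """Predict the class of `query` from its k nearest Train embeddings,
    dividing each vote by the Train frequency of the neighbor's class."""
    class_counts = Counter(train_labels)
    ranked = sorted(
        range(len(train_embeddings)),
        key=lambda i: cosine_distance(query, train_embeddings[i]),
    )
    votes = Counter()
    for i in ranked[:k]:
        label = train_labels[i]
        votes[label] += 1.0 / class_counts[label]
    # Ties are broken alphabetically so that the result is deterministic.
    return max(sorted(votes), key=lambda label: votes[label])
```

With this weighting, a single neighbor from a rare class can outvote several neighbors from a frequent one, which counteracts the prior that favors the non misogynous class.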

Results
The results are summarized in Figure 2. With a fixed architecture, it is clear that all loss functions perform equally, with the exception of LSTM with contrastive and triplet loss. As the LSTM encoder is rather shallow (4.4M parameters) in comparison to BERT (110M parameters), it is possible that contrast-based losses need bigger models to perform competitively.
The fact that almost all losses perform equally well shows that, contrary to our hypothesis, metric learning models perform no better than cross entropy on this task, in contrast to other findings (Srivastava et al., 2019) on face verification. One possible explanation is that the AMI dataset may not contain enough examples or classes for these models to exploit. However, another factor might be responsible for this behavior. One of the key differences of AMI with respect to face verification is the closed-set nature of the problem. An open-set task is evaluated with unseen classes, while a closed-set task is evaluated with unseen instances of the training classes. It is possible that open-set verification tasks are more suitable for metric learning than closed-set tasks, meaning that the power of metric learning might in fact lie in generalizing to unseen classes rather than unseen class instances. The fact that verification tasks more closely resemble the training objective than exact class prediction could provide an explanation for this. On the other hand, our fine-tuned BERT outperforms the Evalita winner baseline (Ahluwalia et al., 2018), setting a new state of the art for misogyny categorization, with the added benefit of producing embeddings that can be compared with a simple cosine distance.
As a final note, results in Table 4 suggest that congenerous cosine loss and center loss hyperparameters could be more sensitive to architecture changes than other losses, as they are the only ones whose best configurations differ from one architecture to the other. Perhaps not surprisingly, we also observe that additive angular margin loss works better with lower margins. This is consistent with the margin's role, serving as an upper bound for the distance between an embedding and its centroid, while the margin in contrastive loss serves as a lower bound for the distance between two negatives.

Conclusion
In this work we have addressed the problem of misogyny categorization from a metric learning perspective, comparing the performance of several loss functions. We hypothesized that reducing intra-class variability in this way would be beneficial. However, we have shown that none of the considered losses can outperform the regular cross entropy on the task. Our results suggest that metric learning approaches might not be suited to closed-set sentence classification tasks. Finally, our fine-tuned BERT sets new state-of-the-art performance, with a macro F1 score of 40.5.