Generating Counter Narratives against Online Hate Speech: Data and Strategies

Recently, research has started focusing on avoiding the undesired effects that come with content moderation, such as censorship and overblocking, when dealing with online hatred. The core idea is to directly intervene in the discussion with textual responses that are meant to counter the hate content and prevent it from spreading further. Accordingly, automation strategies, such as natural language generation, are beginning to be investigated. Still, they suffer from the lack of a sufficient amount of quality data and tend to produce generic/repetitive responses. Aware of the aforementioned limitations, we present a study on how to collect responses to hate effectively, employing large-scale unsupervised language models such as GPT-2 for the generation of silver data, and the best annotation strategies/neural architectures that can be used for data filtering before expert validation/post-editing.


Introduction
Owing to the upsurge in the use of social media platforms over the past decade, Hate Speech (HS) has become a pervasive issue, spreading quickly and widely. Meanwhile, it is difficult to track and control its diffusion, since nuances in cultures and languages make it difficult to provide a clear-cut distinction between hateful and dangerous speech (Schmidt and Wiegand, 2017). The standard approaches to prevent online hate from spreading include the suspension of user accounts or the deletion of hate comments from social media platforms (SMPs), paving the way for accusations of censorship and overblocking. Alternatively, to balance the right to freedom of speech, shadow-banning has been put into use, where the content/account is not deleted but hidden from SMP search results. Still, we believe that we must overstep reactive identify-and-delete strategies and responsively intervene in the conversations (Bielefeldt et al., 2011; Jurgens et al., 2019). In this line of action, some Non-Governmental Organizations (NGOs) train operators to intervene in online hateful conversations by writing counter-narratives. A Counter-Narrative (CN) is a non-aggressive response that offers feedback through fact-bound arguments and is considered the most effective approach to withstand hate messages (Benesch, 2014; Schieb and Preuss, 2016). To be effective, a CN should follow guidelines similar to those in the 'Get the Trolls Out' project, in order to avoid escalating the hatred in the discussion.
Still, manual intervention against hate speech is not scalable. Therefore, data-driven NLG approaches are beginning to be investigated to assist NGO operators in writing CNs. As a necessary first step, diverse CN collection strategies have been proposed, each of which has its advantages and shortcomings (Mathew et al., 2018; Qian et al., 2019; Chung et al., 2019).
In this study, we aim to investigate methods to obtain high-quality CNs while reducing the effort required from experts. We first compare data collection strategies with respect to the two main requirements that datasets must meet: (i) data quantity and (ii) data quality. Finding the right trade-off between the two is in fact a key element for effective automatic CN generation. To our understanding, none of the collection strategies presented so far is able to fulfill this requirement. Thus, we test several hybrid strategies to collect data, mixing niche-sourcing, crowd-sourcing, and synthetic data generation obtained by fine-tuning deep neural architectures specifically developed for NLG tasks, such as GPT-2 (Radford et al., 2019). We propose an author-reviewer framework in which an author is tasked with text generation and a reviewer, either a human or a classifier model, filters the produced output. Finally, a validation/post-editing phase is conducted with NGO operators over the filtered data. Our findings show that this framework is scalable, allowing us to obtain datasets that are suitable in terms of diversity, novelty, and quantity.

CN Collection Approaches
Three prototypical strategies to collect HS-CN pairs have been presented recently.
Crawling (CRAWL). Mathew et al. (2018) build on the intuition that CNs can be found on SMPs as responses to hateful expressions. The proposed approach is a mix of automatic HS collection via linguistic patterns and manual annotation of replies to check whether they are responses that counter the original hate content. Thus, all the collected material consists of natural/real occurrences of HS-CN pairs.

Crowdsourcing (CROWD). Qian et al. (2019) propose that, once a list of HS is collected from SMPs and manually annotated, crowd-workers (non-experts) can be briefly instructed to write possible responses to the hate content. In this case the content is obtained in controlled settings, as opposed to crawling approaches.
Nichesourcing (NICHE). The study by Chung et al. (2019) still relies on the idea of outsourcing and collecting CNs in controlled settings. However, in this case the CNs are written by NGO operators, i.e. persons specifically trained to fight online hatred via textual responses, who can be considered experts in CN production.

Characteristics of the Datasets
Regardless of the HS-CN collection strategy, datasets must meet two criteria: quality and quantity. While quantity has a straightforward interpretation, we propose that data quality should be decomposed into conformity (to NGO guidelines) and diversity (lexical & semantic). Additionally, HS-CN datasets should not be ephemeral, which is a structural problem with crawled data since, due to copyright limitations, datasets are usually distributed as lists of tweet IDs (Klubička and Fernández, 2018). With generated data (crowdsourcing or nichesourcing) the problem is avoided.

Quantity. While the CRAWL dataset is very small and ephemeral, representing more a proof of concept than an actual dataset, the CROWD dataset involved more than 900 workers to produce ≈41K CNs, and the NICHE dataset was constructed with the participation of ≈100 expert operators to obtain ≈4K pairs (in three languages), resorting to HS paraphrasing and pair translation to reach the final ≈14K HS-CN pairs. Evidently, employing non-experts, e.g. crowd-workers or annotators, is preferable in terms of data quantity.

Quality. In terms of quality, we consider diversity of paramount importance, since verbatim repetition of arguments can become detrimental to operator credibility and to the CN intervention itself. Following Li et al. (2016), we distinguish between (i) lexical diversity and (ii) semantic diversity. While lexical diversity concerns the surface realization of CNs and can be captured by word-overlap metrics, semantic diversity concerns meaning and is harder to capture, as in the case of CNs with similar meaning but different wordings (e.g., "Any source?" vs. "Do you have a link?").

(i) Semantic Diversity & Conformity. To model semantic diversity and conformity, we focus on the CN 'argument' types present in the various datasets. Argument types are useful in assessing content richness (Hua et al., 2019). In a preliminary analysis, CROWD CNs are observed to be simpler, mainly focusing on 'denouncing' the use of profanity, while NICHE CNs are richer, with a higher variety of arguments. CRAWL CNs, on the other hand, cover diverse arguments to a certain extent while being highly prone to containing profanities. To perform a quantitative comparison, we randomly sampled 100 pairs from each dataset and annotated them according to the CN types presented by Benesch et al. (2016). The results are reported in Table 1. For the sake of conciseness we focus on the hostile, denouncing and consequences classes, assigning other to all remaining types (including the fact class). Clearly, CRAWL does not meet the conformity standards of CNs, considering the vast amount of hostile responses (50%), while still granting a certain amount of type variety (other: 34%). Contrarily, CROWD conforms to the CN standards (hostile: 0%), yet mostly focuses on pure denouncing (76%) or denouncing with simple arguments (10%). The class other (14%) consists almost exclusively of simple arguments, such as "All religions deserve tolerance". In NICHE, instead, arguments are generally, and expectedly, more complex and articulated, and represent the vast majority of cases (81%). A few examples of CN types are given in Table 2.
(ii) Lexical Diversity. The Repetition Rate (RR) measures the repetitiveness of a collection of texts by considering the rate of non-singleton n-gram types it contains (Cettolo et al., 2014; Bertoldi et al., 2013). We use RR instead of the simple count of distinct n-grams (Xu et al., 2018; Li et al., 2016) or the standard type/token ratio (Richards, 1987) since it allows us to compare corpora of different sizes by averaging the statistics collected on a sliding window of 1000 words (a minimal computation sketch is provided at the end of this section). Since CROWD and NICHE contain CNs repeated across different HSs, we first removed repeated CNs and then applied a shuffling procedure to avoid that CNs answering the same HS (and therefore more likely to contain repetitions) appear close together. Results in Table 1 show that NICHE is the dataset with the highest lexical diversity (lowest RR), followed by CRAWL and CROWD.

Discussion. We can reasonably conclude that: (i) crawling, as presented in (Mathew et al., 2018), is not yet a mature procedure for CN collection, even if it is promising; (ii) nichesourcing produces by far the best and most diverse material, but it is also the most challenging to implement, considering the difficulty of reaching agreements with NGOs specialized in CN creation, and it does not provide a sufficient amount of data; (iii) on the contrary, crowdsourcing seems the only strategy that can grant the amount of data needed for deep learning approaches, but it yields simpler and more stereotyped arguments. A summary of the pros and cons of each collection approach is presented in Table 3.
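For concreteness, the following is a minimal sketch of the RR computation. The whitespace tokenization and the non-overlapping windows are simplifying assumptions with respect to the original definition, which takes the geometric mean, over n = 1..4, of the rates of non-singleton n-gram types collected on a sliding window of 1000 words.

from collections import Counter

def repetition_rate(text, window=1000, max_n=4):
    tokens = text.split()  # assumption: simple whitespace tokenization
    rates = []
    for n in range(1, max_n + 1):
        non_singleton, total = 0, 0
        # non-overlapping windows as a simplification of the sliding window
        for start in range(0, len(tokens), window):
            chunk = tokens[start:start + window]
            ngrams = Counter(tuple(chunk[i:i + n])
                             for i in range(len(chunk) - n + 1))
            non_singleton += sum(1 for c in ngrams.values() if c > 1)
            total += len(ngrams)
        rates.append(non_singleton / total if total else 0.0)
    # geometric mean over the four n-gram orders
    rr = 1.0
    for r in rates:
        rr *= r
    return rr ** (1.0 / max_n)

A lower RR thus indicates that fewer n-gram types recur within a window, i.e. higher lexical diversity.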

CN Generation through the Author-Reviewer Architecture

Since none of the aforementioned approaches alone can be decisive for creating proper CN datasets, we propose a novel framework that mixes crowdsourcing and nichesourcing to obtain new quality data while reducing collection cost/effort. The key elements of this mix are: (i) an external element in the framework must produce HS-CN candidates; (ii) non-experts should pre-filter the material to be presented to/validated by experts. Thus, we settle on the author-reviewer modular architecture (Oberlander and Brew, 2000; Manurung et al., 2008). In this architecture the author has the task of generating a text that conveys the correct propositional content (a CN), whereas the reviewer must ensure that the author's output satisfies certain quality properties. The reviewer finally evaluates the viability of the text and picks the candidates to present to the NGO operators for final validation/post-editing.
Hostile "Hell is where u belong!Stupid f***t... go hang yourself!!" Denouncing "The N word is unacceptable.Please refrain from future use."Fact "The majority of sexual assaults are committed by a family member, friend, or partner of the victim, and only 12% of convicted rapists are Muslim.It is not the religion, its the individuals, whether they're Muslim or not."-Niche.
The author-reviewer architecture that we propose differs from previous studies in two respects: (i) it is used for data collection rather than for NLG; (ii) we modified the original configuration by adding a human reviewer and a final post-editing step.
We first tested four different author configurations, then three reviewer configurations, keeping the best author configuration constant. A representation of the architecture is shown in Figure 1.

The Author: Generation Approaches
In order to obtain competent models that can provide automatic counter-narrative hints and suggestions to NGO operators, we have to overcome data bottlenecks/limitations, i.e. either the limited amount of training data in NICHE or its repetitiveness in CROWD, especially when using neural NLP approaches. Pre-trained Language Models (LMs) have achieved promising results when fine-tuned on challenging generation tasks such as chit-chat dialog (Wolf et al., 2019; Golovanov et al., 2019). In this respect, we propose using the recent large-scale unsupervised language model GPT-2 (Radford et al., 2019), which is capable of generating coherent text and can be fine-tuned and/or conditioned on various NLG tasks. It is a large transformer-based (Vaswani et al., 2017) LM trained on a dataset of 8 million web pages. We used the medium model, the largest available during our experimentation, which contains 345 million parameters, with 24 layers, 16 attention heads, and an embedding size of 1024. We fine-tuned two models with GPT-2, one on the NICHE and one on the CROWD dataset, for counter-narrative generation.
NICHE - Training and test data. We split the data into 5366 HS-CN pairs for training and the remaining 1288 pairs for testing. In particular, the original HS-CN pairs, one HS paraphrase, and the pairs translated from FR and IT were kept for training, while the other HS paraphrases were used for testing. See Chung et al. (2019) for further details.
CROWD - Training and test data. Although the CROWD dataset was created for dialogue-level HS-CN, we could extract HS-CN pairs by selecting the dialogues in which only one utterance was labeled as HS. In this way, we could guarantee that the crowd-produced CNs respond exactly to the labeled utterance. We then applied an 80/20 training/test split, obtaining 26320 and 6337 pairs.

Generation Models. We fine-tuned GPT-2 with a batch size of 1024 tokens and a learning rate of 2e-5. The training pairs are represented as [HS start token] HS [HS end token] [CN start token] CN [CN end token]. While we empirically selected the model checkpoint at the 3600th step of fine-tuning with the NICHE dataset, for the CROWD dataset we selected the checkpoint at the 5000th step. After fine-tuning the models, the generation of CNs for the test HSs was performed using Nucleus Sampling (Holtzman et al., 2019) with a p value of 0.9, which provides enhanced diversity in the generation compared to likelihood-maximization decoding methods, while preserving coherency by truncating the less reliable tail of the distribution. At test time, the input HSs are fed into the models as conditions, i.e. as the initial contexts while sampling the next tokens. Given an input HS, the models produce a chunk of text consisting of a list of HS-CN pairs, of which the first sequence marked with [CN start token] CN [CN end token] is taken as the generated output.
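As an illustration, here is a minimal sketch of this conditioned decoding step written against the Hugging Face transformers API. The checkpoint path and the marker strings are illustrative assumptions, not the exact setup used in our experiments, which uses the [HS/CN start/end] tokens described above.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("path/to/gpt2-finetuned-niche")  # hypothetical checkpoint
model.eval()

def generate_cn(hate_speech: str, p: float = 0.9, max_length: int = 256) -> str:
    # The HS serves as the condition, i.e. the initial context for sampling.
    prompt = f"<hs> {hate_speech} </hs> <cn>"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            input_ids,
            do_sample=True,
            top_p=p,       # Nucleus Sampling: keep only the reliable head of the distribution
            top_k=0,       # disable top-k so that only top-p truncation applies
            max_length=max_length,
            pad_token_id=tokenizer.eos_token_id,
        )
    generated = tokenizer.decode(output[0][input_ids.shape[1]:])
    # The model over-generates further HS-CN pairs; keep only the first CN,
    # i.e. the text up to the CN end marker.
    return generated.split("</cn>")[0].strip()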
Baselines. In addition to the fine-tuned GPT-2 models, we also evaluate two baseline models. Considering the benefits of transformer architectures over recurrent models for parallelization and for learning long-term dependencies (Vaswani et al., 2017), we implemented the baseline models using the transformer architecture. The models were trained similarly to the base model described by Vaswani et al. (2017), with 6 transformer layers, a batch size of 64, 100 epochs, 4000 warmup steps, an input/output dimension of 512, 8 attention heads, an inner-layer dimension of 2048, and a dropout rate of 0.1. We also used Nucleus Sampling with a p value of 0.9 for decoding with the baselines.
In brief, we trained four different configurations/models as authors:
1. TRF_crowd: baseline trained on the CROWD dataset
2. GPT_crowd: GPT-2 fine-tuned on the CROWD dataset
3. TRF_niche: baseline trained on the NICHE dataset
4. GPT_niche: GPT-2 fine-tuned on the NICHE dataset

Metrics. We report both standard metrics, i.e. BLEU (Papineni et al., 2002) and BertScore, and diversity-oriented metrics, i.e. RR and novelty, in Table 4. In terms of BLEU and BertScore, the baseline models yield better performance. However, a few peculiarities of the CN generation task and of the experimental settings hinder a direct and objective comparison of the presented scores among models. First, gathering a finite set of all possible counter-narratives for a given hate speech is a highly unrealistic target. Therefore, we only have a sample of proper CNs for each HS, which is a possible explanation for the very low scores obtained with the standard metrics. Second, the train and test splits of the NICHE dataset contain the same CNs, since the splitting was done using one paraphrase of each HS together with all its original CNs, while the CROWD splits have a similar property, since the exact same CN can be found for many different HSs. Consequently, the non-pretrained transformer models, which are more prone to generating an exact sequence of text from the training set, show relatively better performance with the standard metrics compared to the pre-trained models. Some randomly sampled CNs generated by the various author configurations are provided in the Appendix.
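As a hedged sketch, the standard metrics could be computed with off-the-shelf packages such as nltk and bert-score; the exact tools and settings behind the reported scores are not detailed here, so the following is illustrative only.

from nltk.translate.bleu_score import corpus_bleu
from bert_score import score as bert_score

def evaluate(generated, references):
    """generated: list of CN strings; references: list of lists of gold CNs,
    since several proper CNs may exist for the same HS."""
    bleu = corpus_bleu(
        [[ref.split() for ref in refs] for refs in references],
        [gen.split() for gen in generated])
    # BertScore matches candidate and reference tokens in embedding space;
    # it also accepts multiple references per candidate.
    _, _, f1 = bert_score(generated, references, lang="en")
    return bleu, f1.mean().item()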
Regarding generation quality, we observe that the baseline models cannot achieve the diversity of the GPT-2 models in terms of RR, both for NICHE and CROWD (4.89 vs. 3.23, and 8.93 vs. 5.89). Moreover, GPT-2 provides an impressive boost in novelty (0.04 vs. 0.46 and 0.10 vs. 0.70). Among the GPT-2 models, the quality scores (in terms of RR and novelty) of the CNs generated by GPT_niche are more than double those of the CNs generated with GPT_crowd.
With regard to the overall results, GPT_niche is the most promising configuration to be employed as author. In fact, we observed that, after the output CN, the over-generated chunk of text consists of semantically coherent brand-new HS-CN pairs, marked with proper HS/CN start and end tokens consistent with the training data representation. Therefore, on top of CN generation for a given HS, we can also take advantage of the over-generation capabilities of GPT-2, so that the author module can continuously output plausible HS-CN pairs without the need to provide an HS for which to generate the CN response. This expedient also allows us to avoid the ephemerality problem for HS collection.
To generate HS-CN pairs with the author module, we exploited the model test setting and conditioned the fine-tuned model on each HS in the NICHE test set. After removing the CN output for the test HS, we obtained new pairs of HS-CN. In this way, we generated 2700 HS-CN pairs that we used for our reviewer-configuration experiments.

The Reviewer
The task of the reviewer is sentence-level Confidence Estimation (CE), similar to that used in Machine Translation (Blatz et al., 2004). In this task, the reviewer must decide whether the author output is correct/suitable for a given source text, i.e. a hate speech. Consistently with the MT scenario, one application of CE is filtering candidates for possible human post-editing, which is conducted by the NGO operator when validating the CN. We tested three reviewer configurations:

1. expert-reviewer: the author output is directly presented to NGO operators.
2. non-expert-reviewer: the author output is filtered by human reviewers, then validated by operators.
3. machine-reviewer: filtering is done by a neural classifier before operator validation.

Human Reviewer Experiment
In this section we describe the annotation procedure for the non-expert reviewer configuration.
Setup. We administered the 2700 generated HS-CN pairs to three non-expert annotators, and instructed them to evaluate each pair in terms of the CN's 'suitableness' with respect to the corresponding hate speech.
Instructions. We briefly described what an appropriate and suitable CN is, then instructed the annotators not to overthink the evaluation, but to give a score based on their intuition. We also provided a list of 20 HS-CN pairs exemplifying proper evaluation.
Measurement. We opted for a 0-3 scale, rather than a binary CE response, since it allows us to study various thresholds for better data selection.
In particular, the meanings of the scores are as follows: 0, not suitable; 1, suitable with small modifications (e.g. grammatical or semantic fixes); 2, suitable; 3, extremely good as a CN. We also asked annotators to discard pairs in which the hate speech was not well formed. For each pair we gathered two annotator scores.

Filtered Data. After the non-expert evaluation, we applied two different thresholds to obtain the pairs to be presented to the expert operators: (i) a score of at least 2 from both annotators (Reviewer≥2), yielding high-quality data where no post-editing is necessary; (ii) a score of at least 1 from both annotators (Reviewer≥1), providing reasonable quality with a possible need for post-editing.
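In code, the two conditions reduce to a simple threshold on the minimum of the two annotator scores; the following minimal sketch assumes each pair is a dict with a "scores" field, a hypothetical representation.

def filter_pairs(pairs):
    # Reviewer>=2: both annotators gave at least 2 (no post-editing needed)
    reviewer_geq2 = [p for p in pairs if min(p["scores"]) >= 2]
    # Reviewer>=1: both annotators gave at least 1 (post-editing may be needed)
    reviewer_geq1 = [p for p in pairs if min(p["scores"]) >= 1]
    return reviewer_geq1, reviewer_geq2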
The statistics reported in Table 5 show that high-quality pairs (Reviewer≥2) account for only a small fraction (10%) of the produced data, and only one third was of reasonable quality (Reviewer≥1), while the vast majority was discarded. Some randomly selected filtered pairs are provided in the Appendix.

Machine Reviewer Experiment
As the machine reviewer we implemented two neural classifiers tasked with assessing whether a given HS-CN pair is proper. The two models are based on the BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2019) architectures.

Training data. We created a balanced dataset with 1373 positive and 1373 negative examples for training purposes. The positive pairs come both from the NICHE dataset and from the examples annotated in the human reviewer setting (Reviewer≥2). The negative pairs consist of the examples annotated in the human reviewer setting in the 'at least one 0' bin. In addition, 50 random HSs from NICHE-training are used with verbatim repetition as HS-HS pairs, to discourage the same text appearing as both HS and CN in a pair, and 50 random HSs are paired with other random HSs, simulating the condition of inappropriate CNs containing hateful text.

Test data. We collected a balanced test set with 101 positive and 101 negative pairs. Both positive and negative examples were created by replicating the non-expert reviewer annotation described in Section 7.1 on new CNs generated from the NICHE test set with the author model GPT_niche.
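A minimal sketch of how the degenerate negative pairs can be assembled follows; the function and variable names, and the fixed seed, are hypothetical.

import random

def build_degenerate_negatives(niche_train_hs, n=50, seed=0):
    """Create the two kinds of synthetic negatives described above:
    HS-HS verbatim repetitions, and HS paired with another random HS."""
    rng = random.Random(seed)
    # verbatim repetition: same text as both HS and "CN"
    identical = [(hs, hs) for hs in rng.sample(niche_train_hs, n)]
    # hateful text in the CN slot: pair an HS with another random HS
    hateful_cn = [(hs, rng.choice(niche_train_hs))
                  for hs in rng.sample(niche_train_hs, n)]
    return identical + hateful_cn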

Models.
For the first model, we follow the standard sentence-pair classification fine-tuning schema of the original BERT study. First, the input HS-CN pair is represented as [CLS] HS [SEP] CN [SEP] and fed into BERT. Using the final hidden state of the first token [CLS], originally denoted as C ∈ R^H, we obtain a fixed-dimensional pooled representation of the input sequence. Then, a classification layer is added with parameter matrix W ∈ R^(K×H), where K denotes the number of labels, i.e. 2 for HS-CN classification. The cross-entropy loss is used during fine-tuning.
We conducted a hyperparameter tuning phase with a grid search over batch sizes 16 and 32, learning rates [4, 3, 2, 1]e-5, and numbers of epochs in the range 3 to 8. We obtained the best model by fine-tuning uncased BERT-large with a learning rate of 1e-5 and a batch size of 16, after 6 epochs, at the 1029th step, on a single GPU.
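For concreteness, the following sketch reproduces this sentence-pair fine-tuning schema with the Hugging Face transformers API; the tooling and the example strings are assumptions for illustration, not our exact implementation.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2)  # K = 2: proper vs. improper pair

# The tokenizer builds the [CLS] HS [SEP] CN [SEP] representation; the
# classification head on top of the [CLS] state corresponds to W in R^(K x H).
batch = tokenizer("a hate speech example", "a counter-narrative example",
                  truncation=True, padding="max_length", max_length=128,
                  return_tensors="pt")
labels = torch.tensor([1])  # 1 = proper HS-CN pair (label ids are assumptions)

outputs = model(**batch, labels=labels)
loss = outputs.loss  # cross-entropy loss, minimized during fine-tuning
loss.backward()      # an optimizer step would follow in a full training loop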
The second model is built by fine-tuning ALBERT, which shows better performance than BERT on inter-sentence coherence prediction by using a sentence-order prediction loss instead of next-sentence prediction. In the sentence-order prediction loss, while the positive examples are created, similarly to BERT, from consecutive sentences within the same document, the negative examples are created by swapping sentences, which leads the model to better capture discourse-level coherence properties (Lan et al., 2019). This objective is particularly suitable for the HS-CN pair classification task, since the order of HS and CN and their coherence are crucial. We fine-tuned ALBERT similarly to the BERT model, by adding a classification layer after the last hidden layer. We applied the same grid search used for BERT to fine-tune ALBERT-xxlarge, which contains 235M parameters. We saved a checkpoint every 200 steps and obtained the best model with a learning rate of 1e-5 and a batch size of 16, at the 1200th step.

Metrics. To find the best model for the machine reviewer, we compared the BERT and ALBERT models over the test set. Although it may seem more intuitive to focus on precision, since we search for an effective filter over many possible candidates, we observed that a model with very high precision tends to overfit on generic responses, such as "Evidence please?". Therefore, we aim to keep a balance between precision and recall, and opted for the F1 score for model selection. We report the best configurations for each model in Table 6, and the percentage of filtered pairs in Table 5.

NGO Operators Experiments
To verify that the author-reviewer approach can boost HS-CN data collection, we ran an experiment with 5 expert operators from an NGO. We compared the filtering strategies to reveal the best one, depending on several metrics and constraints.

Within-Subject Design. We administered lists of HS-CN pairs from each filtering condition to the 5 operators, and instructed them to evaluate/modify each pair in terms of the 'suitableness' of the CN to the corresponding HS.

Instructions. For each HS-CN pair we asked the operators: (a) if the CN is a perfect answer, to validate it without any modification; (b) if the CN is not perfect, but a good answer can be obtained with some editing, to modify it; (c) if the CN is completely irrelevant and/or needs to be completely rewritten to fit the given HS, to discard it.
Measurement. The main goal of our effort is to reduce the time needed by experts to produce training data for automatic CN generation. Therefore the primary evaluation measure is the average time needed to obtain a proper pair. The other measurements of interest are diversity and novelty, to understand how the reviewing procedure affects the variability of the obtained pairs.

Procedure and material. We gave the instructions along with a list of 20 exemplar HS-CN pairs for each condition (i.e. Reviewer≥1, Reviewer≥2, Reviewer_machine, Reviewer_expert). The condition order was randomized to avoid primacy effects. In total, each NGO operator evaluated 80 pairs. Pairs were sampled from the pool of 2700 pairs described before (apart from the automatic filtering condition). To guarantee that the sample was representative of the corresponding condition, we performed stratified sampling and avoided repeating pairs across subjects.

Regarding the diversity and novelty metrics, pre-filtering the author's output (Reviewer≥1, Reviewer≥2 and Reviewer_machine) has a negative impact: the more stringent the filtering condition, the higher the RR and the lower the novelty of the filtered CNs. We performed some manual analysis of the selected CNs and observed that, especially in the Reviewer≥2 case (the most problematic in terms of RR and novelty), there was a significantly higher ratio of "generic" responses, such as "This is not true." or "How can you say this about an entire faith?", for which reviewer agreement is easier to attain. The higher agreement on generic CNs thus translates into a negative impact on the diversity and novelty metrics. Conversely, the percentage of pre-filtered pairs accepted by the experts increases as the filtering condition becomes more stringent, the baseline being 45% for the Reviewer_expert condition.
As for the amount of operator effort, we observed a slight decrease in HTER with increasingly strict pre-filtering conditions, indicating an improvement in candidate quality. However, HTER scores were all between 0.1 and 0.2, well below the 0.4 acceptability threshold defined by Turchi et al. (2013), indicating that operators modified CNs only when they were "easily" amendable. Finally, we observe that, despite reducing output diversity and novelty, the reduction of expert effort achieved by Reviewer≥2, in terms of the percentage of obtained pairs, is not yet attainable by a machine. On the other hand, automatic filtering (Reviewer_machine) is a viable solution since (i) it helps NGO operators save time better than the human Reviewer≥1 filter, and (ii) it preserves diversity and novelty better than Reviewer≥2 and in line with Reviewer≥1.
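For reference, HTER is standardly defined as the Translation Edit Rate between a system output and its human post-edited version, here the generated CN and the operator's edit:

\[
\mathrm{HTER} = \frac{\#\,\mathrm{insertions} + \#\,\mathrm{deletions} + \#\,\mathrm{substitutions} + \#\,\mathrm{shifts}}{\#\,\mathrm{words\ in\ the\ post\text{-}edited\ CN}}
\]

so the observed scores of 0.1-0.2 correspond to roughly one or two edits every ten words.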

Conclusions
To counter hatred online and avoid the undesired effects that come with content moderation, intervening directly in the discussion with textual responses is considered a viable solution. In this scenario, automation strategies, such as natural language generation, are necessary to help NGO operators in their countering effort. However, these automation approaches are not yet mature, since they suffer from the lack of a sufficient amount of quality data and tend to produce generic/repetitive responses. Considering the aforementioned limitations, we presented a study on how to reduce the data collection effort using a mix of several strategies. To effectively and efficiently obtain varied and novel data, we first proposed the generation of silver counter-narratives using large-scale unsupervised language models, then a filtering stage by crowd-workers, and finally an expert validation/post-editing phase. We also showed promising results obtained by replacing crowd-filtering with an automatic classifier. As a final remark, we believe that the proposed framework can be useful for other NLG tasks, such as paraphrase generation or text simplification.

Figure 1: The author-reviewer configuration. The author module produces HS-CN candidates and the reviewer(s) filter them. Finally, an NGO operator validates and eventually post-edits the filtered candidates.

Table 2: Some examples of the categories relevant to our analysis: Hostile from the CRAWL dataset, Denouncing from CROWD, Fact (other) from NICHE.

Table 3: Comparison of the different approaches proposed in the literature according to the main characteristics required for the dataset.

Table 4: Evaluation results of the best author configurations on the different datasets. Novelty is computed w.r.t. the corresponding training set; RR is computed on the produced test output.

Table 5: Percentage of filtered pairs according to the various filtering conditions.

Table 6: F1, Precision and Recall results for the two main classifier configurations we tested.

Table 7: Results of HS-CN pair collection under the various configurations. RR for 'no suggestion' is computed on the NICHE dataset, and the time needed is the one reported in (Chung et al., 2019). Time is expressed in seconds. Pairs_selec indicates the percentage of original author pairs that were passed to the expert for reviewing; Pairs_final indicates the percentage of selected pairs that were accepted or modified by the experts themselves. Crowd time is computed considering that annotators gave a score every 35 seconds, and we required two judgments per pair.