At the Lower End of Language—Exploring the Vulgar and Obscene Side of German

In this paper, we describe a workflow for the data-driven acquisition and semantic scaling of a lexicon that covers lexical items from the lower end of the German language register, i.e., terms typically considered rough, vulgar or obscene. Since fine-grained distinctions between grades of obscenity can only inadequately be captured at the categorical level (e.g., obscene vs. non-obscene, or rough vs. vulgar), our main contribution lies in applying best-worst scaling, a rating methodology that has already been shown to be useful for emotional language, to capture the relative strength of obscenity of lexical items. We describe the empirical foundations for bootstrapping such a low-end lexicon for German by starting from manually supplied lexicographic categorizations of a small seed set of rough and vulgar lexical items and automatically enlarging this set by means of distributional semantics. We then determine the degrees of obscenity for the full set of all acquired lexical items by letting crowdworkers comparatively assess their pejorative grade using best-worst scaling. This semi-automatically enriched lexicon already comprises 3,300 lexical items and incorporates 33,000 vulgarity ratings. Using it as a seed lexicon for fully automatic lexical acquisition, we were able to raise its coverage to slightly more than 11,000 entries.


Introduction
With the rapid diffusion of social media in our daily lives, we currently experience (and many of us foster) a fundamental change of social communication habits. A main feature of this new era is an unprecedented degree of public exposure and visibility of individuals via (very) large and intentionally open networks of "friends" or "followers." Blogs, chat rooms and online fora constitute even more loosely connected social networks, with many weakly acquainted or even unknown interlocutors engaged in digital discourses. Unfortunately, malicious interactions are facilitated by the sheer mass of players in these networks and by easy ways of hiding real individual identities via nicknames or technically slightly more advanced means of camouflage, such as fake Web identities, including non-benevolent software agents and chatbots (McIntire et al., 2010).
Yet, how can we distinguish the sloppy colloquial language we all use here and there from explicitly abusive and unacceptable wording, i.e., the kind of linguistic behavior typically banned from civilized discourse? This is the topic we focus on in this paper.
The standard way to deal with this challenge is to define category systems (binary ones, such as obscene vs. non-obscene, or staged ones, as illustrated by pejorative vs. rough vs. vulgar) and to let people decide on the assignment of lexical items to these discrete categories. Once such categorical features are available, these lexical resources can be exploited for analytic purposes. Traditionally, these decisions were made by a few lexicographers, but this approach suffers from subjectivity and a lack of flexibility, since the lexicon of improper words is growing rapidly due to the productiveness of language and is thus changing almost every day.
Alternatively, a larger number of crowdworkers can be hired to provide such category assignments, which increases the level of objectivity (on the basis of inter-worker consensus) and currency (campaigns can be run without delay, on demand, with low budgets). Yet, crowdsourced assessments, just like lexicographers' judgments, inherently suffer from the problem of permeable and soft category boundaries: what is rough for one person may be vulgar for another, and vice versa.
We challenge the established view that the representation of obscenity in language is a discrete categorical classification problem, no matter which category system is chosen, and rather assume that it is a matter of differential degree. Accordingly, we describe the empirical foundations for bootstrapping and scaling such a lexicon from the low end of stylistic conventions on degrees of obscenity. We start from expert-level lexicographic categorizations of a small set of pejorative/rough/vulgar lexical items, enlarge this set by distributional semantics methods and then determine the degree of obscenity of the items assembled this way by letting crowdworkers make individual assessments relative to the semantic poles "neutral" and "vulgar" using a best-worst scaling approach (Kiritchenko and Mohammad, 2016, 2017).
The resulting lexicon targeting that lower end of the German language already comprises 3,300 lexical items, incorporates 33,000 human ratings, and serves as a seed lexicon for fully automatically acquiring and scoring new lexical items from the same register. After several iterations, we finally come up with VULGER, a lexicon of VULgar GERman, totalling slightly more than 11,000 entries.

Related Work
Lexicons covering offensive language are available almost exclusively for English. Perhaps the earliest collection of such lexical items (including phrases and multi-word expressions) is due to Razavi et al. (2010), who manually assembled approximately 2,700 dictionary entries. More recent work on an alternative verb-centered lexicon (its size is not specified) with a focus on hate speech is reported by Gitari et al. (2015). The currently largest and most up-to-date English lexicon of abusive words is provided by Wiegand et al. (2018a), who manually and automatically collected around 8,500 lexical items. 1 Languages other than English are incorporated in HURTLEX 2 (Bassignana et al., 2018), a multilingual lexical resource of words that hurt for 53 languages, among them Italian, Spanish, English and German. This lexicon grew out of a manual selection of roughly 1,000 Italian hate words originally organized around 17 categories, with a particular focus on derogatory words. It was further semi-automatically extended with complementary borrowings from the Italian MULTIWORDNET 3 and BABELNET. 4 HURTLEX additionally provides linguistic information (parts of speech, lexicographic definitions) for its lemmas. The lexicon integration step yields roughly 1,160 multilingual lexical items (with the help of the BABELNET API).
Manual curation (for the Italian portion) included a categorization step assigning each lemma sense to one of three categories: 'Not Offensive', 'Neutral' or 'Offensive'. In a subsequent step, the 'Neutral' category was split into 'Not Literally Pejorative' (insult by means of a semantic shift, e.g., metaphorically) and 'Negative Connotation' (not necessarily a direct derogatory use, but used in a derogatory way). Two-expert agreement plunged from 87.6% for the 3-category decisions to 61% for the extended 5-category decisions, a clear indicator that such categorical decisions are hard to make even for competent native speakers.
As far as canonical German lexical resources are concerned, their coverage at the low end of language is, not surprisingly, more than incomplete. In effect, GERMANET V13.0, 5 for instance, covers only 1,774 lexical items from our seed lexicon (3,300 lexical items, in total). Yet even this ratio is higher than for other lexical resources, such as HATEBASE, 6 a repository which covers 95 languages (with 2,691 hate terms), yet lists only 95 manually provided German hate speech entries.
In conclusion, the compilation of lexicons for offensive, abusive or hate language typically consists of two steps. First, already available lexical resources covering such pejorative lexical items are identified and bundled in a seed lexicon. Next, this seed is incrementally enlarged using additional lexical resources (such as WORDNETs, WIKTIONARY, or BABELNET) or employing some sort of machine learning process (Wiegand et al., 2018a). Yet, the semantic core of such lexicons is formed by (manual or automatic) assignments of either bi-polar (e.g., 'Offensive' vs. 'Non-Offensive') or multi-polar categories (e.g., 'Colloquial' vs. 'Rough' vs. 'Obscene').
As an alternative to this scheme, our work focuses on substituting discrete categorical decisions by continuous grading of the above distinctions based on Best-Worst Scaling (Louviere et al., 2015). We thus target a research desideratum already described by Schmidt and Wiegand (2017, p.3-4) in the following way: "Despite their general effectiveness, relatively little is known about the creation process and the theoretical concepts that underlie the lexical resources that have been specially compiled for hate speech detection."

(Tentatively) Characterizing Vulgar Language
In our study, we not only consider hate speech and abusive terms, but take a much broader perspective on the topic of offensive language and its lexicalizations. Still, this goal is very hard to characterize by distinctive criteria since many lexical-semantic dimensions seem to be involved and strongly interact.
Vulgar language, as we conceive it, is predominantly signalled by an overly lowered language register, the taboo layer, with disgusting and obscene lexicalizations generally banned from any type of civilized discourse. Primarily (yet not only), it addresses the lexical fields of sexuality (sexual organs and activities, in particular), as well as body orifices or other specific body parts (e.g., "Fresse" ("puss") as a negative denotation for "Gesicht/Mund" ("face/mouth")) and scatologic expressions. One often also observes meaning transfers from animals with culture-dependent negative connotations to humans (e.g., "Ratte" ("rat")). Pejorative words with marked negative connotation also play a significant role here (e.g., "abkratzen" ("croak")). Especially religious, ethnic and political orientations, the primary targets of hate speech, gain a strong vulgar status when they are combined with (animal-related) swearwords such as "Schwein" ("pig").
We are aware of the preliminary status of this characterization of vulgar language, but consider our work as a starting point for clarifying its nature and systematicity in more depth.

Lexicon Acquisition Method
Since a broad-coverage lexicon of obscene German (ranging on an interval from neutral to vulgar) is missing, we decided on a weakly supervised approach to lexicon acquisition based on bootstrapping. The overall workflow is fundamentally inspired by the work of Wiegand et al. (2018a), yet complements it by a hitherto unexplored methodology to scale the degree of obscenity of lexical items based on best-worst scaling. It consists of the following steps:
1. Language Resources: Select a seed lexicon (possibly combining several relevant resources) which contains a collection of lexical items already tagged as rough and vulgar. Typically, this step reuses manually pre-categorized lexical items (work typically due to experienced lexicographers). This lexical collection can further be enhanced by exploiting large-scale corpora, which can either be already annotated for (some degree of) vulgarity or lack any annotation of this kind, or representational derivatives thereof, such as (word) embeddings.
2. Human Assessment: Refine the seed lexicon by complementary human assessments of obscenity/vulgarity on the basis of crowdsourcing using differential best-worst scaling.
3. Machine Learning: Use the resulting lexicon, scored on a continuous neutrality-vulgarity scale, as training data for automatically identifying and scoring new, thematically relevant lexical items, ideally from corpora containing a high proportion of words with the property of interest (rough and vulgar wording).
Figure 1: Generic language-independent workflow for lexicon acquisition (in blue) and its instantiation for German (in green); solid blue arrows indicate control flow, data flow is represented by dashed blue arrows, green arrows and '+' stand for lexical data harvesting (with RegEx-style expressions for matching search terms), thin blue lines link particular choices to realizations (implementations) of the blue main components of VULGER's acquisition system
The first step of this workflow (illustrated in Fig. 1), consisting of the assembly of relevant lexical material from scratch, will be described in Section 5. The second one, adding human assessments for that lexical material, is dealt with in Section 6, while the third step, automatic lexicon enhancement, is described in Section 7.

Building the Seed Lexicon
From the German slice of WIKTIONARY, 7 we extracted all words marked as vulgar, rough and pejorative. 8 Additionally, we gathered entries tagged with corresponding categories 9 from the German OPENTHESAURUS. 10 As the focus of our corpus lies on single words 11 from a vulgar vocabulary, we excluded phrases from this processing step.
The list resulting from this first step contained some entries used as affixes in morphologically productive word formation, such as "*geil", "scheiß*", "Scheiß*" and "Drecks*", the latter ones denoting variants of "shit" and "dirt". We removed these bound forms from the list because there is no way to get meaningful judgments for them in isolation, due to the many possible combinations yielding highly diverse degrees of vulgarity.
In order to account for these terms in a reasonable way, we extended our list by harvesting rough and vulgar word forms concatenated with the affixes mentioned above from the CODE ALLTAG S+d email corpus 12 (Krieg-Holz et al., 2016), the DORTMUNDER CHAT KORPUS (Beißwenger, 2013) 12 and from entries in FASTTEXT (Grave et al., 2018) word embeddings, the latter being based on COMMON CRAWL and WIKIPEDIA.
Yet, we not only incorporated plain-text corpora and lexical items computationally derived from them (exploiting the FASTTEXT embeddings) into our study, but also included word embeddings as a representation format based on the distributional semantics hypothesis and computationally derived from corpora (see also Tulkens et al. (2016); Wiegand et al. (2018a)). Utilizing the word embeddings from the corpora mentioned above and the GENSIM module (Řehůřek and Sojka, 2010), we further generated, starting from the lexical seeds of the previous round, similar words, i.e., close semantic neighbors of these seed words, by iteratively lowering the similarity threshold until too much noise was returned (a common procedure, cf. also Tulkens et al. (2016)). We manually edited the resulting list with regard to inflected forms, misspellings and case sensitivity, but intentionally kept the 'lexical noise', i.e., presumably neutral words. Since we planned to annotate the lexical items identified this way by crowdsourcing in a later phase, these neutral words also help counterbalance the impact of rough and vulgar expressions during assessments. In total, this procedure yielded a seed lexicon with 3,300 entries.
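The neighbor-expansion step just described can be sketched as follows. This is a minimal illustration only: the actual study used GENSIM's similarity queries over FASTTEXT and corpus-trained embeddings, whereas the `expand_seeds` helper and the toy 2-dimensional vectors below are our own illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_seeds(seeds, embeddings, threshold):
    """One expansion round: collect every vocabulary word whose cosine
    similarity to some seed word reaches the threshold. Lowering the
    threshold across rounds admits more (and noisier) neighbors."""
    expanded = set(seeds)
    for seed in seeds:
        if seed not in embeddings:
            continue
        for word, vec in embeddings.items():
            if word not in expanded and cosine(embeddings[seed], vec) >= threshold:
                expanded.add(word)
    return expanded

# Toy 2-d embeddings; real vectors would be 100- or 300-dimensional.
toy = {
    "seed_a": [1.0, 0.1],
    "near_a": [0.9, 0.2],   # close semantic neighbor of seed_a
    "far":    [-0.2, 1.0],  # dissimilar word
}
print(sorted(expand_seeds({"seed_a"}, toy, threshold=0.9)))  # → ['near_a', 'seed_a']
```

Iterating this function with a decreasing threshold, and manually pruning the output each round, mirrors the procedure described above.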

Enriching the Seed Lexicon: Scaling Degrees of Vulgarity
We chose to annotate our seed words with Best-Worst Scaling (BWS) because it delivers high-quality annotations with only a relatively small number of annotation steps. BWS is an extension of the method of paired comparison to multiple choices, originally developed by Louviere et al. (2015) and introduced into NLP for emotion scaling by Kiritchenko and Mohammad (2016, 2017). For BWS, annotators are presented with n items at a time (an n-tuple, where n > 1, and typically n = 4). They then have to decide which item from the n-tuple under scrutiny is the best (highest in terms of the property of interest) and which is the worst (lowest in terms of the property of interest). In our case, judges had to select the most neutral and the most vulgar term per given n-tuple. We used the BWS tool 13 from Kiritchenko and Mohammad (2016, 2017) to generate 2N 4-tuples (N denotes the size of our seed lexicon) and thus came up with 6,600 4-tuples to be assessed. Tuples were produced randomly under the constraints that each term occurs in exactly eight different tuples, never twice within the same tuple, and that each tuple is unique.
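The tuple-generation constraints can be sketched as below. Note that the study used the ready-made BWS script by Kiritchenko and Mohammad; the `make_bws_tuples` function is a simplified stand-in that satisfies the counting constraints (each term in exactly `tuples_per_term` tuples, no term twice in a tuple) but, unlike the original tool, does not explicitly enforce tuple uniqueness, which holds with high probability for large vocabularies.

```python
import random

def make_bws_tuples(terms, tuples_per_term=8, tuple_size=4, seed=0):
    """Run `tuples_per_term` rounds; each round shuffles the vocabulary
    and chunks it into tuples, so every term lands in exactly
    `tuples_per_term` tuples and never twice in the same tuple.
    Assumes len(terms) is divisible by tuple_size."""
    rng = random.Random(seed)
    tuples = []
    for _ in range(tuples_per_term):
        pool = list(terms)
        rng.shuffle(pool)
        tuples.extend(tuple(pool[i:i + tuple_size])
                      for i in range(0, len(pool), tuple_size))
    return tuples
```

For N terms this yields N * 8 / 4 = 2N tuples, matching the 6,600 4-tuples reported for the 3,300-entry seed lexicon.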
For the annotation process proper, we used the crowdsourcing platforms FIGURE EIGHT 14 and CLICKWORKER, 15 where each n-tuple was assessed by five annotators. In order to obtain real-valued scores from the BWS annotations, we applied COUNTS ANALYSIS (Orme, 2009) 16 and thus got scores between +1 (most neutral) and −1 (most vulgar). A term's score was calculated by subtracting the percentage of times the term was chosen as worst from the percentage of times it was chosen as best. We computed the split-half reliability, following Kiritchenko and Mohammad (2017), by randomly splitting the annotations of each tuple into two halves, calculating scores independently for these halves and measuring the correlation between the resulting two sets of scores. We obtained an average Pearson correlation of 0.9102 (±0.0022) over 100 trials.
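The counts-analysis scoring rule (best percentage minus worst percentage) can be made concrete with a small sketch; the judgment format and the German example words are hypothetical.

```python
from collections import defaultdict

def counts_analysis(judgments):
    """judgments: list of (tuple_terms, best_term, worst_term) triples.
    A term's score is the fraction of its tuple occurrences in which it
    was chosen as best (most neutral) minus the fraction in which it
    was chosen as worst (most vulgar); the result lies in [-1, +1]."""
    best, worst, seen = defaultdict(int), defaultdict(int), defaultdict(int)
    for terms, b, w in judgments:
        for t in terms:
            seen[t] += 1
        best[b] += 1
        worst[w] += 1
    return {t: (best[t] - worst[t]) / seen[t] for t in seen}

# Two annotators judging the same toy 4-tuple (hypothetical words):
judgments = [
    (("Haus", "Tisch", "Mist", "Idiot"), "Haus", "Idiot"),
    (("Haus", "Tisch", "Mist", "Idiot"), "Tisch", "Idiot"),
]
scores = counts_analysis(judgments)
# "Idiot" is always worst, so it receives the extreme score -1.0.
```

Split-half reliability then amounts to running `counts_analysis` twice on disjoint halves of the judgments and correlating the two score dictionaries.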

Regression Models
In order to further extend the lexicon in a purely automatic way, and inspired by studies on automatic word emotion induction (especially by Li et al. (2017a) and Buechel and Hahn (2018)), we employed regression models to predict scores for input words. The seed words served as training and testing data for a linear regression and a ridge regression model (linear regression with L2 regularization during training). 17 As features for the words we used their respective word embeddings (this obviously excludes from further consideration lexical items for which no embeddings exist).
We experimented with different word embeddings.
We built 100-dimensional word embeddings from CODE ALLTAG XL (Krieg-Holz et al., 2016) using WORD2VEC (Mikolov et al., 2013) for all words occurring at least 3 times in CODE ALLTAG XL. Furthermore, we employed WORD2VEC word embeddings from Reimers et al. (2014) with a minimum word frequency of 5 and 100 dimensions (UKP), 300-dimensional FASTTEXT word embeddings from SPINNINGBYTES (Cieliebak et al., 2017) trained on German tweets (TWITTER) and, finally, FASTTEXT word embeddings (Grave et al., 2018) based on COMMON CRAWL and WIKIPEDIA (FASTTEXT). We also tried to utilize embeddings generated from the German TWITTER HATESPEECH corpora from Ross et al. (2016) and Wiegand et al. (2018b), under the assumption that they might contain a large number of rough and vulgar words. However, due to their small size and nevertheless high proportion of out-of-vocabulary words, we had to exclude both of these resources from further consideration. Table 1 shows that the ridge regression model performs on par with or slightly better than the linear regression model. Regarding the input features, the FASTTEXT token embeddings performed best (see Table 2).
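The ridge step itself is standard. Below is a minimal closed-form sketch on toy 2-dimensional "embeddings" (real features would be 100- or 300-dimensional vectors, and the paper does not specify its exact implementation; the helper names and toy data are our assumptions).

```python
def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression, w = (X^T X + alpha*I)^-1 X^T y,
    solved by Gaussian elimination (no intercept term, for brevity).
    X: one feature vector (word embedding) per row; y: BWS scores."""
    d = len(X[0])
    # Build A = X^T X + alpha*I and b = X^T y.
    A = [[alpha if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for row, target in zip(X, y):
        for i in range(d):
            b[i] += row[i] * target
            for j in range(d):
                A[i][j] += row[i] * row[j]
    # Gaussian elimination with partial pivoting.
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, d):
            f = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, d))) / A[i][i]
    return w

def ridge_predict(X, w):
    """Predict a neutrality/vulgarity score per input embedding."""
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]

# Toy training data on the neutral (+1) .. vulgar (-1) scale:
# the first embedding dimension loosely tracks neutrality, the second vulgarity.
X_train = [[1.0, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 1.0]]
y_train = [0.9, 0.8, -0.7, -0.9]
w = ridge_fit(X_train, y_train, alpha=0.1)
```

Larger `alpha` shrinks the weights toward zero, which is what makes ridge more robust than plain linear regression on small training sets like the 3,300 scored seed words.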

Applying Regression Models to Enhance the Lexicon
We used the best method (ridge regression with FASTTEXT embeddings) to extend our lexicon with three German swearword lists. 18 There is an overlap between swearwords and vulgar lexicalizations, but not every swearword has strong vulgar status, 19 e.g., "Schwein" ("pig"), a subtle distinction which our scaling approach accounts for (cf. also the remarks made in Section 3). We trained a ridge regression model on the seed words (cf. Sections 5 and 6), i.e., on their word embeddings and scores. This model was then applied to those input swearwords (from the three sources mentioned above) which do not already occur in the seed lexicon, predicting their neutrality/vulgarity scores on the basis of their word embeddings, provided that an embedding for the respective word was found in the FASTTEXT embeddings. 20 We excluded out-of-vocabulary words in order to avoid introducing too much noise, in terms of wrongly scored lexical items, into our lexicon; as a side effect, this also removes very rare words. After removing the words already contained in our seed lexicon and those without embeddings, we assembled 2,046 additional entries following this approach.
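The filtering and scoring logic for new candidate words can be summarized in a few lines; the `score_new_words` helper, the model stub and the toy embeddings are illustrative assumptions, not the paper's actual code.

```python
def score_new_words(candidates, seed_lexicon, embeddings, model):
    """Score only candidates that (a) are not already in the seed
    lexicon and (b) have an embedding; out-of-vocabulary words, and
    thereby also very rare words, are silently dropped."""
    scored = {}
    for word in candidates:
        if word in seed_lexicon or word not in embeddings:
            continue  # already covered, or OOV: skip to avoid noisy scores
        scored[word] = model(embeddings[word])
    return scored

# Hypothetical inputs: a trained-model stand-in and toy 1-d embeddings.
def toy_model(vec):
    return -vec[0]  # stand-in for the trained ridge regressor

toy_embeddings = {"mist": [0.8], "haus": [-0.9]}
seed = {"haus"}
result = score_new_words(["mist", "haus", "xyzzy"], seed, toy_embeddings, toy_model)
# "haus" is skipped (in seed), "xyzzy" is skipped (no embedding).
```

The same routine is reused below for the hate speech corpora, only with a different candidate list.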
Assuming that corpora for hate speech detection include a higher proportion of vulgar and rough words, we also made use of such datasets. There exist two publicly available German-language text corpora annotated for hate speech from which we extracted lexical material. The first of them, IWG HATESPEECH, originating from Ross et al. (2016), contains about 500 tweets which were annotated by two judges using a binary categorization scheme ("hate speech": Yes or No) and a 6-point Likert scale ranging from "not offensive" to "very offensive". 21 The second corpus, collected by Wiegand et al. (2018b), contains more than 8,500 tweets and was compiled for GERMEVAL 2018, a challenge task addressing the recognition and classification of offensive German language. 22 The latter corpus was coarsely annotated with binary 'Offense' and 'Other' categories, but it also comes with a 4-way classification schema where, besides the non-offensive 'Other' class, 'Offense' was subdivided in three ways: 'Profanity' (no intent to insult someone, yet the lexical choice is negatively marked, with swearwords such as the scatologic "Scheiße" ("shit")), 'Insult' (clear intent to offend someone) and 'Abuse' (an even stronger form of 'Insult', i.e., an abusive utterance that degrades a target person/group by ascribing a social identity to a person/group that is judged negatively by a (perceived) majority of society).
19 Also not every vulgar word is a swearword.
20 We also checked for different spellings regarding case sensitivity.
21 The corpus is available at https://github.com/UCSM-DUE/IWG_hatespeech_public
22 The corpus is available at https://projects.cai.fbi.h-da.de/iggsa/
From these two corpora we extracted words from all tweets marked as 'Offense' = 'YES' by one of the annotators and further removed stop words, hashtags, words containing non-alphabetic characters (hyphens excepted), and words shorter than four characters. We also tried to lemmatize the words 23 and normalize spellings with regard to case, but admittedly introduced some noise into our input words, i.e., some inflected forms and other forms of semantic duplication could not be normalized. After excluding words already present in the seed lexicon or in the German swearword lists, we applied the same procedure as used for the swearwords and obtained another 5,700 newly scored lexical entries.
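The candidate-extraction filters just listed can be sketched as follows. The tiny stopword list, the regular expression and the `extract_candidates` helper are illustrative assumptions; the actual pipeline additionally attempted lemmatization, which is not shown here.

```python
import re

STOPWORDS = {"und", "der", "die", "das", "ein", "ist"}  # tiny illustrative list

def extract_candidates(tweets, min_len=4):
    """Tokenize offensive tweets and keep candidate words: drop
    stopwords, hashtags and mentions, require purely alphabetic tokens
    (hyphens allowed) of at least `min_len` characters."""
    candidates = set()
    for tweet in tweets:
        for token in tweet.split():
            if token.startswith("#") or token.startswith("@"):
                continue  # hashtags and user mentions
            token = token.strip(".,!?\":;()").lower()
            if len(token) < min_len or token in STOPWORDS:
                continue
            if not re.fullmatch(r"[a-zäöüß]+(-[a-zäöüß]+)*", token):
                continue  # non-alphabetic characters other than hyphens
            candidates.add(token)
    return candidates

print(sorted(extract_candidates(["Das ist #quatsch totaler Mist!"])))
```

The surviving candidates are then scored with the ridge model, exactly as done for the swearword lists.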
Due to the lack of better resources, we tried to measure the reliability of the resulting scores in a preliminary way by calculating the correlation between a word's score and the probability of the word occurring in an offensive post. We obtained a Pearson correlation coefficient of only −0.35, probably also caused by many words occurring just once, though the correlation may also be inherently weak. In future work, we plan to evaluate the automatically determined extension of our seed lexicon by feeding the lexical items back into another crowdsourcing round and determining the correlation between the human assessments and the automatically derived scores.
The final version of VULGER, a lexicon with VULgarity ratings of GERman words, enhanced with swearwords and words from the two hate speech corpora, in the end comprises 11,046 entries (see Table 3).

Conclusion
In this paper, we are concerned with the lexical segment at the lower stylistic end of natural language, often referred to as rough, vulgar or obscene. This register typically covers very explicit and rude linguistic expressions (taboo words). Standard lexical repositories have mostly neglected these expressions on purpose, although a pressing need for covering them can now be identified, e.g., for the purpose of identifying and neutralizing or blocking offensive and humiliating utterances in social media. Our workflow for building such a lower-end lexicon is based on three steps: assembling already existing lexicons (or fragments thereof) for this stylistic subvariety of language, assigning degrees of vulgarity to each lexical item included, and using this seed for continuous automatic enhancement by weakly supervised machine learning procedures.
As far as the representation of the semantics of these lexical items is concerned, we propose a continuous grading system to substitute the overly simplistic discrete categorical schemata which have prevailed so far. Still, the claim that such a fine-grained representation is helpful at all must be demonstrated by future experiments. In any case, we plan to apply and iteratively extend our newly developed lexicon on text corpora with a similar bias toward pejorative language (including scores for obscenity). However, merely (automatically) extending a specialized lexicon might not necessarily prove beneficial, as evidenced by the results of Tulkens et al. (2016), who observed no performance boost for a system using such an extended dictionary, at least for detecting Dutch racist language.
In order to bypass the sparse-data problem, methods like transfer learning might also be appropriate here (Sahlgren et al., 2018). Still, the validity of these new items and their scores has to be experimentally confirmed, e.g., by feeding newly found lexical material back to annotators and comparing their judgments with automatically predicted ones.
We are also aware of the fact that purely lexically driven approaches to account for obscene, offensive or vulgar language may not be sufficient to solve the recognition problem completely and that a broader discourse context has to be taken into account, as well as the linguistic conventions in different communities (Owsley Sood et al., 2012). Still, a lexicon of significant size and quality might form the backbone for machines sensitive to rude and vulgar language.