Oxymorons: a preliminary corpus investigation

This paper contains a preliminary corpus study of oxymorons, a figure of speech so far under-investigated in NLP-oriented research. The study resulted in a list of 376 oxymorons, identified by extracting a set of antonymous pairs (under various configurations) from corpora of written Italian and by manually checking the results. A complementary method is also envisaged for discovering contextual oxymorons, which are highly relevant for the detection of humor, irony and sarcasm.


Why oxymorons?
The growing body of research on figurative language in NLP has recently expanded from more traditional domains (like idioms, metaphors, metonymy) to other ubiquitous phenomena such as irony, puns and sarcasm. However, other (supposedly more marginal) figures of speech have received less attention so far. The oxymoron is a case in point.
The oxymoron, which has been studied mainly in rhetoric and literature, is "[a] figure of speech in which a pair of opposed or markedly contradictory terms are placed in conjunction for emphasis", although its meaning has expanded to comprise more generally "a contradiction in terms" (definitions from OED: www.oed.com). Typical examples of oxymorons would be deafening silence, sweet sorrow or awfully good.
The oxymoron is closely intertwined with the semantic relation of antonymy: it is often the union of antonymous items that creates the oxymoron's paradoxical effect, thus generating a new meaning, which often heavily depends on context. As Gibbs & Kearney (1994: 86) observe, "[u]nderstanding oxymora requires that people access relevant world knowledge to constrain their creative interpretations of seemingly contradictory concepts".
In this paper, we set out a preliminary investigation of oxymorons based on naturally occurring data from Italian, with a view to contributing to the NLP-oriented research on figurative language by supplying an initial list of oxymorons and oxymoronic structures that can be used for further analyses and for evaluation tasks. The questions that drove our investigation are: What kind of oxymorons do we find in common language? What syntactic constructions are involved in their creation? And above all, how can we detect them in corpora?
In Section 2 we describe the methodology we used to detect oxymorons in corpora of contemporary written Italian. Although our study is based on Italian, the procedure we followed can easily be extended to other languages. In Section 3 we illustrate the main results of our analysis. The full list of oxymorons detected is available in the Appendix. Section 4 outlines another promising complementary methodology to be pursued in future research. Finally, Section 5 discusses possible further developments, with special attention to the challenges oxymorons pose for automatic identification and extraction.

Methodology
The procedure we devised to track down oxymorons in corpora of written Italian stems from the observation that these constructions are closely connected with antonymous pairs, which have been the subject of several studies based on their co-occurrence in texts (cf. e.g. Charles & Miller 1989; Justeson & Katz 1991; Lobanova 2012; Kostić 2017).
Our starting point was Jones' (2002) analysis of English antonyms, which makes use of a list of canonical antonymous pairs to be searched for in a corpus of texts from The Independent (approx. 280M words). We therefore translated Jones' antonymous pairs into Italian and made a selection out of this set, driven mainly by the exclusion of predicative (e.g. confirm ~ deny) and adverbial (e.g. badly ~ well) couples.
This resulted in a list of 17 noun ~ noun antonymous pairs, displayed in Table 1. Then we designed an inventory of potential oxymorons. All the constructed couples were searched, as lemmas, in two large corpora of contemporary written Italian: Italian Web 2016 (itTenTen16, through the SketchEngine platform:
The above-mentioned inventory of potential oxymorons was built in the following way.
First, we matched each noun of each pair (e.g. odio 'hate') with its antonym (amore 'love') in either adjectival (amoroso / amorevole 'loving') or verbal (amare 'to love') form. With this first round of extractions, we obtained combinations such as odio amoroso (lit. hate lovely) and amorevole odio (lit. lovely hate) 'loving hate', as well as l'amore odia 'love hates', although sequences containing verbs were quite uncommon. However, the search for lemma verbs retrieved also a number of participial forms used as adjectives (as in amore odiato 'hated love', where odiato is the past participle of odiare 'to hate').
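The first round of extractions described above can be sketched as a small candidate generator. This is only an illustration, not the procedure actually implemented for the study: the `DERIVED` lexicon, the function name `candidate_queries` and the pattern labels ("N A", "A N", "N V") are assumptions introduced here, and only the lemmas mentioned in the text are included.

```python
# Hypothetical mini-lexicon mapping a noun to its adjectival and verbal lemmas.
# Only forms mentioned in the text are listed; the data structure is an assumption.
DERIVED = {
    "amore": {"adj": ["amoroso", "amorevole"], "verb": ["amare"]},
    "odio": {"adj": [], "verb": ["odiare"]},
}


def candidate_queries(pair, derived=DERIVED):
    """Yield (noun, modifier_lemma, pattern) triples for both directions of a pair.

    Each noun is matched with the adjectival or verbal form of its antonym.
    The pattern labels are illustrative, not the paper's official inventory.
    """
    first, second = pair
    for noun, antonym in ((first, second), (second, first)):
        forms = derived.get(antonym, {})
        for adj in forms.get("adj", []):
            yield (noun, adj, "N A")  # e.g. odio amoroso
            yield (noun, adj, "A N")  # e.g. amorevole odio
        for verb in forms.get("verb", []):
            yield (noun, verb, "N V")  # e.g. l'amore odia


if __name__ == "__main__":
    for query in candidate_queries(("odio", "amore")):
        print(query)
```

Each triple would then have to be turned into a lemma query for the corpus interface in use, and the hits checked manually, as in the study.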
In addition, to enrich data retrieval, we selected lexemes semantically related to the members of the antonymous pairs in Table 1 (synonyms, hyponyms, etc.) from the Grande Dizionario Analogico della Lingua Italiana (Simone 2010). This step was inspired by Shen's (1987: 109) definition of indirect oxymoron, i.e. an oxymoron where "one of [the] two terms is not the direct antonym of the other, but rather the hyponym of its antonym" (like whistling silence, where whistling is a type of noise). Related lexemes were also searched for in the two above-mentioned corpora. For the sake of exemplification, we illustrate some of these paradigmatic expansions for the following three antonymous pairs, for which we retrieved a considerable amount of data:
The final phase of the analysis involved querying the Sketch Engine Word Sketch tool, which describes the collocational behavior of words by using statistical association measures to show the lexemes that most typically co-occur with them within specific syntagmatic contexts. In this case, we searched for all nouns participating in the antonymous pairs in Table 1 and manually reviewed all top results provided by the Word Sketch function (thus focusing on the most statistically significant combinations).
Besides oxymorons that we had already retrieved with the previous procedure (e.g. the very frequent silenzio assordante 'deafening silence'), this method allowed us to identify new configurations, for instance sentential patterns where the two opposite nouns are linked by the copula è 'is' (e.g. la luce è tenebra 'light is darkness') or prepositional phrases where the two opposite nouns are linked by a preposition (e.g. il fragore del silenzio 'the racket of silence').
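The paradigmatic expansion towards indirect oxymorons can be illustrated with a minimal sketch. The `RELATED` table and the function `expand_pair` are hypothetical; the two related lexemes listed for rumore are taken from examples cited in this paper (fragore, boato), not from the actual dictionary-based selection, which was done manually.

```python
# Hypothetical table of semantically related lexemes (synonyms, hyponyms, etc.).
# In the study these were selected by hand from a dictionary; the entries below
# are only the ones that surface in the paper's own examples.
RELATED = {
    "rumore": ["fragore", "boato"],
    "silenzio": [],
}


def expand_pair(pair, related=None):
    """Return all (left, right) combinations of a pair and its related lexemes.

    The direct antonymous pair comes first; the remaining combinations are
    candidates for indirect oxymorons in Shen's sense.
    """
    related = RELATED if related is None else related
    first, second = pair
    left = [first] + related.get(first, [])
    right = [second] + related.get(second, [])
    return [(x, y) for x in left for y in right]


if __name__ == "__main__":
    for combo in expand_pair(("silenzio", "rumore")):
        print(combo)
```

The expanded combinations would feed the same corpus searches as the direct pairs, which is why pairs with more related lexemes can be expected to yield more candidates.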

Results
The multiple-step procedure described in Section 2 resulted in a final list of 376 oxymorons, the first of its kind in Italian, to the best of our knowledge. The full dataset (Italian oxymorons 1.0) is provided in the Appendix and released as an Excel file through the University of Bologna Institutional Research Repository (AMSActa): http://amsacta.unibo.it/id/eprint/6388
Around 20% of the oxymorons were found in both corpora, whereas the vast majority (almost 80%) was retrieved in itTenTen16, which is much larger than CORIS (4.9 billion vs. 150 million words).

Syntactic structure
We classified the 376 oxymorons according to their syntactic structure. A quantitative summary is given in Table 2, which reports, for each of the 9 structures we could identify, the number of oxymorons with that structure and the number of antonymous pairs that generate oxymorons with that structure.
We found mostly past participles (see the aforementioned examples), but also a few present participles (e.g. prossimità distanziante 'distancing proximity').

The third most frequent structure in terms of number of oxymorons is Sentence (S). The examples belonging to this class emerged especially (though not entirely) through the exploration of the Word Sketch function (cf. Section 2). The antonymous nouns in our pairs were found both in copular sentences (e.g. l'amore è odio 'love is hate', il silenzio è rumore 'the silence is noise') and in subject-verb sentences (e.g. il silenzio grida 'the silence screams', il buio illumina 'the dark illuminates (something)'). Some sentence-level oxymorons are borderline cases, since they could be argued to qualify as paradoxes rather than oxymorons (e.g. il buio è luce 'the dark is the light'). We decided to keep them, since the divide between oxymorons and paradoxes is not so clear(-cut): as Flayih (2009) claims, the "oxymoron is sometimes taken as 'condensed paradox' and paradox as 'expanded oxymoron'".
We also retrieved a considerable number of adverbial oxymorons of the [Adv A] type (e.g. allegramente depresso 'cheerfully depressed', luminosamente oscuro 'brightly dark'), which is relevant also in terms of the number of antonymous pairs it represents (16 out of 17, cf. Table 2), and some [N Prep N] oxymoronic expressions, such as la tenebra della luce 'darkness of the light'. As for the latter category, all examples contain the preposition di 'of', except for il silenzio nel rumore 'the silence into the noise', which contains in 'in'.
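A summary of the kind reported in Table 2 (oxymorons per structure and antonymous pairs per structure) can be computed from an annotated list along the following lines. This is a sketch under stated assumptions: the record layout, the label "N A" and the assignment of each example to a pair are illustrative and do not reproduce the released dataset.

```python
from collections import defaultdict

# Illustrative records: (oxymoron, antonymous pair, syntactic structure).
# The pair names and the "N A" label are assumptions made for this example.
RECORDS = [
    ("silenzio assordante", "silence ~ noise", "N A"),
    ("il silenzio è rumore", "silence ~ noise", "S"),
    ("l'amore è odio", "love ~ hate", "S"),
    ("allegramente depresso", "happiness ~ unhappiness", "Adv A"),
    ("la tenebra della luce", "light ~ darkness", "N Prep N"),
]


def summarize(records):
    """Map each structure to (number of oxymorons, number of distinct pairs)."""
    counts = defaultdict(int)
    pairs = defaultdict(set)
    for _oxymoron, pair, structure in records:
        counts[structure] += 1
        pairs[structure].add(pair)
    return {s: (counts[s], len(pairs[s])) for s in counts}


if __name__ == "__main__":
    for structure, (n_oxy, n_pairs) in sorted(summarize(RECORDS).items()):
        print(f"{structure}: {n_oxy} oxymorons from {n_pairs} pairs")
```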

Antonymous pairs
As for the antonymous pairs taken into consideration (Table 1), we observe a rather unequal distribution in our data in terms of their ability to create oxymorons. Some pairs, such as caldo ~ freddo (hot ~ cold), silenzio ~ rumore (silence ~ noise) or felicità ~ infelicità (happiness ~ unhappiness), generate a high number of oxymoronic constructions, whereas others are definitely less exploited, like coraggio ~ paura (bravery ~ fear), guerra ~ pace (war ~ peace) or leggerezza ~ pesantezza (lightness ~ heaviness). Overall, the pairs with the higher number of oxymorons are also those displaying a wider array of syntactic structures (Table 2), but the differences are not so great. All the pairs are represented by at least 3 structures (out of 9 possibilities).
The complete quantitative picture is given in Table 3, where antonymous pairs are reported in English translation (as in the Appendix) for convenience.
At first glance there does not seem to be a strong and clear driving principle behind the unequal distribution of the pairs in terms of semantics. For instance, we find abstract concepts (e.g. happiness ~ unhappiness, justice ~ injustice) or sensorial concepts (e.g. hot ~ cold, lightness ~ heaviness) at various points of the list in Table 3. However, a more fine-grained semantic analysis and a larger dataset would be necessary to draw more solid conclusions.
Of course, the higher number of semantically related lexemes investigated for some pairs plays a clear role (cf. Section 2). Also the entrenchment of some notorious cases might be relevant. Take for instance silenzio assordante 'deafening silence', which occurs 2564 times in the itTenTen16 corpus (plus 1517 times in the reverse adjective-noun order: assordante silenzio). The high token frequency of this specific oxymoron may favor the creation of new oxymorons in the very same conceptual domain (silence ~ noise).

Morphosyntactic variability
Although the morphosyntactic variation of oxymorons is not the focus of the present study, we can preliminarily observe that, according to the data we have collected so far, oxymorons are rather flexible structures. In other words, contrary to many multiword expressions (cf., among many others, Sag et al. 2002), oxymorons seem to show a low degree of fixedness.
Many combinations of a noun and an adjective are attested in both orders, although with different frequency (remember that noun-adjective is more neutral than adjective-noun): see the couple silenzio assordante vs. assordante silenzio mentioned at the end of Section 3.2, or tenebra luminosa (9 tokens in itTenTen16) vs. luminosa tenebra (4 tokens in itTenTen16) 'bright shadow'. In [N Prep N] oxymorons, the second (non-head) noun may occur in both singular and plural (contrary to Italian multiword expressions belonging to the same pattern, where the non-head noun is morphologically fixed, cf. Masini 2009), although the singular form is generally preferred, e.g.: suono del silenzio 'sound of the silence' (380 tokens in itTenTen16) vs. suono dei silenzi 'sound of the silences' (7 tokens in itTenTen16).
As for sentential oxymorons, we often found different configurations for the same pair of items; for instance, silenzio 'silence' and rumore 'noise' are found as: il silenzio è un rumore (che…) 'silence is a noise (that…)', il silenzio è rumore 'silence is noise', il silenzio è il rumore (di…) 'silence is the noise (of…)'. All these variants are reported as a single entry (il silenzio è rumore 'silence is noise') in the Appendix.

A complementary method
Another complementary technique to harvest oxymorons from corpora is, quite trivially, to search for the word ossimoro 'oxymoron'. We noticed that, when they come across or use (what they believe to be) an oxymoron, speakers tend to comment on it metalinguistically, as in: una guerra santa (grandissimo ossimoro) 'a holy war (huge oxymoron)' (from CORIS). This behavior allows us to detect oxymorons which would probably be missed otherwise.
To test this method we searched for the string [.*ossimor.*] in CORIS (retrieving 223 hits). Upon preliminary manual checking, we found a number of valid oxymorons. A few were already included in our list, like tenebra luminosissima 'very bright shadow' (cf. tenebra luminosa 'bright shadow' in the Appendix). A few others were examples related to one of the antonymous pairs we considered (cf. Table 1) that were not retrieved by our method, like boato di silenzio 'roar of silence' (belonging to the silenzio ~ rumore 'silence ~ noise' pair).
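Outside a corpus query interface, the same metalinguistic search can be approximated on plain text with a regular expression. The sketch below is an assumption-laden stand-in for the corpus query: the token-level pattern `\w*ossimor\w*` mimics the word-form query [.*ossimor.*], the function name `find_mentions` is hypothetical, and the sample lines (apart from the CORIS example quoted above) are invented for illustration.

```python
import re

# Token-level approximation of the word-form query [.*ossimor.*]:
# any word containing the stem "ossimor" (ossimoro, ossimori, ossimorico, ...).
PATTERN = re.compile(r"\w*ossimor\w*", re.IGNORECASE)


def find_mentions(lines):
    """Return (line_index, matched_word_forms, line) for lines mentioning the stem."""
    hits = []
    for index, line in enumerate(lines):
        forms = PATTERN.findall(line)
        if forms:
            hits.append((index, forms, line))
    return hits


if __name__ == "__main__":
    sample = [
        "una guerra santa (grandissimo ossimoro)",  # example quoted from CORIS
        "un testo senza figure retoriche",          # invented, no mention
        "Due ossimori in una frase",                # invented
    ]
    for index, forms, line in find_mentions(sample):
        print(index, forms, line)
```

The hits only locate the metalinguistic comment; identifying which expression in the surrounding context is the oxymoron still requires manual checking, as in the test on CORIS.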
Most cases, however, were completely new oxymorons. Some could in principle have been identified with our method based on antonymous pairs (e.g. normalmente eccezionali 'normally exceptional', piangendo rido 'I laugh crying', caoticamente ordinate 'chaotically tidy'), but many others turned out to be hardly foreseeable and heavily dependent on context and on the speaker's beliefs and communicative intentions. Therefore, this method might be especially promising for identifying creative, highly contextual oxymorons. In this respect, social media platforms are definitely one of the sources to be used for this kind of research, which we leave for future studies.

Future challenges
This short paper illustrates the results of an investigation aimed at tracking down oxymoronic constructions in corpora of written Italian. Far from being exhaustive, this is just a preliminary attempt, resulting in an initial list of 376 oxymorons that needs to be enriched.
On the one hand, this enrichment may be pursued by adding more antonymous pairs under the current approach, which however heavily relies on manual work at various stages and would benefit from a higher degree of automation.
On the other hand, different methods should be devised and tested for the (we suspect) wide set of oxymorons that cannot be traced back to antonymous pairs. For instance, some oxymorons actually contain synonymous or even identical elements, within specific structures, such as suoni senza suono 'sounds without sound' (retrieved from itTenTen16). More interestingly, as Gibbs (1993: 269) observes, "oxymora are frequently found in everyday speech, many of them barely noticed as such, for example, 'intense apathy,' 'internal exile,' 'man child,' 'loyal opposition,' 'plastic glasses,' 'guest host,' and so on. The ubiquity of these figures suggests some underlying ability to conceive of ideas, objects, and events in oxymoronic terms." But, we add, it also suggests that the detection of these tropes is yet another challenge for automatic extraction, especially considering that some are heavily contextual.