Exploring Lexical-Semantic Knowledge in the Generation of Novel Riddles in Portuguese

We describe an effort towards the automatic generation of novel riddles in Portuguese, ultimately with humour value. Riddle generation fits in the common architecture of an NLG system and may follow different models, described here, all based on features of a concept, acquired from a lexical-semantic knowledge base. Generated riddles were assessed by human judges, who rated them as fairly interpretable, surprising, and novel, even if with low humour potential.


Introduction
To act naturally, computers should be able to entertain, e.g., with the creation of wordplay or humour. This paper is about the automatic generation of novel riddles, which would ideally be apt for humorous contexts. Riddles are a kind of linguistic puzzle, present in most cultures and languages. They can be posed as a question, followed by a short hiatus (allowing the audience to think), and finally an answer that works as a punchline. Their generation is closely related to the topic of Computational Humour (Ritchie et al., 2006), which studies the utilisation of humour by computers, with value both for Natural Language Understanding, specifically in developing models for automatically recognising and understanding humour (Mihalcea and Strapparava, 2006; Yang et al., 2015), and for Natural Language Generation (NLG), in developing systems that produce verbal humour (see section 2), which is our case.
Both riddles and humour have been generated by others, following different approaches, but often in English. Our work is inspired by those approaches and represents an effort towards the development of a riddle generator in Portuguese. We exploit available lexical resources for this language, mainly a semantic knowledge base, and implement six models for generating novel riddles, hopefully funny, due to the introduction of potentially incongruent wordplay, forcing the reinterpretation of known concepts or the creation of new ones. This is a follow-up, as we have previously proposed the six generation models in a single-page paper (Gonçalo Oliveira and Rodrigues, 2018) and analysed the results of different word sources, relations, presentation modes, and automatic scoring, for two of those models (Gonçalo Oliveira and Rodrigues, 2018b).
We frame our current results as piadas secas (roughly, dry jokes) which, in Portugal, are a kind of joke that may be presented as a riddle, with a question and a short not-so-funny, obvious or nonsensical answer, taking advantage of the anti-climax to make people laugh. In 2017, this kind of joke seemed to make a comeback, as such jokes were used in several television shows and YouTube videos, and re-compiled in websites or edited books (Pinto et al., 2017). This is why the system was baptised SECO (dry).
The remainder of the paper starts with a brief overview on related work, covering the generation of riddles and verbal humour. The steps for the generation of riddles with the help of a knowledge base are then described, with some examples. Before concluding, we present the results of human validation of generated riddles, illustrated with additional examples. According to human subjects, riddles are accessible, surprising and novel, but, on average, not so funny.

Related Work
Riddles have long been a research topic. Georges and Dundes (1963) define them as "a traditional verbal expression which contains one or more descriptive elements, a pair of which may be in opposition; the referent of the elements is to be guessed." Palma and Weiner (1992) confirm that lexical ambiguity, arising, for instance, from polysemy or homophony, is paramount to their creation, alongside the association of words to ad-hoc categories, according to their multiple meanings.
Riddle generation by computer programs has been addressed since the 1990s, with the seminal work of Binsted and Ritchie (1994), who developed JAPE, a system that generates punning riddles based on a syntactic and semantic lexicon; a set of schemata for combining two words based on their lexical or phonetic relationship; and a set of templates that render riddles as text. An example of a riddle would be "What do you call a murderer that has fibre? A cereal killer." STANDUP (Manurung et al., 2008) adopts a similar approach, but is more cautious about the words used, which are restricted at various levels for a better output and suitability for the intended audiences.
TheRiddlerBot (Guerrero et al., 2015) generates riddles about famous characters. After selecting a well-known name from a knowledge base: associated features are retrieved; analogous characters, with common features, are identified; a textual template is selected for rendering the riddle, based on some of the features or the analogy; the riddle is posted on Twitter; and aliases are retrieved from Wikipedia, so that users may answer with the name of the selected character or one of its aliases. An example of a riddle for the Joker would be: "Tell me the name of a person that is the Morpheus of The Dark Knight Rises, is criminal, playful yet cruel, has been seen wearing a purple topcoat." Riddles have also been generated from word associations (Galvan et al., 2016). Given a concept: its possible categories are obtained from a creative thesaurus; associated modifiers are also retrieved and one is randomly selected; new categories, with which the selected modifier is associated, are retrieved; a final category is composed by combining the selected modifier with one of the new categories; a concept of the final category is used to fill a text template. An example of a riddle for the sun would be "What is as hot as soup?" Unlike JAPE and STANDUP, this system and TheRiddlerBot do not tackle the humour aspect specifically.
The ConceptNet semantic network was exploited for the generation of verbal humour, including riddles (Labutov and Lipson, 2012). For this purpose, different, but overlapping, paths between the same two concepts are aligned with a surface template that maximises inter-path incongruity. For question-answer riddles, the question mentions two concepts from different paths, but the same domain, while the concept in the answer is in one of the paths, but from a different domain. An example of a riddle is "Why is the computer in hospital? Because the computer has virus." Besides riddles, there is work on the generation of other kinds of verbal humour, including funny acronyms, or short messages (Valitutti et al., 2016). Both explore lexical replacement for potentiating humour. Replacement words are constrained by the original form (same initial letter or similar sound) and possibly other humour-specific features (e.g., taboo).
All of the previous works generate riddles or humour based on some theory and knowledge resources, essential for their goal. They share a number of similarities with ours, but they all produce text in English. A few exceptions generate humour in other languages (e.g., Sjöbergh and Araki (2007)), but none generates riddles in Portuguese. In this language, however, there is work on the automatic generation of Internet humour (Gonçalo Oliveira et al., 2016), based on an image macro and a line of text.
Humour occurs when there is a break of conventionality in language. Understanding it is a sign of fluency (Tagnin, 2005), which explains the interest of computational linguistics in this topic. Humour is generally based on four linguistic phenomena, at the written and oral levels (Tagnin, 2005), namely: homonymy (words with the same spelling and sound, e.g., 'band', musical group or ring); homophony (words with the same sound, but different spellings, e.g., cent and scent); polysemy (words with the same spelling and sound, but multiple related meanings, e.g., wood, timber or forest); and paronymy (words with similar spellings or sounds, e.g., collision and collusion). As such, its success depends highly on how proficient the audience is in the language used.
According to Attardo (2008), humorous plots can rely on one of the following: a punch line (typical jokes); a meta-narrative disruption, introducing familiar humorous references; or an otherwise normal situation, where some elements may cause laughter by the shortcomings of the agents.

Riddle Generation Approach
SECO explores lexical resources in Portuguese for generating word and feature combinations. Those can be rendered as novel riddles, to be used for human entertainment purposes, and can ultimately have some humour value, and thus be seen as punning riddles. If we see them as jokes, following Attardo (2008)'s classification, SECO falls under the first type, as it presents a humorous plot (question) with a punchline (answer).
As in JAPE (Binsted and Ritchie, 1994), riddle generation may follow different implemented models, all resorting to a (lexical-semantic) knowledge base (KB), in order to acquire features of given lexicalised concepts. A rough parallel can be drawn between our approach for generating riddles and the common architecture of an NLG system (Reiter and Dale, 2000), as it encompasses the following steps:
• Model Instantiation & Feature Acquisition (roughly, Content Determination): sets the model and exploits the KB for a combination of initial concept and related features;
• Riddle Creation (roughly, Microplanning): selects the appropriate words for denoting different types of feature and sets how the selected combination is going to be presented as text, i.e., as a definition or as a question-answer pair;
• Rendering (Surface Realisation): renders the combination as text, after some adaptations that make it more natural.
This section describes the previous steps in more detail and ends with some examples.

Model Instantiation & Features
All implemented generation models start with an initial concept with two detachable parts (c_1, c_2), either lexicalised as a compound (e.g., human rights) or a single word that, based on its orthography, may be divided in two (e.g., knowledge = know + ledge). Each part is considered individually and features are retrieved from the KB, some involving the first part of the concept (in set F_1), others the second part (F_2). Features are represented as triples of the kind a relatedTo b, where a and b are words and relatedTo is the name of a semantic relation between meanings of a and b (e.g., animal hypernymOf dog). As such, every feature f_1 ∈ F_1 and f_2 ∈ F_2 will consist of a relation involving, respectively, c_1 and c_2.
After this, features f_1 ∈ F_1 are paired with features f_2 ∈ F_2. The result is a set of combinations of the initial concept with pairs {fw_1, fw_2}. Figure 1 illustrates this step, with d_1 and d_2 representing textual descriptions of the features that connect c_1 to fw_1 and c_2 to fw_2, respectively.
Two lexical resources are used by all the riddle generation models and are thus essential to this work:
• A lexical-semantic knowledge base (KB) with relation instances that occur in at least three out of ten semantic networks for Portuguese, including dictionaries, wordnets and ConceptNet (see Gonçalo Oliveira (2018) for additional details). It has a total of 45,510 instances and covers a rich set of relation types (see section 3.2).
• A morphology lexicon (Ranchhod et al., 1999) with more than 900,000 Portuguese word forms, their part-of-speech and other grammatical information. It is used by the models to handle inflections and can also be used as a source of words (e.g., in models based on a single word).
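The feature acquisition and pairing step can be sketched as follows. This is only a minimal illustration, assuming the KB is available as a list of (a, relation, b) triples and that the parts of the concept are already lemmatised; the function names are hypothetical and the toy triples are the ones given for direitos humanos in the examples of this paper.

```python
from itertools import product

# Toy KB of (a, relation, b) triples, taken from the paper's own example.
KB = [
    ("direito", "synonymOf", "liso"),
    ("direito", "synonymOf", "plano"),
    ("direito", "antonymOf", "torto"),
    ("humano", "saidAbout", "homem"),
]


def features_of(word, kb):
    """Return every triple in which `word` is one of the two arguments."""
    return [t for t in kb if word in (t[0], t[2])]


def feature_word(word, triple):
    """Return the argument of the triple that is not `word`."""
    a, _, b = triple
    return b if a == word else a


def combine(c1, c2, kb):
    """Pair each feature word of c1 (set F_1) with each one of c2 (set F_2)."""
    f1 = features_of(c1, kb)
    f2 = features_of(c2, kb)
    return [(feature_word(c1, x), feature_word(c2, y)) for x, y in product(f1, f2)]


pairs = combine("direito", "humano", KB)
for fw1, fw2 in pairs:
    print(fw1, fw2)
```

Each resulting pair {fw_1, fw_2} is a candidate combination that later steps may render as a riddle.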

Implemented Models
Inspired both by the schemata of JAPE (Binsted and Ritchie, 1994) and by the kind of jokes that Portuguese children often tell, we implemented six riddle generation models. A key difference is that they produce riddles in Portuguese and are thus constrained by the available lexical resources in this language. All models instantiate the generic model of Figure 1, but follow a different intuition, reflected in their input and, for the last two, in the features explored.
Reinterpretation of compounds (RC): given a known noun+adjective compound (c_1 + c_2), features are acquired for each of its words individually, and used to (re-)define it. Our intuition is that the meaning of a compound is more than just the sum of the meanings of its words, which may result in unexpected associations, possibly perceived as incongruent, and thus humour-prone. The input for this model was a list of 180 Portuguese noun+adjective compounds (Ramisch et al., 2016), with instances such as água doce (fresh water), mau-humor (bad mood), or primeira mão (first hand/leg).
New compounds (NC): explores the idea that humour may result from new words with familiar sounds (homophony, paronymy). It generates new compounds from all pairs of valid words with an edit distance of 1 letter to the original compounds, and then works as RC. With the same list as RC, instances like amido oculto (occult starch), véu aberto (open veil) or primeiro pano (first cloth) are obtained, respectively from amigo oculto (occult friend), céu aberto (open sky) and primeiro plano (first plan).
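A possible way of producing NC candidates is sketched here. It assumes that one word of the compound is changed at a time and that edit distance covers deletions, substitutions and insertions; the alphabet, the function names and the toy lexicon are assumptions for illustration, not the actual resources.

```python
import string

# Assumed alphabet: ASCII letters plus common Portuguese accented letters.
ALPHABET = string.ascii_lowercase + "áàâãéêíóôõúç"


def edits1(word):
    """All strings at edit distance 1 from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    substitutes = {left + ch + right[1:] for left, right in splits if right for ch in ALPHABET}
    inserts = {left + ch + right for left, right in splits for ch in ALPHABET}
    return (deletes | substitutes | inserts) - {word}


def new_compounds(compound, lexicon):
    """Replace one word of the compound by a valid word one edit away."""
    words = compound.split()
    results = set()
    for i, word in enumerate(words):
        for candidate in edits1(word) & lexicon:
            results.add(" ".join(words[:i] + [candidate] + words[i + 1:]))
    return results


# Toy lexicon covering the examples given in the text.
LEXICON = {"amigo", "amido", "oculto", "céu", "véu", "aberto", "primeiro", "plano", "pano"}

print(new_compounds("amigo oculto", LEXICON))
print(new_compounds("primeiro plano", LEXICON))
```

The new compounds can then be handled exactly as in RC.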
Reinterpretation of words (RW): instead of compounds, this model is based on single words, interpreted as a blend of two. This often leads to an unexpected meaning, attributed to the word, again perceived as incongruent, thus increasing the humour potential. Target words are in the lexicon and have the form w_1w_2, where w_1 and w_2 are character sequences that are also in the lexicon. Instances of this kind include malabar (juggle), interpreted as mala+bar (suitcase+bar); centralidade (centrality), interpreted as central+idade (central+age); or restolho (stubble), interpreted as resto+olho (rest+eye).
New blends (NB): analogous to NC, but with respect to RW instead of RC. Before trying to reinterpret words, a small change is made in their orthography, according to a short handcrafted list of possible character sequence replacements that do not change the sound of the word excessively. Those are sequences like {on, un, ão}, {i, e} or {r, l}, which can be replaced in lexicon words and result in instances such as fundasom, obtained from fundação (foundation) and split into funda (catapult) and som (sound); bombesta, from bombista (bomber), split into bom (good) and besta (beast); or calavela, from caravela (caravel), split into cala (shut) and vela (sail). Each part of the instance (c_1 and c_2) must be in the lexicon, and thus be valid, but their blend (w_1w_2) is not necessarily an existing word. The result is a new concept with a sound that resembles a known one, interpreted as the blend of two other (often unrelated) concepts.
Partial antonyms (PA): assumes that the orthography of some words starts or ends with the antonym of another and that novel antonyms may result from replacing the start/end of those words with its antonym. Instances of this kind include pormenor (detail), where menor (smaller) is the antonym of maior (bigger), resulting in the novel antonym pormaior; or diante (in front of), where dia (day) is the antonym of noite (night), resulting in the antonym noitente. To avoid atypical words, some considerations on how syllables are formed and their limits are made. For instance, with a direct replacement, the antonyms of sair, assalto, salvo and negativo would be, respectively, saandar, assbaixo, snegro and negpassivo. With those considerations, they become sandar, asbaixo, salnegro and nepassivo.
Antonymy Blend (AB): revisits the idea that, due to their orthography, some words can be interpreted as a blend of two others. Yet, it is focused on those where both parts (c_1, c_2) have an antonym. This is used in the generation of novel antonyms, such as odiar-atingir (hate-reach), for amarfalhar (to crumple), interpreted as amar+falhar (to love+to fail); sombra-ilegal (illegal-shadow), for solícito (solicitous), interpreted as sol+lícito (sun+lawful); or mau-mau (bad-bad), for bombom (candy), interpreted as bom+bom (good+good).
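The search for words that can be read as a blend may be sketched as below. This is an illustrative approximation: the minimum part length and the optional one-letter overlap (suggested by restolho = resto + olho) are assumptions, and the lexicon is a toy set with the words from the examples.

```python
def blend_splits(word, lexicon, min_len=3, allow_overlap=True):
    """Return (w1, w2) pairs of lexicon words whose blend spells `word`."""
    results = []
    for i in range(min_len, len(word) - min_len + 1):
        w1, w2 = word[:i], word[i:]
        if w1 in lexicon and w2 in lexicon:
            results.append((w1, w2))
        if allow_overlap:
            # Assumed variant: the last letter of w1 is shared with w2.
            overlapped = word[i - 1:]
            if w1 in lexicon and overlapped in lexicon:
                results.append((w1, overlapped))
    return results


# Toy lexicon with the parts used in the examples of the text.
LEXICON = {"mala", "bar", "central", "idade", "resto", "olho"}

for target in ("malabar", "centralidade", "restolho"):
    print(target, blend_splits(target, LEXICON))
```

With the full morphology lexicon, many more splits would be found, so some filtering of very short or rare parts would probably be needed.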
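The direct replacement mentioned above, before any syllable-based adjustment, can be sketched as follows. The antonym pairs are a toy dictionary inferred from the examples in the text, and the function name is hypothetical; the syllable considerations that turn assbaixo into asbaixo are not modelled.

```python
def partial_antonyms(word, antonyms):
    """Replace a prefix or suffix of `word` that has an antonym by that antonym."""
    results = []
    for part, opposite in antonyms.items():
        if part == word:
            continue  # a plain antonym, not a partial one
        if word.startswith(part):
            results.append(opposite + word[len(part):])
        if word.endswith(part):
            results.append(word[:-len(part)] + opposite)
    return results


# Toy antonym pairs, inferred from the examples in the text.
ANTONYMS = {"menor": "maior", "alto": "baixo", "ativo": "passivo"}

for target in ("pormenor", "assalto", "negativo"):
    print(target, partial_antonyms(target, ANTONYMS))
```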
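A simplified sketch of the AB model follows, assuming a dictionary of antonyms and splits without overlapping letters (so that a case such as solícito, where the parts share a letter, would need an extra step). The function name and the toy dictionary, built from the examples above, are assumptions.

```python
def antonymy_blends(word, antonyms):
    """Build novel antonyms for words whose two parts both have an antonym."""
    results = []
    for i in range(1, len(word)):
        c1, c2 = word[:i], word[i:]
        if c1 in antonyms and c2 in antonyms:
            results.append(f"{antonyms[c1]}-{antonyms[c2]}")
    return results


# Toy antonym pairs, taken from the examples in the text.
ANTONYMS = {"amar": "odiar", "falhar": "atingir", "bom": "mau"}

print(antonymy_blends("amarfalhar", ANTONYMS))
print(antonymy_blends("bombom", ANTONYMS))
```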

Riddle Creation
Riddles have to be conveyed in natural language in a way that the relation between the concept and the features is understood. Depending on the types of feature, the text to clarify the relation is set, before selecting how the riddle is to be presented.

Feature Description
In the KB, relation instances connect words according to one of their meanings, but different meanings of the same word are not explicitly identified. Though unintentionally, this enables the selection of less expected meanings because the chance of selecting exactly the same meaning a word has in a compound is low, especially when the global meaning is not the sum of the meanings of its parts (c_1, c_2).
As the relations used are present in at least three other resources, they are generally well-known and consensual. If this constraint is dropped, more features can be extracted, though less immediate. The risk of using incorrect features is also higher, because most available semantic networks for Portuguese are created (semi-)automatically, with minimal or no curation.
From the available relations, we identified a subset of types that could be used as features. Some of those types are listed in Table 1, together with their frequency in the KB, and the text of their description, in Portuguese, to be used in the riddles, followed by a rough English translation. As only synonymy and antonymy are symmetrical relations, some types have different descriptions, depending on the position of the concept word (c_i) and feature word (fw_i) in the relation. There are also relations for which no text is necessary besides the feature word itself. Finally, if fw_i is a noun, it will be preceded by an indefinite article (um for masculine, uma for feminine).

Presentation Mode Selection
Each combination of an initial concept and related features can be presented in different (hard-coded) ways, such as those in Table 2. Those include a definition (DEF); a question with two features for which the initial concept is the answer (FC); a question with the initial concept for which the features are in the answer (CF); or questions that ask for the opposite of the initial concept / pair of features, answered with a pair of features / initial concept (OP1, OP2).
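The five presentation modes can be thought of as text templates with slots for the concept and for the feature text. The sketch below is only illustrative: the English wording of the templates is an assumption, since the actual (Portuguese) templates are those of Table 2, and the function name is hypothetical.

```python
# Hypothetical English templates; the system's own templates are in Portuguese.
TEMPLATES = {
    "DEF": "{concept}: {features}.",
    "FC": "What is {features}? {concept}.",
    "CF": "What is {concept}? {features}.",
    "OP1": "What is the opposite of {concept}? {features}.",
    "OP2": "What is the opposite of {features}? {concept}.",
}


def present(mode, concept, features):
    """Fill the template of the given presentation mode."""
    return TEMPLATES[mode].format(concept=concept, features=features)


print(present("OP1", "procurator (demand+pain)", "supply-joy"))
```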

Rendering
The riddle is finally rendered as text with the concept and the descriptions of its features. Yet, to make the text more natural, additional adaptations have to be made, depending on the presentation mode and on the types of feature involved:
A1 Noun features should appear before other features.
A2 If one feature is a noun and the other an adjective, the description of the adjective is removed, because it is clear enough that the adjective is modifying the noun.
A3 If one feature is a noun and the other is an adjective, the adjective must have the same gender as the noun and is inflected according to the lexicon.
A4 If both features are nouns, que é (what is) is added before the second, and both can be interpreted as 'co-hypernyms.'
A5 If both features are adjectives, the description of the second is removed.
A6 If one feature is an adjective but can also be used as a noun, it is used as a noun.
A7 When necessary, the conjunction e (and) is added between the feature descriptions.
Adaptations A1 to A5 apply to the presentation modes DEF, CF and OP1; A6 is applied to FC; and A7 is applied to all.
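One possible reading of adaptations A1, A2, A4, A5 and A7 is sketched below. It is a simplification under several assumptions: features are plain dictionaries, the description "o mesmo que" is a hypothetical stand-in for the texts of Table 1, the conjunction is only added between two adjectives, and gender agreement (A3) and noun use of adjectives (A6) are left out because they need the morphology lexicon.

```python
def noun_phrase(feature):
    """Precede a noun feature with the indefinite article (um / uma)."""
    article = "uma" if feature["gender"] == "f" else "um"
    return f"{article} {feature['word']}"


def render_features(f1, f2):
    """Join two features into a single text, applying A1, A2, A4, A5 and A7."""
    # A1: noun features come first (sorted is stable, so ties keep their order).
    first, second = sorted([f1, f2], key=lambda f: f["pos"] != "noun")
    kinds = (first["pos"], second["pos"])
    if kinds == ("noun", "adj"):
        # A2: the description of the adjective is dropped.
        return f"{noun_phrase(first)} {second['word']}"
    if kinds == ("noun", "noun"):
        # A4: "que é" is added before the second noun.
        return f"{noun_phrase(first)} que é {noun_phrase(second)}"
    if kinds == ("adj", "adj"):
        # A5 and A7: second description dropped, conjunction "e" added.
        return f"{first['description']} {first['word']} e {second['word']}"
    raise ValueError(f"unsupported feature types: {kinds}")


liso = {"word": "liso", "pos": "adj", "gender": "m", "description": "o mesmo que"}
plano = {"word": "plano", "pos": "adj", "gender": "m", "description": "o mesmo que"}
homem = {"word": "homem", "pos": "noun", "gender": "m", "description": ""}

print(render_features(liso, homem))
print(render_features(liso, plano))
```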

Dissected Examples
To illustrate the previous steps, we present some examples of generated riddles. Consider, for example, the concept direitos humanos (human rights), in the compounds list, for which the features acquired from the KB include:
• direito synonymOf liso (right, flat)
• direito synonymOf plano (right, plane)
• direito antonymOf torto (right, bent)
• humano saidAbout homem (human, man)
From those, the riddles in Table 3 could be the result of the RC model. The instantiation of the generic model that results in the fifth riddle is depicted in Figure 2.
Despite different possible ways for presenting the riddles, a default one was set for each model, namely: FC for RC, except when one of the features is an antonym, with OP2 being used instead; CF for NC, RW, NB; OP1 for PA and AB. Table 4 has examples of riddles generated by each model, covering every rendering adaptation, and including a rough English translation.
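The default presentation mode per model, as just described, amounts to a small lookup with one exception for RC. A minimal sketch, with a hypothetical function name and the relation names used in the KB examples:

```python
# Default presentation mode for each generation model.
DEFAULT_MODE = {"RC": "FC", "NC": "CF", "RW": "CF", "NB": "CF", "PA": "OP1", "AB": "OP1"}


def default_mode(model, relations=()):
    """Pick the default mode; RC switches to OP2 when a feature is an antonym."""
    if model == "RC" and "antonymOf" in relations:
        return "OP2"
    return DEFAULT_MODE[model]


print(default_mode("RC", ["synonymOf", "saidAbout"]))
print(default_mode("RC", ["antonymOf", "saidAbout"]))
```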

Validation
The assessment of the riddles includes highly subjective aspects, namely humour. Also given that the riddles were created for human consumption, we decided to validate our results with the opinion of a general audience. For this purpose, a sample of produced riddles was generated and deployed to the Figure-Eight 1 crowdsourcing platform, in a job called Adivinhas em Português (Riddles in Portuguese), to be answered by human judges from Portugal and Brazil, paid for this purpose. This section describes the assessed aspects, how the sample was generated, and discusses the obtained results.

Assessed Aspects
Since our goal was to produce novel riddles, ideally with humour value, both novelty and humour potential had to be assessed. We also wanted to know if the audience could understand the riddle and if it actually made surprising associations. Subjects were not informed that the riddles had been produced automatically. They were just instructed to use a 5-point Likert scale for scoring the four aspects, given the descriptions in Table 5.
1 https://www.figure-eight.com/

Validation Sample
A 300-riddle sample was created for validation, all rendered with the default presentation mode (see section 3.2.2) and randomly selected from the top-150 riddles by each model, ranked roughly based on the commonality and representativeness of their features. Our intuition was that using more common words in the riddle would improve its understanding. So, the higher the frequency of their features in the CETEMPúblico (Rocha and Santos, 2000) corpus, the higher the rank. Yet, features should be as representative as possible of the concept. For instance, asking for the hyponym of a word with too many hyponyms (e.g., person, plant, instrument) decreases solvability, an aspect considered by Labutov and Lipson (2012). So, the rank was penalised according to the number of other words that hold the same relation to the used features as the concept does.
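The exact ranking formula is not given here, so the following is only one way such a score could look: it rewards corpus frequency of the feature words (commonality) and subtracts a penalty for every other word holding the same relation to the feature word (representativeness). The log scaling, the penalty weight, the triple direction and all names are assumptions.

```python
import math


def rank_score(features, corpus_freq, kb, penalty=1.0):
    """Score a riddle from its (concept, relation, feature_word) features."""
    total = 0.0
    for concept, relation, fw in features:
        # Commonality: more frequent feature words raise the score.
        total += math.log1p(corpus_freq.get(fw, 0))
        # Representativeness: other words related to fw in the same way lower it.
        competitors = sum(1 for a, r, b in kb if r == relation and b == fw and a != concept)
        total -= penalty * competitors
    return total


# Made-up toy data, only to show the effect of the penalty.
TOY_KB = [("a", "rel", "x"), ("b", "rel", "x"), ("c", "rel", "y")]
TOY_FREQ = {"x": 100, "y": 100}

print(rank_score([("a", "rel", "x")], TOY_FREQ, TOY_KB))
print(rank_score([("c", "rel", "y")], TOY_FREQ, TOY_KB))
```

With equal frequencies, the feature shared with fewer other words gets the higher score.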

Results
Each riddle had the four aspects rated by three subjects, enabling us to compute the agreement on answers we knew to be subjective. Table 6 shows the global results of this validation. Considering the sample and rated aspects, it shows the proportion of answers with scores from 1 to 5, three statistical measures and the judge agreement. The latter confirms the subjectivity involved in this validation. Curiously, humour potential is where agreement is higher, due to its lower average rating. Only 6% of the riddles clearly made the subject laugh, but subjects also say that more than 30% could make someone else laugh. On the other hand, 30% of the riddles would make no one laugh. We are not completely satisfied with the results on this aspect but, given the underlying subjectivity, we still think they are interesting and provide a baseline for further iterations of SECO. On average, interpretation was accessible, which means that most riddles could be understood, but only after a second reading. Both surprise and, especially, novelty were rated higher than interpretation. This means that the riddles use some unexpected associations and the majority is effectively novel and previously unknown by the subjects, which was one of our goals.
the riddle sounds like some other riddle you know;
5 Novel: the riddle is unknown and different from those you know.
Humour potential
1 None: the riddle will make no one laugh;
3 Some: the riddle did not make you laugh, but could make someone laugh (child or adult);
5 Great: the riddle made you laugh and has a great humour potential.
Table 5: Assessed aspects and their description
Table 7 has a selection of riddles. A rough English translation was included for each but, in most cases, it hardly transmits the actual meaning in Portuguese. The selection highlights riddles that stand out in some aspect, including riddles with high novelty (1-3) and humour potential (4-6).
Some riddles with low interpretation (7 and others not in the table) are longer than the average, which may explain this rating, but for others we have found no explanation besides, possibly, laziness (8, 9). To stress the underlying subjectivity, we included riddles where subjects highly disagreed on the humour potential (10-14), for which the three given ratings on this aspect are shown. One of them (10) is also one of the few with very low novelty. This is a possible cause of disagreement, as many people will not find a joke so funny after having heard it for the first time. On the other hand, this shows that SECO may generate a minority of familiar riddles, originally created by humans. The remaining riddles in the table are a personal selection with humour rated higher than average.
(What is the opposite of procurator (demand+pain)? supply-joy.)
Table 7: Selection of generated riddles with their average validation scores
On a side note, 42% of the answers were from Portugal and 58% from Brazil. The average score for surprise and humour was the same for both countries and, although Portuguese judges rated interpretation and novelty 0.2 points higher, standard deviations are large.
Table 8 organises the global results according to the generation model. Due both to the high standard deviations and because we are using Likert scales, it is focused on the mode and median. It is clear that not all models result in riddles with the same quality. AB riddles seem to be the easiest to understand, the lowest surprise is for RW, and the highest novelty is for NB, PA and AB. On humour, two models should be highlighted for producing riddles with higher average potential: RC and AB. Yet, so far we have no better explanation than a higher acceptance of this kind of riddles as jokes by the majority of the subjects. For instance, the FC presentation, used by the RC model, relies on a linguistic construction common in jokes.
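A per-model summary based on the mode and median of Likert ratings can be computed with the Python standard library, as sketched here. The ratings are made-up placeholders used only to show the computation; they are not the scores collected in this validation.

```python
from collections import defaultdict
from statistics import median, mode


def summarise(ratings):
    """Group (model, score) ratings by model and report mode and median."""
    by_model = defaultdict(list)
    for model, score in ratings:
        by_model[model].append(score)
    return {m: {"mode": mode(v), "median": median(v)} for m, v in by_model.items()}


# Placeholder ratings on a 5-point scale (not real data).
toy_ratings = [("RC", 3), ("RC", 3), ("RC", 5), ("AB", 1), ("AB", 4), ("AB", 4)]

summary = summarise(toy_ratings)
for model, stats in summary.items():
    print(model, stats)
```

Mode and median are preferred here because Likert scores are ordinal, so averages and standard deviations are harder to interpret.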

Conclusion
We have described six models for generating novel riddles, in Portuguese, from a knowledge base, incorporated in the system SECO. Even though human validation confirmed the subjectivity of their appreciation, overall, subjects found the riddles accessible, surprising, and novel, though with a low humour potential, especially for riddles produced by four of those models. We also highlight that the generated riddles are in Portuguese, a relevant contribution, given that, to the best of our knowledge, there was no previous work of this kind for this language. The abilities to understand and, in this specific case, to use humour are important steps towards making artificial agents more natural. In principle, the proposed models could be integrated in one such agent. We have plans for a Twitter bot that will use them, but are still testing different ways of selecting riddles appropriate for a given context (e.g., recent news / top trends).
Furthermore, we aim at better analysing the results of the human validation including, for instance, correlations between different rated aspects. Using the same criteria for scoring riddles that humans actually use would provide useful insights, but it would require a great number of "consensual" riddles of the tackled kinds, which is highly unlikely. In further iterations of SECO, we will consider including alternative models of riddles or humour (e.g., What's the difference between...) and we plan to study different ways of ranking the produced riddles automatically, not only based on commonality and representativeness, but also on humour-relevant features (e.g., incongruity, presence of taboo words).