PunFields at SemEval-2017 Task 7: Employing Roget's Thesaurus in Automatic Pun Recognition and Interpretation

The article describes a model for the automatic interpretation of English puns, based on Roget's Thesaurus, and its implementation, PunFields. In a pun, the algorithm discovers two groups of words that belong to two main semantic fields. The fields are encoded as a semantic vector, on which an SVM classifier learns to recognize puns. A rule-based model is then applied to recognize intentionally ambiguous (target) words and their definitions. In SemEval Task 7, PunFields shows a fairly good result in pun classification, but requires improvement in searching for the target word and its definition.


Introduction
The following terminology is basic to our research on puns. A pun is a) a short humorous genre in which a word or phrase is intentionally used in two meanings, and b) a means of expression, the essence of which is to use a word or phrase so that, in the given context, it can be understood in two meanings simultaneously. A target word is a word that appears in two meanings. A homographic pun is a pun that "exploits distinct meanings of the same written word" (Miller and Gurevych, 2015) (these can be meanings of a polysemantic word, or homonyms, including homonymic word forms). A heterographic pun is a pun in which the target word resembles another word or phrase in spelling; we will call the latter the second target word. Consider the following example (the Banker joke): "I used to be a banker, but I lost interest." The Banker joke is a homographic pun; interest is the target word. In contrast, the Church joke below is a heterographic pun; propane is the target word, and profane is the second target word: "When the church bought gas for their annual barbecue, proceeds went from the sacred to the propane." Our model of automatic pun analysis is based on the following premise: in a pun, there are two groups of words, and their meanings, that indicate the two meanings in which the target word is used. These groups can overlap, i.e. contain the same polysemantic words used in different meanings.
In the Banker joke, the words and collocations banker and lost interest point to the professional status of the narrator and his or her career failure. At the same time, used to and lost interest tell a story of losing emotional attachment to the profession: the narrator became disinterested. The algorithm of pun recognition that we suggest discovers these two groups of words based on common semes (Subtask 1), finds the words that belong to both groups and chooses the target word (Subtask 2), and, based on the common semes, picks the most suitable meaning that the target word exploits (Subtask 3). In the case of heterographic puns, in Subtask 2, the algorithm looks for the word or phrase that appears in one group and not in the other.

Subtask 1: Mining Semantic Fields
We call a semantic field a group of words and collocations that share a common seme. In taxonomies like WordNet (Kilgarriff and Fellbaum, 2000) and Roget's Thesaurus (Roget, 2004) (further referred to as Thesaurus), semes appear as hierarchies of word meanings. The top levels attract words with more general meanings (hypernyms). For example, Thesaurus has six top-level Classes that divide into Divisions, which divide into Sections, and so on, down to the fifth, lowest level. WordNet's structure is not so transparent. Applying such dictionaries to get semantic fields (the mentioned common groups of words) in a pun is, therefore, the task of finding the two most general hypernyms in WordNet, or two relevant Classes among the six Classes in Thesaurus. We chose Thesaurus because its structure is only five levels deep, its Class labels are not lemmas themselves but arbitrary names (we used numbers instead), and it allows parsing at a certain level and inserting corrections (adding lemmas, merging subsections, etc.). After some experimentation, instead of Classes, we chose to search for relevant Sections, which are 34 subdivisions of the six Classes.
After normalization (lowercasing; part-of-speech tagging, tokenization, and lemmatization with NLTK tools (Bird et al., 2009); collocation extraction; stop-word removal), the algorithm collects Section numbers for every word and collocation and removes duplicates (in Thesaurus, homonyms proper can belong to different subdivisions in the same or different Sections). Table 1 shows which Sections the words of the Banker joke belong to.
Then the semantic vector of a pun is calculated. Every pun $p_i$ is a vector in a 34-dimensional space: $p_i = (s_{1i}, s_{2i}, \ldots, s_{34i})$. The value of every element $s_{ki}$ equals the number of words in the pun that belong to Section $S_k$. The algorithm passes from Section to Section, each time checking every word $w_{ji}$ in the list of extracted words $l_i$. If a word belongs to a Section, the value of $s_{ki}$ increases by 1: $s_{ki} = \sum_{w_{ji} \in l_i} [w_{ji} \in S_k]$, where the bracket equals 1 if the condition holds and 0 otherwise. The semantic vector of the Banker joke is shown in Table 2.
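The counting step described above can be sketched in a few lines of Python. This is only an illustration: the function name `semantic_vector`, the zero-based Section numbering, and the toy lookup table are assumptions, and the lookup does not reproduce real Thesaurus data.

```python
N_SECTIONS = 34  # number of Roget Sections used in the paper; numbered 0..33 here (assumption)


def semantic_vector(words, word_to_sections, n_sections=N_SECTIONS):
    """Count, for every Section, how many words of the pun belong to it."""
    vector = [0] * n_sections
    for word in words:
        # set() drops duplicate Section numbers listed for the same word
        for section in set(word_to_sections.get(word, ())):
            vector[section] += 1
    return vector


# Hypothetical toy lookup, NOT taken from Roget's Thesaurus.
toy_lookup = {
    "banker": [30, 31],
    "interest": [30, 24, 31, 30],  # the repeated 30 is counted once
    "lose": [30, 19],
}

vec = semantic_vector(["banker", "lose", "interest"], toy_lookup)
print(vec[30], vec[31], vec[24], vec[19])
```

Words missing from the lookup simply contribute nothing, which mirrors the fact that only words found in the dictionary can vote for a Section.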
To test the algorithm, we first collected 2484 puns from various Internet resources and then built a corpus of 2484 random sentences, 5 to 25 words long, from different NLTK corpora (Bird et al., 2009), plus several hundred aphorisms and proverbs from various Internet sites. We shuffled both collections and split each of them into two equal groups; the first group of each collection went to the training set, and the second group to the test set. The classification was conducted using different Scikit-learn (Pedregosa et al., 2011) algorithms. We also singled out 191 homographic puns and 198 heterographic puns, and tested them against the same number of random sentences. In all the tests, the Scikit-learn implementation of SVM with the Radial Basis Function (RBF) kernel produced the highest average F-measure, $f = (f_{puns} + f_{random}) / 2$. In addition, its results are more balanced: the difference between precision and recall is smaller (which leads to the highest F-measure scores) both within the two classes (puns and random sentences) and between the classes (average scores). Table 3 illustrates the results of different algorithms in the class "Puns" (not the average between puns and non-puns). The results were higher for the split selection, reaching F-measure scores of 0.79 (homographic) and 0.78 (heterographic). The common selection reached a maximum average F-measure of 0.7 in several tests. The higher results of the split selection may be due to a larger training set.
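The average F-measure used for comparing classifiers can be computed as in the following sketch. It is written in plain Python rather than with Scikit-learn so that it stays self-contained; the function names and the four-sentence toy data are assumptions made for illustration and are not results from the paper.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f(gold, predicted, labels=("pun", "random")):
    """Average of the per-class F-measures: (f_puns + f_random) / 2."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        scores.append(f_measure(tp, fp, fn))
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from the experiments.
gold = ["pun", "pun", "random", "random"]
pred = ["pun", "random", "random", "random"]
avg = average_f(gold, pred)
print(round(avg, 4))
```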

Subtask 2: Hitting the Target Word
We suggest that, in a homographic pun, the target word is a word that belongs to two semantic fields at once; in a heterographic pun, the target word belongs to at least one discovered semantic field and does not belong to the other. In reality, however, words in a sentence tend to belong to too many fields, which creates noise in the search. To reduce the influence of noisy fields, we added non-semantic features to the model: the tendency of the target word to occur at the end of a sentence, and the part-of-speech distribution given in (Miller and Gurevych, 2015).
A-group ($W_A$) and B-group ($W_B$) are the groups of words in a pun that belong to the two semantic fields sharing the target word. Thus, for some $s_{ki}$, $k$ becomes $A$ or $B$. A-group attracts the maximum number of words in a pun: $s_{Ai} = \max_k s_{ki}$.
Table 2 (semantic vector of the Banker joke): $p_{Banker}$ = {1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 2, 1, 1, 0, 0, 0, 4, 2, 0, 0}
In the Banker joke, $s_{Ai} = 4$, $A = 30$ (Possessive Relations); the words that belong to this group are use, lose, banker, interest. B-group is the second largest group in a pun: $s_{Bi} = \max_{k \neq A} s_{ki}$. In the Banker joke, $s_{Bi} = 2$. There are three groups with two words each: $B_1 = 19$, Results Of Reasoning: be, lose; $B_2 = 24$, Volition In General: use, interest; $B_3 = 31$, Affections In General: banker, interest. Ideally, there should be one group of about three words and collocations describing a person's inner state (used to be, lose, interest), and the two words lose and interest, which are also in $W_A$, would form the target phrase. However, due to the shortage of data about collocations in dictionaries, $W_B$ is split into several smaller groups. Consequently, to find the target word, we have to appeal to other word features. In testing the system on homographic puns, we relied on the polysemantic character of words. If a joke has more than one value of $B$, the $W_B$ candidates merge into one group, with duplicates removed, and every word in $W_B$ becomes a target word candidate: $c \in W_B$.
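As an illustration of how the A-Section and the B-Sections follow from the semantic vector, the sketch below applies the two selection rules to the Banker vector quoted in the text. The function name and the tie-breaking rule for A (the first Section with the maximum value) are assumptions; Sections are indexed from zero, which matches A = 30 and B = 19, 24, 31 in the example.

```python
BANKER = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
          0, 0, 2, 0, 1, 0, 0, 2, 1, 1, 0, 0, 0, 4, 2, 0, 0]


def a_and_b_sections(vector):
    """Return the A-Section and the list of B-Sections of a semantic vector."""
    a_value = max(vector)
    a_section = vector.index(a_value)  # assumption: ties for A go to the first Section
    # B-Sections hold the largest value among all Sections other than A.
    b_value = max(v for k, v in enumerate(vector) if k != a_section)
    b_sections = [k for k, v in enumerate(vector)
                  if v == b_value and k != a_section]
    return a_section, b_sections


a_section, b_sections = a_and_b_sections(BANKER)
print(a_section, b_sections)
```

When several Sections tie for second place, as here, all of them are returned, and their word lists can then be merged into a single candidate list.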
In the Banker joke, $W_B$ is the list be, lose, use, interest, banker; $B = \{19, 24, 31\}$. Based on the definition of the target word in a homographic pun, words from $W_B$ that are also found in $W_A$ should have a privilege. Therefore, the first value each word gets, $v_\alpha$, is the output of a Boolean function that checks whether the word also belongs to $W_A$. The second value, $v_\beta$, is the absolute frequency of the word in the union of $B_1$, $B_2$, etc., including duplicates. The third value, $v_\gamma$, is the word's position in the sentence: the closer the word is to the end, the bigger this value is. If the word occurs several times, the algorithm takes the average of its position numbers.
The fourth value, $v_\delta$, is the part-of-speech probability: depending on the part of speech the word belongs to, it receives a corresponding rate. The final step is to compute the rates using multiplicative convolution and to choose the word with the maximum rate: $\arg\max_{c \in W_B} v_\alpha(c)\, v_\beta(c)\, v_\gamma(c)\, v_\delta(c)$. Values of the Banker joke are illustrated in Table 4.
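The multiplicative convolution amounts to multiplying the four values of each candidate and taking the candidate with the largest product. The sketch below shows only this final step; the function name and all feature values are hypothetical and are not the figures from Table 4.

```python
from math import prod


def best_candidate(features):
    """features maps a word to its (v_alpha, v_beta, v_gamma, v_delta) tuple.

    Returns the word with the largest product of values and all the rates.
    """
    rates = {word: prod(values) for word, values in features.items()}
    winner = max(rates, key=rates.get)
    return winner, rates


# Hypothetical feature values chosen for illustration only.
toy_features = {
    "be": (1, 1, 4, 0.1),
    "lose": (2, 1, 6, 0.2),
    "interest": (2, 2, 7, 0.5),
}

winner, rates = best_candidate(toy_features)
print(winner, rates)
```

Because the values are multiplied, a single zero would eliminate a candidate completely, so any scheme of this kind has to keep every feature strictly positive for words that should remain in the running.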
In the solution for heterographic puns, we built a different model of B-group. Unlike in homographic puns, here the target word is missing from $W_B$ (the reader has to guess the word or phrase that is homonymous to the target word). Accordingly, we rely on the completeness of the union of $W_A$ and $W_B$: among the candidates for $W_B$ (the second largest groups), the relevant groups are those that form the longest list with $W_A$ (duplicates removed). In Ex. 2 (the Church joke), $W_A$ = go, gas, annual, barbecue, propane, and two groups form the largest union with it: $W_B$ = buy, proceeds + sacred, church. Every word in $W_A$ and $W_B$ can be the target word. The privilege passes to words used in only one of the groups. Ergo, the first value, $v_\alpha$, is the output of a Boolean function that checks whether the word is used in only one of the two groups. Frequencies are of no value here; the values of position in the sentence and part-of-speech distribution remain the same. The function output is: $\arg\max_{c \in W_A \cup W_B} v_\alpha(c)\, v_\gamma(c)\, v_\delta(c)$. Values of the Church joke are illustrated in Table 5.
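The selection of B-groups by the size of their union with the A-group can be sketched as follows. The function names, the group labels B1 to B3, and the third candidate group are assumptions added for illustration; only the A-group and the first two candidate groups come from the Church joke example.

```python
def pick_b_groups(w_a, b_candidates):
    """Keep the candidate groups whose union with W_A is the longest."""
    union_sizes = {name: len(set(w_a) | set(words))
                   for name, words in b_candidates.items()}
    best = max(union_sizes.values())
    return [name for name, size in union_sizes.items() if size == best]


def exclusive_words(w_a, w_b):
    """Words used in exactly one of the two groups (the privileged candidates)."""
    return set(w_a) ^ set(w_b)


w_a = ["go", "gas", "annual", "barbecue", "propane"]
candidates = {
    "B1": ["buy", "proceeds"],
    "B2": ["sacred", "church"],
    "B3": ["gas", "church"],  # hypothetical group that overlaps with W_A
}

chosen = pick_b_groups(w_a, candidates)
merged_b = [w for name in chosen for w in candidates[name]]
print(chosen, sorted(exclusive_words(w_a, merged_b)))
```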

Subtask 3: Mapping Roget's Thesaurus to WordNet
In the last phase, we implemented an algorithm that maps Roget's Sections to synsets in WordNet. In homographic puns, the definitions of a word in WordNet are analyzed in the same way as the words of a pun when searching for the semantic fields they belong to. For example, words from the definitions of the synsets of interest belong to the following Roget's Sections: Synset(interest.n.01) = a sense of concern with and curiosity about someone or something: 21, 19, 31, 24, 1, 30, 6, 16, 3, 31, 19, 12, 2, 0; Synset(sake.n.01) = a reason for wanting something done: 15, 24, 18, 7, 19, 11, 2, 31, 24, 30, 12, 2, 0, 26, 24, etc. When the A-Section is discovered (for example, in the Banker joke, A = 30 (Possessive Relations)), the synset with the maximum number of words in its definition that belong to the A-Section becomes the A-synset. The B-synset is found likewise for the B-group, with the exception that it should not coincide with the A-synset. In heterographic puns, the B-group is also a marker of the second target word. Every word in the index of Roget's Thesaurus is compared to the known target word using Damerau-Levenshtein distance. The list is sorted in increasing order of distance, and the algorithm checks which Roget's Sections every word belongs to until it finds a word that belongs to a Section (or the Section, if there is only one) in the B-group. This word becomes the second target word. However, as we had no trial data except for the four examples released before the competition, the first trials of the program on a large collection returned many errors, so we changed the algorithm for the B-group as follows.
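The search for the second target word can be sketched as below. The text does not say which variant of Damerau-Levenshtein distance was used, so the sketch assumes the restricted (optimal string alignment) variant; the function names, the toy index, and its Section numbers are hypothetical.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def second_target(target, index, b_sections):
    """Return the closest index word that belongs to one of the B-Sections."""
    ranked = sorted((w for w in index if w != target),
                    key=lambda w: osa_distance(target, w))
    for word in ranked:
        if index[word] & b_sections:
            return word
    return None


# Hypothetical index: word -> set of Section numbers (not real Thesaurus data).
toy_index = {"propane": {10}, "profane": {33}, "propene": {10}, "plane": {5}}
found = second_target("propane", toy_index, {33})
print(found)
```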
Homographic puns, first run. B-synset is calculated on the basis of sense frequencies (the output is the most frequent sense). If it coincides with A-synset, the program returns the second most frequent synset.
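The first-run rule for homographic puns can be sketched as follows: the A-synset is the one whose definition contains the most words from the A-Section, and the B-synset is the most frequent sense that differs from it. The function names, the synset labels, and the Section lists are hypothetical; in practice the frequency order would come from WordNet.

```python
def pick_a_synset(definition_sections, a_section):
    """Synset whose definition has the most words belonging to the A-Section.

    definition_sections maps a synset name to the list of Section numbers
    collected from the words of its definition.
    """
    return max(definition_sections,
               key=lambda name: definition_sections[name].count(a_section))


def pick_b_synset(synsets_by_frequency, a_synset):
    """Most frequent sense that does not coincide with the A-synset."""
    for name in synsets_by_frequency:
        if name != a_synset:
            return name
    return None


# Hypothetical data for illustration only.
toy_sections = {
    "sense.x": [30, 30, 19],
    "sense.y": [24, 31, 30],
    "sense.z": [19, 2],
}
frequency_order = ["sense.x", "sense.y", "sense.z"]  # most frequent first

a_syn = pick_a_synset(toy_sections, 30)
b_syn = pick_b_synset(frequency_order, a_syn)
print(a_syn, b_syn)
```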
Homographic puns, second run. B-synset is calculated on the basis of Lesk distance, using the built-in NLTK Lesk function (Bird et al., 2009). If it coincides with A-synset, the program returns another synset on the basis of sense frequencies, as in the first run.
Heterographic puns, first run. The second target word is calculated on the basis of Thesaurus and Damerau-Levenshtein distance; words missing from Thesaurus are analyzed through their WordNet hypernyms. In both runs for heterographic puns, synsets are calculated using the Lesk distance.
Heterographic puns, second run. The second target word is calculated on the basis of the Brown corpus (NLTK (Bird et al., 2009)): if a word stands in the same context in Brown as in the pun, it becomes the target word. The size of the context window is (0; +3) for verbs, (0; +2) for adjectives, and (-2; +2) for nouns, adverbs and other parts of speech, within the sentence where the word is used. Table 6 illustrates the competition results of our system.
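The part-of-speech-dependent context windows of the second heterographic run can be illustrated with the sketch below. It only extracts the window around a word; the comparison with the Brown corpus is not shown. The function name and the tag names VERB, ADJ and NOUN are assumptions, since the text does not state which tagset was used.

```python
# Window sizes from the text: (left offset, right offset) relative to the word.
WINDOWS = {"VERB": (0, 3), "ADJ": (0, 2)}
DEFAULT_WINDOW = (-2, 2)  # nouns, adverbs and other parts of speech


def context(tokens, index, pos):
    """Return the words around tokens[index] that fall inside the window."""
    left, right = WINDOWS.get(pos, DEFAULT_WINDOW)
    start = max(0, index + left)                 # do not cross the sentence start
    end = min(len(tokens), index + right + 1)    # do not cross the sentence end
    return tokens[start:index] + tokens[index + 1:end]


tokens = ["proceeds", "went", "from", "the", "sacred", "to", "the", "propane"]
verb_context = context(tokens, 1, "VERB")
noun_context = context(tokens, 7, "NOUN")
print(verb_context, noun_context)
```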

Conclusion
The system that we introduced is based on one general supposition about the semantic structure of puns and combines two types of algorithms: supervised learning and rule-based. Not surprisingly, the supervised learning algorithm showed better results in solving an NLP task than the rule-based one. In this implementation, we also tried to combine two very different dictionaries (Roget's Thesaurus and WordNet). Although the reliability of Thesaurus in reproducing a universal semantic map can be doubted, it seems to be quite an effective source of data when used in Subtask 1. Judging by the test results, the attempts to map it to WordNet seem rather weak so far, which also raises a question: if different dictionaries treat the meanings of words differently, can there be an objective and/or universal semantic map to apply as the foundation for any WSD task?