Pun-GAN: Generative Adversarial Network for Pun Generation

In this paper, we focus on the task of generating a pun sentence given a pair of word senses. A major challenge for pun generation is the lack of a large-scale pun corpus to guide supervised learning. To remedy this, we propose an adversarial generative network for pun generation (Pun-GAN). It consists of a generator that produces pun sentences and a discriminator that distinguishes between the generated pun sentences and real sentences with specific word senses. The output of the discriminator is then used as a reward to train the generator via reinforcement learning, encouraging it to produce pun sentences which can support two word senses simultaneously. Experiments show that the proposed Pun-GAN generates sentences that are more ambiguous and diverse in both automatic and human evaluation.


Introduction
Generating creative and interesting text is a key step towards building an intelligent natural language generation system. A pun is a clever and amusing use of a word with two meanings (word senses), or of words with the same sound but different meanings (Miller and Gurevych, 2015). In this paper, we focus on the former type of pun, i.e., homographic puns. For example, "I used to be a banker but I lost interest" is a pun sentence because the pun word "interest" can be interpreted as either curiosity or profits.
An intractable problem for pun generation is the lack of a large-scale pun corpus in which each pun sentence is labeled with two word senses. Early studies (Hong and Ong, 2009; Valitutti et al., 2013; Petrovic and Matthews, 2013) are mainly based on templates and rules, and thus lack creativity and flexibility. The first endeavor to apply neural networks to this task adopts a constrained neural language model (Mou et al., 2015) to guarantee that a pre-given word sense appears in the generated sequence. However, that approach only integrates the generation probabilities of the two word senses during inference-time decoding, without checking during training whether the generated sentences can indeed support both senses. Promisingly, Word Sense Disambiguation (WSD) (Pal and Saha, 2015), which aims at identifying the correct meaning of a word in a sentence via a multi-class classifier, can help detect pun sentences to some extent.
Based on the above motivations, we introduce the Generative Adversarial Net (Goodfellow et al., 2014) into the pun generation task. Specifically, the generator can be any model that is able to generate a pun sentence containing a given word with two specific senses. The discriminator is a word sense classifier that assigns a real sentence to its correct word sense label and a generated pun sentence to a fake label. With such a framework, the discriminator can provide a well-designed ambiguity reward to the generator, thus encouraging the ambiguity of the generated sentence via reinforcement learning (RL) to achieve the goal of punning, without using any pun corpus. The code is available at: https://github.com/lishunyao97/Pun-GAN.
Evaluation of pun generation is also challenging. We conduct both automatic and human evaluations. The results show that the proposed Pun-GAN generates higher-quality pun sentences, especially in terms of ambiguity and diversity.

Model
The sketch of the proposed Pun-GAN is depicted in Figure 1. It consists of a pun generator $G_\theta$ and a word sense discriminator $D_\phi$. The following sections elaborate on the architecture of Pun-GAN and its training algorithm.

Generator
Given two senses $(s_1, s_2)$ of a target word $w$, the generator $G_\theta$ aims to output a sentence $x$ which not only contains the target word $w$ but also expresses the two corresponding meanings. For simplicity and ease of training, we adopt the neural constrained language model of prior work as the generator. Due to space constraints, we refer readers to the original paper for details. Compared with a traditional neural language model, the main difference is that the word generated at each timestep should maximize the sum of two probabilities, which are calculated with $s_1$ and $s_2$ as input, respectively. Formally, the generation probability over the entire vocabulary at the $t$-th timestep is calculated as

$$G_\theta(x_t \mid x_{<t}) = f(W h_t^1 + b) + f(W h_t^2 + b), \tag{1}$$

where $h_t^1$ ($h_t^2$) is the hidden state of the $t$-th step when taking $s_1$ ($s_2$) as input, $W$ and $b$ are the parameters of the output layer, $f$ is the softmax function, and $x_{<t}$ denotes the preceding $t-1$ words. Therefore, the generation probability of the whole sentence $x$ is formulated as

$$G_\theta(x \mid s_1, s_2) = \prod_t G_\theta(x_t \mid x_{<t}). \tag{2}$$

To give the generator a warm start, we pre-train it on the same general training corpus as in the original paper.
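The per-timestep joint decoding described above can be illustrated with a minimal sketch. It assumes toy logits in place of the outputs of a trained constrained language model; the vocabulary, the logit values, and the function names (`joint_step_scores`, `pick_word`) are hypothetical and are not taken from the original implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def joint_step_scores(logits_s1, logits_s2):
    """Sum the two sense-conditioned distributions for one timestep.

    logits_s1 / logits_s2 stand in for the pre-softmax outputs computed from
    the hidden states obtained with s1 and s2 as input (toy values here).
    The summed scores add up to 2 and are not renormalised.
    """
    p1 = softmax(logits_s1)
    p2 = softmax(logits_s2)
    return [a + b for a, b in zip(p1, p2)]

def pick_word(vocab, logits_s1, logits_s2):
    """Greedy choice: the word with the maximum summed probability."""
    scores = joint_step_scores(logits_s1, logits_s2)
    best = max(range(len(vocab)), key=lambda i: scores[i])
    return vocab[best], scores

# Hypothetical four-word vocabulary and logits under each sense.
vocab = ["money", "curiosity", "lost", "river"]
logits_s1 = [2.0, 0.1, 1.5, -1.0]   # toy logits when conditioning on s1
logits_s2 = [0.1, 2.0, 1.8, -1.0]   # toy logits when conditioning on s2

word, scores = pick_word(vocab, logits_s1, logits_s2)
print(word)
```

In this toy setting, the word that is moderately likely under both senses wins over words that are strongly preferred by only one sense, which is the behaviour the summed probability is meant to encourage.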

Discriminator
The discriminator is extended from word sense disambiguation models (Kågebäck and Salomonsson, 2016; Luo et al., 2018a,b). Assuming the pun word $w$ in sentence $x$ has $k$ word senses, we add a new "generated" class. The discriminator is then designed to produce a probability distribution over $k + 1$ classes, which is computed as

$$D_\phi(y \mid x) = \mathrm{softmax}(U_w c), \tag{3}$$

where $c$ is the context vector from a bi-directional LSTM when taking $x$ as input, $U_w$ is a word-specific parameter and $y$ is the target label. Therefore, $D_\phi(y = i \mid x)$, $i \in \{1, ..., k\}$, denotes the probability that $x$ belongs to the real $i$-th word sense, while $D_\phi(y = k + 1 \mid x)$ denotes the probability that it is produced by a pun generator.
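As a rough illustration of the (k + 1)-way output layer, the sketch below applies a softmax to the product of a word-specific matrix and a context vector. The bi-directional LSTM is not implemented; the context vector `c`, the matrix `U_w`, and all numeric values are hypothetical stand-ins.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discriminator_probs(U_w, c):
    """Distribution over k real senses plus one extra 'generated' class.

    U_w: word-specific matrix with k + 1 rows (one per class), as a list of rows.
    c:   context vector (here a toy stand-in for the bi-LSTM output).
    """
    logits = [sum(u * x for u, x in zip(row, c)) for row in U_w]
    return softmax(logits)

# Toy example: a word with k = 2 senses and a 3-dimensional context vector.
U_w = [
    [1.0, 0.0, 0.5],    # sense 1
    [0.2, 1.0, 0.0],    # sense 2
    [-0.5, -0.5, 1.0],  # "generated" class (index k)
]
c = [0.8, 0.6, 0.1]

probs = discriminator_probs(U_w, c)
p_generated = probs[-1]  # probability that the sentence came from the generator
print(probs)
```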

Training
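We follow the training techniques of Salimans et al. (2016), who apply GANs to semi-supervised learning. For a real sentence $x$, if it is sense-labeled, $D_\phi$ should classify $x$ to its correct word sense label $y$; otherwise, $D_\phi$ should classify $x$ to any one of the $k$ labels. For a generated sentence $x$, $D_\phi$ should classify $x$ to the $(k + 1)$-th "generated" label. Thus, the training objective of the discriminator is to minimize:

$$J(\phi) = -\mathbb{E}_{x, y \sim p_{data}} \log D_\phi(y \mid x, y < k+1) - \mathbb{E}_{x \sim p_{data}} \log\big(1 - D_\phi(y = k+1 \mid x)\big) - \mathbb{E}_{x \sim G_\theta} \log D_\phi(y = k+1 \mid x), \tag{4}$$

where $p_{data}$ denotes the distribution of real sentences, each of which supports only one word sense.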
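The three cases in the discriminator objective (labeled real, unlabeled real, and generated sentences) can be sketched as three loss terms over a (k + 1)-way probability vector. This is a minimal sketch under stated assumptions: the labeled term is renormalised over the k real senses in the style of Salimans et al. (2016), each term is averaged over its own mini-batch, and the function names and toy probabilities are hypothetical.

```python
import math

def labeled_real_loss(probs, y):
    """Labeled real sentence: -log of the correct sense, renormalised over
    the k real senses (assumption: conditioning on 'not generated')."""
    real_mass = sum(probs[:-1])
    return -math.log(probs[y] / real_mass)

def unlabeled_real_loss(probs):
    """Unlabeled real sentence: any of the k real senses is acceptable,
    so penalise only the mass on the 'generated' class."""
    return -math.log(1.0 - probs[-1])

def generated_loss(probs):
    """Generated sentence: should be assigned to the extra (k + 1)-th class."""
    return -math.log(probs[-1])

def discriminator_loss(labeled, unlabeled, generated):
    """Sum of the three parts, each averaged over its own mini-batch."""
    part1 = sum(labeled_real_loss(p, y) for p, y in labeled) / len(labeled)
    part2 = sum(unlabeled_real_loss(p) for p in unlabeled) / len(unlabeled)
    part3 = sum(generated_loss(p) for p in generated) / len(generated)
    return part1 + part2 + part3

# Toy discriminator outputs for a word with k = 2 senses (last entry = generated).
labeled = [([0.7, 0.2, 0.1], 0)]    # (probabilities, gold sense index)
unlabeled = [[0.45, 0.45, 0.1]]
generated = [[0.05, 0.05, 0.9]]

loss = discriminator_loss(labeled, unlabeled, generated)
print(loss)
```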
To encourage the generator to produce pun text, the discriminator is required to assign a higher reward to ambiguous pun text which can be interpreted with two meanings simultaneously. For a pun sentence, the probabilities of the two target senses, $D_\phi(s_1 \mid x)$ and $D_\phi(s_2 \mid x)$, should not only have a small gap but also jointly account for most of the probability mass. For example, (0.1, 0.5, 0.4) and (0.1, 0.8, 0.1) are two probability distributions output by $D_\phi$. The former is more likely to be a pun with the second (0.5) and third (0.4) meanings, while the latter is most likely a generic single-sense sentence with the second meaning (0.8). Based on the above observations, the reward is designed as

$$r = \frac{D_\phi(s_1 \mid x) + D_\phi(s_2 \mid x)}{\left| D_\phi(s_1 \mid x) - D_\phi(s_2 \mid x) \right| + 1}, \tag{5}$$

where 1 is a coefficient that avoids the denominator being zero. Then, the goal of generator training is to minimize the negative expected reward:

$$J(\theta) = -\sum_k G_\theta(x^{(k)} \mid s_1, s_2) \, r^{(k)}, \tag{6}$$

where $x^{(k)}$ is the $k$-th sampled sequence and $r^{(k)}$ is the reward of $x^{(k)}$. By means of the policy gradient method (Williams, 1992), for each pair of senses $(s_1, s_2)$, the expected gradient of Eq. 6 can be approximated as:

$$\nabla_\theta J(\theta) \approx -\frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log G_\theta(x^{(k)} \mid s_1, s_2) \, r^{(k)}, \tag{7}$$

where $K$ is the sample size. Similar to other GANs (Salimans et al., 2016; Yu et al., 2017), the generator and discriminator are trained alternately.
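The ambiguity reward and the sampled policy-gradient estimate can be sketched as follows. The reward function assumes the form suggested by the text: the sum of the two target-sense probabilities divided by their absolute gap plus 1. The policy is a toy categorical distribution over two candidate sentences rather than a sequence model, so the gradient of the log-probability with respect to the logits has the closed form "one-hot minus probabilities". All names and values are hypothetical.

```python
import math
import random

def ambiguity_reward(probs, s1, s2):
    """Large when the two target senses have a small gap and a large total mass."""
    p1, p2 = probs[s1], probs[s2]
    return (p1 + p2) / (abs(p1 - p2) + 1.0)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_estimate(logits, rewards, K, rng):
    """REINFORCE estimate of the gradient of J = -E[r] w.r.t. the logits
    of a toy categorical policy over candidate sentences."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(K):
        # Sample one candidate index from the current policy.
        k = rng.choices(range(len(logits)), weights=probs)[0]
        for j in range(len(logits)):
            indicator = 1.0 if j == k else 0.0
            # d log p(k) / d logit_j = indicator - probs[j]
            grad[j] -= rewards[k] * (indicator - probs[j]) / K
    return grad

# The two distributions from the text, with the second and third senses as targets.
pun_like = ambiguity_reward([0.1, 0.5, 0.4], 1, 2)      # 0.9 / 1.1
single_sense = ambiguity_reward([0.1, 0.8, 0.1], 1, 2)  # 0.9 / 1.7

# Toy policy: two candidate sentences, initially equally likely, K = 32 samples.
rng = random.Random(0)
grad = policy_gradient_estimate([0.0, 0.0], [pun_like, single_sense], 32, rng)
print(pun_like, single_sense, grad)
```

A gradient-descent step on these logits would, in expectation, shift probability towards the candidate with the higher ambiguity reward.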

Dataset
Training Dataset: In line with previous work, we use a generic corpus, English Wikipedia, to train Pun-GAN. For the generator, we first tag each word in the English Wikipedia corpus with one word sense using an unsupervised WSD tool. Then we use the 2,595K tagged corpus to pre-train our generator. For the discriminator, we use several types of data for training: 1) SemCor (Luo et al., 2018a,b), a manually annotated corpus for WSD consisting of 226K sense annotations (first part in Eq. 4); 2) the Wikipedia corpus as an unlabeled corpus (second part in Eq. 4); 3) generated puns (third part in Eq. 4).
Evaluation Dataset: We use the pun dataset from SemEval 2017 Task 7 (Miller et al., 2017) for evaluation. The dataset consists of 1274 human-written puns in which the target pun words are annotated with two word senses. During testing, we extract the word sense pair as the input of our model.

Experimental Setting
The generator follows the configuration of the original constrained language model. The discriminator is a single-layer bi-directional LSTM with hidden size 128. We randomly initialize word embeddings with a dimension of 300. The sample size K is set to 32. The batch size is 32 and the learning rate is 0.001. The optimization algorithm is SGD. Before adversarial training, we pre-train the generator for 5 epochs and the discriminator for 4 epochs. In adversarial training, the generator is trained every 1 step and the discriminator is trained every 5 steps.

Table 2: Human evaluation results.
Model                     Ambiguity  Fluency  Overall
CLM (Mou et al., 2015)    2.0        2.1      2.0
CLM+JD                    3.4        3.6      3.5
Pun-GAN                   3.9        3.7      3.8

Baselines
We compare with the following systems:
LM (Mikolov et al., 2010): It is a normal recurrent neural language model which takes the target pun word as input.
CLM (Mou et al., 2015): It is a constrained language model which guarantees that a pre-given word will appear in the generated sequence.
CLM+JD: It is a state-of-the-art model for pun generation which extends a constrained language model by jointly decoding conditioned on two word senses.

Evaluation Metrics
Automatic evaluation: We use two metrics to automatically evaluate the creativeness of the generated puns in terms of unusualness and diversity. Following Pauls and Klein (2012) and He et al. (2019), unusualness is measured by subtracting the log-probability of training sentences from the log-probability of generated pun sentences. Following prior work, diversity is measured by the ratio of distinct unigrams (Dist-1) and bigrams (Dist-2) in generated sentences.
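The diversity metric can be sketched in a few lines. This sketch assumes whitespace tokenization and computes the ratio over the whole set of generated sentences; the function name `distinct_n` and the example sentences are illustrative choices, and the evaluation scripts used in the paper may differ in such details.

```python
def distinct_n(sentences, n):
    """Ratio of distinct n-grams to total n-grams over tokenized sentences."""
    total = 0
    seen = set()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    return len(seen) / total if total else 0.0

# Two toy generated sentences, split on whitespace.
generated = [
    "it is a bank".split(),
    "it is a touch with the red sox".split(),
]

dist1 = distinct_n(generated, 1)  # 9 distinct unigrams out of 12
dist2 = distinct_n(generated, 2)  # 8 distinct bigrams out of 10
print(dist1, dist2)
```

Sentences that repeat a generic opening such as "it is a" lower both ratios, which is why the metric penalises repetitive output.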
Human evaluation: Three annotators score 100 randomly sampled outputs of the different systems from 1 to 5 on three criteria. Ambiguity evaluates how likely the sentence is to be a pun. Fluency measures whether the sentence is fluent. Overall is a comprehensive metric.

Results
Table 1 and Table 2 show the results of automatic evaluation and human evaluation, respectively. We find that: 1) Pun-GAN achieves the best ambiguity score. This is in line with our expectation that adversarial training can better achieve the aim of punning; 2) Compared with CLM+JD, which is actually the same as our pre-trained generator, Pun-GAN shows a large improvement in unusualness. We assume that this is because, through adversarial training, the discriminator encourages the generator to produce more creative and unexpected sentences to some extent; 3) Pun-GAN generates more diverse sentences with different tokens and words. This phenomenon accords with previous work on GANs (Wang and Wan, 2018).
In addition, Table 3 shows the A/B tests between the two models. It shows that Pun-GAN can generate more vivid pun sentences than the previous best model, CLM+JD. However, there still exists a big gap between generated puns and human-written puns. To conclude, both automatic evaluation and human evaluation show the effectiveness of the proposed Pun-GAN, especially in ambiguity and diversity.

Ablation Study
In order to validate the effectiveness of adversarial learning, we fix the discriminator after pre-training. Table 4 shows the results, from which we can conclude that adversarial learning helps improve the creativeness of generated puns.

Pun-GAN: It is a touch with the red sox.
Human: The massage which came with the spa treatment was a nice touch.

state
s1: an organized political community forming part of a country.
s2: mode or condition of being.
CLM+JD: According to the state, he was the first time in the united states.
Pun-GAN: In the state, the national assembly was established.
Human: Many people need to learn to be happy with the state they are in.

The examples above are generated by CLM+JD and Pun-GAN. Human-written puns are also given. They demonstrate that, compared with CLM+JD, Pun-GAN can generate puns which are closer to the funniness and creativeness of human-written puns. However, both CLM+JD and Pun-GAN may sometimes generate short sentences. Since overly short sentences lack sufficient context, they tend to be ambiguous. More analysis can be found in Section 3.8.

Error Analysis
We carefully analyze the generated results of Pun-GAN with low overall scores in human evaluation. Fig 3 shows the proportion of different error types. The most common type of error is generating a sentence which only supports a single word sense. This accords with our expectations, since generating a sentence which can support two word senses without any labeled corpus is very hard. Another common type of error is generating overly generic sentences, for example, "It is a bank". In most instances, these generic sentences are very short and begin with a pronoun, such as "It is" or "He can". The reasons are two-fold. One is that this type of sentence can get a high generation probability, since the generator is actually a language model. The other is that this type of sentence can still receive a reasonably good reward, since it is indeed ambiguous. Moreover, grammatical errors account for about one fifth of the errors. We hypothesize that they are caused by the joint generation process in Eq. 2.

Conclusion and Future Work
In this paper, we propose Pun-GAN: a generative adversarial network for pun generation. It consists of a pun generator and a word sense discriminator, which unifies the tasks of pun generation and word sense disambiguation. Even though Pun-GAN does not require any pun corpus, it can still enhance the ambiguity of the sentences produced by the generator via the reward from the discriminator to achieve the goal of punning. Pun-GAN is generic and flexible, and may be extended to other constrained text generation tasks in future work.