Automatic Poetry Generation with Mutual Reinforcement Learning

Poetry is one of the most beautiful forms of human language art. As a crucial step towards computer creativity, automatic poetry generation has drawn researchers' attention for decades. In recent years, some neural models have made remarkable progress in this task. However, they are all based on maximum likelihood estimation, which only learns common patterns of the corpus and results in a loss-evaluation mismatch. Human experts evaluate poetry in terms of specific criteria, rather than word-level likelihood. To handle this problem, we directly model these criteria and use them as explicit rewards to guide gradient updates via reinforcement learning, motivating the model to pursue higher scores. In addition, inspired by writing theories, we propose a novel mutual reinforcement learning schema. We simultaneously train two learners (generators) which learn not only from the teacher (rewarder) but also from each other to further improve performance. We experiment on Chinese poetry. Based on a strong basic model, our method achieves better results and outperforms the current state-of-the-art method.


Introduction
Language is one of the most important forms of human intelligence and poetry is a concise and graceful art of human language. Across different countries, nationalities and cultures, poetry is always popular, having far-reaching influence on the development of human society.
In this work, we concentrate on automatic poetry generation. Besides the long-term goal of building artificial intelligence, research on this task could serve as an auxiliary tool to better analyse poetry and understand the internal mechanism of human writing.

In recent years, neural networks have proven to be powerful for poetry generation. Several neural models have been proposed and achieve significant improvement. However, existing models are all based on maximum likelihood estimation (MLE), which brings two substantial problems. First, MLE-based models tend to remember common patterns of the poetry corpus (Zhang et al., 2017), such as high-frequency bigrams and stop words, losing diversity and innovation in generated poetry. Moreover, based on word-level likelihood, two kinds of loss-evaluation mismatch (Wiseman and Rush, 2016) arise. One is evaluation granularity mismatch. When evaluating, human experts usually focus on the sequence level (a poem line) or the discourse level (a whole poem), while MLE optimizes word-level loss and thus fails to take a wider view of generated poems. The other is criteria mismatch. Instead of likelihood, humans usually evaluate poetry in terms of specific criteria. In this work we focus on the four main criteria (Manurung, 2003; Zhang and Lapata, 2014; Yan, 2016; Yi et al., 2017): fluency (are the lines fluent and well-formed?), coherence (is the poem as a whole coherent in meaning and theme?), meaningfulness (does the poem convey certain messages?), and overall quality (the reader's general impression of the poem). This mismatch may make the model lean towards optimizing the easier criteria, e.g., fluency, and ignore the others.
To tackle these problems, we directly model the four aforementioned human evaluation criteria and use them as explicit rewards to guide gradient updates via reinforcement learning. This is a criterion-driven training process, which motivates the model to generate poems with higher scores on these criteria. Furthermore, writing theories suggest that writing requires observing other learners (Bandura, 2001). It has also been shown that writing is supported as an activity in which writers learn from more experienced writers, such as other students, teachers, or authors (Prior, 2006). It is therefore natural to equip generators with the ability of mutual learning and communication. Inspired by this, we propose a novel mutual reinforcement learning schema (Figure 1), in which we simultaneously train two learners (generators). During the training process, each learner learns not only from the teacher (rewarder) but also from the other. We will show that this mutual learning-teaching process leads to better results.
In summary, our contributions are as follows: • To the best of our knowledge, we are the first to utilize reinforcement learning to model and optimize human evaluation criteria, in order to tackle the loss-evaluation mismatch problem in poetry generation.
• We propose a novel mutual reinforcement learning schema to further improve performance, which is transparent to model architectures. One can apply it to any poetry generation model.
• We experiment on Chinese quatrains. Both automatic and human evaluation results show that our method outperforms a strong basic method and the state-of-the-art model.

Related Work
As a desirable entry point for automatically analysing, understanding and generating literary text, research on poetry generation has lasted for decades.
Over the past twenty years, models can be categorized into two main paradigms. The first is based on statistical machine learning methods. Genetic algorithms (Manurung, 2003; Levy, 2001), Statistical Machine Translation (SMT) approaches (He et al., 2012; Jiang and Zhou, 2008) and automatic summarization approaches (Yan et al., 2013) have all been adopted to generate poetry.
More recently, the second paradigm, neural networks, has shown great advantages in this task compared to statistical models. A Recurrent Neural Network (RNN) was first used to generate Chinese quatrains by Zhang and Lapata (2014). To improve fluency and coherence, Zhang's model needs to be interpolated with extra SMT features, as shown in their paper. Focusing on coherence, some works (Yi et al., 2017; Wang et al., 2016a) use a sequence-to-sequence model with the attention mechanism (Bahdanau et al., 2015) to generate poetry. Wang et al. (2016b) design a special planning schema, which plans some sub-keywords in advance with a language model and then generates each line with the planned sub-keyword to improve coherence. Pursuing better overall quality, Yan (2016) proposes an iterative polishing schema to generate Chinese poetry, which refines the poem generated in one pass several times. Aiming at enhancing meaningfulness, Ghazvininejad et al. (2016) extend user keywords to incorporate richer semantic information. Zhang et al. (2017) combine a neural memory, which saves hundreds of human-authored poems, with a sequence-to-sequence model to improve the innovation of generated poems and achieve style transfer.
These neural structures have made some progress and improved different aspects of generated poetry. Nevertheless, as discussed in Section 1, the two essential problems caused by MLE, lack of diversity and loss-evaluation mismatch, remain challenging. Rather than further adjusting model structures, we believe a better solution is to design more reasonable optimization objectives.
Deep Reinforcement Learning (DRL) first showed its power in automatic game playing, such as Atari video games (Mnih et al., 2013) and the game of Go (Silver et al., 2016). Soon, DRL was used to play text games (Narasimhan et al., 2015; He et al., 2016) and was then applied to dialogue generation.
From the perspective of poetry education, the teacher judges student-created poems in terms of specific criteria and guides the student to overcome shortcomings, which naturally accords with the DRL process. Therefore we take advantage of DRL. We design four automatic rewarders for the criteria, which act as the teacher. Furthermore, we train two generators and make them learn from each other, which imitates the mutual learning of students, as a step towards multi-agent DRL in literary text generation.

Basic Generation Model
We apply our method to a basic poetry generation model, which is pre-trained with MLE. Therefore, we first formalize our task and introduce this model. The inputs are user topics specified by K keywords, W = {w_1, ..., w_K}. The output is a poem consisting of n lines, P = L_1, L_2, ..., L_n. Since we take a line-by-line generation process, the task can be converted to the generation of the i-th line given the previous i-1 lines L_{1:i-1} and W.
We use a GRU-based (Cho et al., 2014) encoder-decoder. Let h→_t, h←_t and s_t represent the forward encoder, backward encoder and decoder hidden states respectively. For each topic word w_k = c_1, c_2, ..., c_{T_k}, we feed its characters into the encoder and obtain the keyword representation by concatenating the last forward and backward hidden states, where [;] means concatenation. Then we get the topic representation o = f([v_1; ...; v_K]), where f defines a non-linear layer. Denote the generated i-th line in the decoder as Y = (y_1 y_2 ... y_{T_i}), where e(y_t) is the word embedding of y_t. The probability distribution of each y_t to be generated in L_i is calculated by P(y_t | y_{1:t-1}, L_{1:i-1}, W) = softmax(W s_t), where W is the projection parameter. g_{i-1} is a global history vector, which records what has been generated so far and provides global-level information for the model. It is initialized as g_0 = 0, where 0 is a vector of all zeros; once L_i is generated, g_i is updated by a convolutional layer over the line with convolution window size d. The basic model is then pre-trained by minimizing the standard MLE loss over the M training pairs, where θ is the parameter set to be trained. (For brevity, we omit biases in all equations.)
This basic model is a modified version of (Yan, 2016). The main differences are that we replace the vanilla RNN with GRU units, use convolution to calculate the line representation rather than directly using the last decoder hidden state, and remove the polishing schema to better observe the influence of DRL itself. We select this model as our basic framework since it achieves satisfactory performance and the author has conducted thorough comparisons with other models, such as (Yan et al., 2013) and (Zhang and Lapata, 2014).

Single-Learner Reinforcement Learning
Before presenting the single-learner version of our method (abbreviated as SRL), we first design corresponding automatic rewarders for the four human evaluation criteria.
Fluency Rewarder. We use a neural language model to measure fluency. Given a poem line L_i, a higher probability P_lm(L_i) indicates the line is more likely to exist in the corpus and thus may be more fluent and well-formed. However, it is inadvisable to directly use P_lm(L_i) as the reward, since an overly high probability may damage diversity and innovation. We expect moderate probabilities that fall into a reasonable range, neither too high nor too low. Therefore, we define the fluency reward of a poem P so that each line receives the maximal reward when P_lm(L_i) falls within the range [µ - δ_1·σ, µ + δ_1·σ] and a decayed reward outside it, where µ and σ are the mean value and standard deviation of P_lm calculated over all training sets, and δ_1 is a hyper-parameter that controls the range.
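The range-based idea above can be sketched as follows. The paper's exact functional form is not reproduced here, so the exponential decay outside the acceptable range is an assumption; the function and argument names are illustrative.

```python
import math

def fluency_reward(line_logprobs, mu, sigma, delta1=1.0):
    """Sketch of a range-based fluency reward: a line gets the maximal
    reward when its language-model score lies within mu +/- delta1*sigma,
    and a smoothly decaying reward outside that range (the decay shape is
    an assumption). The poem reward is the average over its lines."""
    rewards = []
    for lp in line_logprobs:
        dist = abs(lp - mu)
        if dist <= delta1 * sigma:
            rewards.append(1.0)
        else:
            # penalize scores that are too high or too low
            rewards.append(math.exp(-(dist - delta1 * sigma) / sigma))
    return sum(rewards) / len(rewards)
```

A line whose score sits exactly at the corpus mean gets reward 1.0, while an implausibly low (or suspiciously high) score is discounted rather than clipped to zero.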
Coherence Rewarder. For poetry, good coherence means each line L_i should be coherent with the previous lines in a poem. We use Mutual Information (MI) to measure the coherence of L_i and L_{1:i-1}. As shown in (Li et al., 2016a), the MI of two sentences S_1 and S_2 can be calculated by MI(S_1, S_2) = log P(S_2|S_1) - λ log P(S_2), where λ is used to regulate the weight of generic sentences. Based on this, we calculate the coherence reward of L_i as log P_seq2seq(L_i | L_{1:i-1}) - λ log P_lm(L_i), where P_seq2seq is a GRU-based sequence-to-sequence model, which takes the concatenation of the previous i-1 lines as input and predicts L_i. A better choice is to use a dynamic λ instead of a static one. Here we directly set λ = exp(-r(L_i)) + 1, which gives smaller weights to lines with extreme language model probabilities.
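The MI-based score can be sketched directly from the MMI formulation of Li et al. (2016a) and the dynamic λ above. The inputs are assumed to be log-probabilities from pre-trained models, and the averaging convention over a poem's lines is an assumption.

```python
import math

def coherence_reward(logp_cond, logp_lm, r_fluency):
    """MI-style coherence for one line: log P_seq2seq(L_i | L_{1:i-1})
    minus lambda * log P_lm(L_i), with the dynamic weight
    lambda = exp(-r(L_i)) + 1 from the text."""
    lam = math.exp(-r_fluency) + 1.0
    return logp_cond - lam * logp_lm

def poem_coherence(cond_logps, lm_logps, fluency_rewards):
    """Average the per-line MI scores over lines 2..n of the poem
    (an assumed aggregation)."""
    scores = [coherence_reward(c, l, r)
              for c, l, r in zip(cond_logps, lm_logps, fluency_rewards)]
    return sum(scores) / len(scores)
```

A line that is much more probable given its context than in isolation scores high; a generic line with a high unconditional language-model probability is penalized by the λ term.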
Meaningfulness Rewarder. In the dialogue generation task, neural models are prone to generating generic sentences such as "I don't know" (Li et al., 2016a; Serban et al., 2016). We observe similar issues in poetry generation. The basic model tends to generate common and meaningless words, such as bu zhi (don't know), he chu (where), and wu ren (no one). It is quite intractable to quantify the meaningfulness of a whole poem, but we find that the TF-IDF values of human-authored poems are significantly higher than those of generated ones (Figure 2). Consequently, we utilize TF-IDF to motivate the model to generate more meaningful words. This is a simple and rough attempt, but it makes generated poems more "meaningful" from the reader's perspective.
Direct use of TF-IDF leads to a serious out-of-vocabulary (OOV) problem and high variance, because we need to sample poems during the DRL training process, which yields many OOV words. Therefore we use another neural network to smooth TF-IDF values. In detail, the reward of a line is given by F(L_i), a neural network which takes a line as input and predicts its estimated TF-IDF value. For each line in the training sets, we calculate the standard TF-IDF values of all words and use the average as the line's TF-IDF value. We then use these values to train F(L_i) with the Huber loss.
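The per-line training target for F(L_i), the average TF-IDF of the words in a line, can be computed roughly as follows. This is a simplified sketch: document frequencies are taken over the given lines only, and the TF and IDF conventions (raw term frequency, natural log) are assumptions.

```python
import math
from collections import Counter

def line_tfidf_targets(lines):
    """Compute the regression targets used to train the smoothing
    network F: for each line (a list of word tokens), the average
    TF-IDF of its words, with each line treated as a document."""
    n = len(lines)
    df = Counter()
    for line in lines:
        df.update(set(line))  # document frequency per word
    targets = []
    for line in lines:
        tf = Counter(line)
        total = len(line)
        vals = [(tf[w] / total) * math.log(n / df[w]) for w in line]
        targets.append(sum(vals) / len(vals))
    return targets
```

A line made entirely of words that occur in every document gets a target of zero, matching the intuition that common stop-word lines are "meaningless" under this metric.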
Overall Quality Rewarder. The three kinds of rewards above are all at line level. In fact, human experts also focus on the discourse level to judge the overall quality of a poem, ignoring some minor defects. We train a neural classifier to classify a given poem (represented as the concatenation of all its lines) into three classes: computer-generated poetry (class 1), ordinary human-authored poetry (class 2) and masterpiece (class 3). We then derive the reward from the classifier's predicted class probabilities, assigning higher reward to poems placed in the higher classes. This classifier should be as reliable as possible. Due to the limited number of masterpieces, normal classifiers do not work well. Therefore we use an adversarial-training-based classifier (Miyato et al., 2017), which achieves F1 scores of 0.96, 0.73 and 0.76 for the three classes respectively on the validation set.
Based on these rewarders, the total reward is the weighted sum of the four rewards, where α_j is the weight of the j-th reward and the symbol ~ means the four rewards are re-scaled to the same magnitude. Following (Gulcehre et al., 2018), we reduce the variance by normalizing R with its running average b_u and running standard deviation σ_u, and by subtracting a learned baseline B(P), a neural network trained with the Huber loss, which takes a poem as input and predicts its estimated reward.
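The combination and variance-reduction steps can be sketched as follows. How the tilde re-scaling is realized (here, division by a per-reward scale constant) and the exact arrangement of baseline subtraction and normalization are assumptions.

```python
def total_reward(rewards, weights, scales):
    """Weighted sum of the four rewards after re-scaling each to a
    common magnitude (scales are assumed per-reward constants)."""
    return sum(a * (r / s) for a, r, s in zip(weights, rewards, scales))

def variance_reduced(R, running_mean, running_std, baseline_pred):
    """Shaped reward R'(P): subtract the running average and the learned
    baseline B(P), then normalize by the running standard deviation
    (arrangement assumed, following common reward-shaping practice)."""
    return (R - running_mean - baseline_pred) / max(running_std, 1e-8)
```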
DRL Process. For brevity, we use P_g(·|W; θ) to represent a basic generator. We use the REINFORCE algorithm (Williams, 1992) to optimize the model, minimizing the negative expected shaped reward of sampled poems (Eq. 16). Trained with Eq. (16) alone, the model easily gets lost and totally ignores the corresponding topics specified by W, leading to an explosive increase of the MLE loss. We use two steps to alleviate this issue. The first is Teacher Forcing: for each W, we estimate E(R'(P)) with n_s sampled poems, as well as the ground truth P_g, whose reward is set to max(R'(P_g), 0). The second step is to combine the MLE loss and the DRL loss (Eq. 17), where ~ means the DRL loss is re-scaled to the same magnitude as the MLE loss. Ultimately, we use Eq. (17) to fine-tune the basic model.
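The training objective can be sketched as a REINFORCE-style surrogate loss combined with the MLE loss. In a real implementation the log-probabilities would carry gradients; this sketch only evaluates the scalars, and the re-scaling factor is an assumed constant.

```python
def drl_loss(sample_logps, shaped_rewards):
    """REINFORCE surrogate over the n_s sampled poems (plus the ground
    truth with reward max(R'(P_g), 0)): the negative reward-weighted
    average of sequence log-probabilities."""
    n = len(sample_logps)
    return -sum(r * lp for r, lp in zip(shaped_rewards, sample_logps)) / n

def combined_loss(mle_loss, drl, scale):
    """Eq. (17)-style combination: MLE loss plus the DRL loss re-scaled
    to the same magnitude (scale is an assumed constant)."""
    return mle_loss + scale * drl
```

Keeping the MLE term in the loss is what prevents the generator from drifting away from the input topics while it chases higher rewards.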

Mutual Reinforcement Learning
As discussed in Sections 1 and 2, to further improve performance, we mimic the mutual learning activity in writing by simultaneously training two generators, defined as P_g(θ_1) and P_g(θ_2). The two learners (generators) learn not only from the teacher (rewarders) but also from each other.
From the perspective of machine learning, one generator may not explore the policy space sufficiently and thus easily gets stuck in a local minimum. Two generators can explore along different directions. Once one generator finds a better path (higher reward), it can communicate with the other and lead it towards this path. This process can also be considered an ensemble of different generators during the training phase. We implement Mutual Reinforcement Learning (abbreviated as MRL) with two methods.
Local MRL. The first is a simple instance-based method. For the same input, suppose P_1 and P_2 are generated by P_g(θ_1) and P_g(θ_2) respectively. If R(P_1) > R(P_2) · (1 + δ_2) and R̃_j(P_1) > R̃_j(P_2) for all j, then P_g(θ_2) uses P_1 instead of P_2 to update itself in Eq. (16), and vice versa. That is, if one learner creates a significantly better poem, the other learner will learn from it. This process gives a generator more high-reward instances and allows it to explore a larger space along a more proper direction, so as to escape from local minima.
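The exchange rule above amounts to a small selection function, sketched here. `R` maps a poem to its total reward and `R_components` to its list of per-criterion (re-scaled) rewards; both names are illustrative.

```python
def choose_update_sample(P1, P2, R, R_components, delta2=0.1):
    """Local MRL rule: generator 2 adopts generator 1's poem P1 for its
    policy-gradient update iff P1 beats P2 by a margin delta2 on the
    total reward AND wins on every individual criterion reward.
    Otherwise generator 2 keeps its own sample P2."""
    better_total = R(P1) > R(P2) * (1 + delta2)
    better_all = all(a > b
                     for a, b in zip(R_components(P1), R_components(P2)))
    return P1 if (better_total and better_all) else P2
```

Requiring a win on every criterion (not just the weighted sum) keeps a generator from copying a poem that trades one criterion off badly against another.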
Global MRL. During the training process, we need to sample poems from the generator, and hence local MRL may cause high variance. Instead of operating on instances, mutual learning can also be applied at the distribution level. We can pull the distribution of one generator towards that of the other by minimizing the KL divergence between them. We detail this method in Algorithm 1. The intuition is that if learner 1 is generally better than learner 2, that is, learner 1 achieves higher average rewards over the creation history, then learner 2 should learn directly from learner 1 rather than from individual poems. This process allows the generator to learn from a long-period history and focus on a higher level.
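The distribution-level step can be sketched as a gated KL term. The gating rule (compare running-average rewards, treat the better learner's distribution as fixed) follows the description above; the exact loss weighting in Algorithm 1 is not reproduced, so this is an assumed form over a single next-token distribution.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def global_mrl_loss(student_dist, other_dist, avg_r_student, avg_r_other):
    """Global MRL sketch: if the other learner's running-average reward
    is higher, add a KL term pulling this learner's distribution
    towards it; otherwise contribute nothing."""
    if avg_r_other > avg_r_student:
        return kl_divergence(student_dist, other_dist)
    return 0.0
```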
In practice, we combine these two methods by simultaneously communicating high-reward samples and using KL loss, which leads to the best testing rewards (Table 1).

Data and Setups
Our corpus consists of three sets: 117,392 Chinese quatrains (CQ), 10,000 Chinese regulated verses (CRV) and 10,000 Chinese iambics (CI). As mentioned, we experiment on the generation of quatrains, which are the most popular genre of Chinese poetry and account for the largest part of our corpus. From the three sets, we randomly select 10% for validation. From CQ, we select another 10% for testing. The rest are used for training. For our model and the baseline models, we run TextRank (Mihalcea and Tarau, 2004) on all training sets and extract four keywords from each quatrain. We then build four <keyword(s), poem> pairs for each quatrain using 1 to 4 keywords respectively, so as to enable the model to cope with different numbers of keywords.
For the models and rewarders, the sizes of the word embedding and hidden state are 256 and 512 respectively. The history vector size is 512 and the convolution window size d = 3. The word embedding is initialized with pre-trained word2vec vectors. We use tanh as the activation function. For other configurations of the basic model, we directly follow (Yan, 2016). P_lm and P_seq2seq are trained on the three sets. We train F(L_i) and B(P) with CQ, CRV and 120,000 generated poems. There are 9,465 masterpieces in CQ. We use these poems, together with 10,000 generated poems and 10,000 ordinary human-authored poems, to train the classifier P_cl. For training the rewarders, half of the generated poems are sampled and the other half are generated with beam search (beam size 20). For testing, all models generate poems with beam search.
A key point for MRL is to give the two pre-trained generators some diversity, which can be achieved by using different model structures or parameters. Here we simply initialize the generators differently and train one of them for more epochs.

Models for Comparisons
We compare MRL 2 (our model, with both local and global mutual learning), GT (ground truth, namely human-authored poems), Base (the basic model described in Section 3.1) and Mem (Zhang et al., 2017). The Mem model is the current state-of-the-art model for Chinese quatrain generation, which also achieves the best innovation so far.

Automatic Evaluation
Some previous models (He et al., 2012; Zhang and Lapata, 2014; Yan, 2016) adopt BLEU and perplexity as automatic evaluation metrics. Nevertheless, as discussed in Section 1, word-level likelihood or n-gram matching diverges greatly from the human evaluation manner. Therefore we dispense with them and automatically evaluate generated poems as follows.

Table 3: Human evaluation results. Diacritic ** (p < 0.01) indicates MRL significantly outperforms baselines; ++ (p < 0.01) indicates GT is significantly better than all models.

Rewarder Scores. The four rewarder scores are objective and model-irrelevant metrics which approximate the corresponding human criteria, and they can reflect poetry quality to some extent. As shown in Table 1, on each criterion, GT gets much higher rewards than all these models. Compared to Base, MRL gets closer to GT and achieves a 31% improvement on the weighted average reward. Mem outperforms Base on all criteria except meaningfulness (R_3). This is mainly because Mem generates more distinct words (Table 2), but these words tend to concentrate on the high-frequency area, resulting in an unsatisfactory TF-IDF reward. We also test different strategies of MRL. With naive single-learner RL, the improvement is limited, only 14%. With mutual RL, the improvement increases to 27%. Combining local MRL and global MRL leads to another 4% improvement. The results demonstrate that our explicit optimization (RL) is more effective than the implicit ones and that MRL gets higher scores than SRL.

Diversity and Innovation. Poetry is a kind of literary text with high requirements on diversity and innovation. Users do not expect the machine to always generate monotonous poems. We evaluate the innovation of generated poems by the distinct bigram ratio; more novel generated bigrams can somewhat reflect higher innovation. Diversity is measured by the bigram-based average Jaccard similarity of each pair of generated poems.
Intuitively, a basic requirement for innovation is that, with different inputs, the generated poems should be different from each other.
As shown in Table 2, Mem gets the highest bigram ratio, close to GT, benefiting from its specially designed structure for innovation. Our MRL achieves 43% improvement over Base, comparable to Mem. We will show later this satisfactory performance may lie in the incorporation of TF-IDF ( Figure 2). On Jaccard, MRL gets the best result due to the utilization of MI. MI brings richer context-related information which can enhance diversity as shown in (Li et al., 2016a). In fact, human-authored poems often contain strong diversity of personal emotion and experience. Therefore, despite prominent improvement, there is still a large gap between MRL and GT.
TF-IDF Distribution. As mentioned, the basic model tends to generate common and meaningless words. Consequently, we use TF-IDF as one of the rewards. Figure 2 shows the TF-IDF distributions. As we can see, Base generates poems with lower TF-IDF compared to GT, while MRL pulls the distribution towards that of GT, making the model generate more meaningful words and hence benefiting innovation and diversity.
Topic Distribution. We run LDA (Blei et al., 2003) with 20 topics on the whole corpus and then infer the topic of each generated poem. Figure 3 gives the topic distributions. Poems generated by Base center on a few topics, which again demonstrates the claim that MLE-based models tend to remember the common patterns. In contrast, human-authored poems spread over more topics. After fine-tuning with our MRL method, the topic distribution shows better diversity and balance.

Human Evaluation
From the testing set, we randomly select 80 sets of keywords to generate poems with these models. For GT, we select poems containing the given words. We thus obtain 320 quatrains (80 * 4). We invite 12 experts on Chinese poetry to evaluate these poems in terms of the four criteria (fluency, coherence, meaningfulness and overall quality), each scored on a 5-point scale ranging from 1 to 5. Since it is tiring for one person to evaluate all poems, we randomly divide the 12 experts into three groups. Each group evaluates the randomly shuffled 320 poems (80 for each expert). Then, for each model and each poem, we get 3 scores on each criterion and use the average to alleviate individual preference.

Figure 4: The learning curves of SRL and MRL. Learner (generator) 2 is pre-trained for more epochs to allow some diversity.

Table 3 gives the human evaluation results. MRL achieves better results than the other two models. Since fluency is quite easy to optimize, our method gets close to human-authored poems on fluency. The biggest gap between MRL and GT lies on meaningfulness. It is a complex criterion involving the use of words, topic, emotion expression and so on. The utilization of TF-IDF does ameliorate the use of words in terms of diversity and innovation, hence improving meaningfulness to some extent, but much remains to be done.

Further Analyses and Discussions
In this section we provide further analyses and discussion.
Learning Curve. We show the learning curves of SRL and MRL in Figure 4. As we can see, for SRL, the adequately pre-trained generator 2 always gets higher rewards than the other one during the DRL training process. With the increase of training steps, the gap between their rewards grows larger. After several hundred steps, the rewards of the two generators converge.
For MRL, generator 2 gets higher rewards at the beginning, but it is exceeded by generator 1 since generator 1 learns from it and keeps chasing. Finally, the two generators converge to higher rewards compared to SRL.
Case Study. We show some generated poems in Figure 5. The Base model generates two words, 'sunset' and 'moon' in poem (1), which appear together and thus cause the conflict of time. The word 'fishing jetty' is confusing without any necessary explanation in the context. In contrast, poem (2) describes a clearer scene and expresses some emotion: a lonely man takes a boat from morning till night and then falls asleep solitarily.
In poem (3), Mem generates some meaningful words, such as 'phoenix tree', 'wild goose' and 'friend'. However, there isn't any clue to link them together, resulting in poor coherence. On the contrary, things in poem (4) are tightly connected. For example, 'moonlight' is related to 'night'; 'rain', 'frost' and 'dew' are connected with 'cold'.
Poem (5) expresses almost nothing. The first two lines seem to talk about the change of time. But the last two lines are almost unrelated to 'time change'. Poem (6) talks about an old poet, with the description of cheap wine, poem and dream, expressing something about life and time. However, the human-authored poem (7) does much better. It seems to describe a mosquito, but in fact, it's a metaphor of the author himself.

Conclusion and Future Work
In this work, we address two substantial problems in automatic poetry generation, lack of diversity and loss-evaluation mismatch, which are caused by MLE-based neural models. To this end, we directly model the four widely used human evaluation criteria and design corresponding automatic rewarders. We use these explicit rewards to guide gradient updates by reinforcement learning. Furthermore, inspired by writing theories, we propose a novel mutual learning schema to further improve performance. Mimicking the poetry learning activity, we simultaneously train two generators, which are not only taught by the rewarders but also learn from each other. Experimental results show that our method achieves significant improvement both on automatic rewards and human evaluation scores, outperforming the current state-of-the-art model.
There is still much to do. Can we better model the meaningfulness of a whole poem? Can we quantify other intractable criteria, e.g., poeticness? Besides, we only tried two learners in this work. Would the collaboration of more learners lead to better results? How should communication among many generators be designed? We will explore these questions in the future.