Rhetorically Controlled Encoder-Decoder for Modern Chinese Poetry Generation

Rhetoric is a vital element in modern poetry, and plays an essential role in improving its aesthetics. However, to date, it has not been considered in research on automatic poetry generation. In this paper, we propose a rhetorically controlled encoder-decoder for modern Chinese poetry generation. Our model relies on a continuous latent variable as a rhetoric controller to capture various rhetorical patterns in an encoder, and then incorporates rhetoric-based mixtures while generating modern Chinese poetry. For metaphor and personification, an automated evaluation shows that our model outperforms state-of-the-art baselines by a substantial margin, while human evaluation shows that our model generates better poems than baseline methods in terms of fluency, coherence, meaningfulness, and rhetorical aesthetics.


Introduction
Modern Chinese poetry, which originated around 1900, is one of the most important literary formats in Chinese culture and has had a profound influence on the development of modern Chinese culture. Rhetoric is a vital element of modern poetry and plays an important role in enhancing its aesthetics. Intentional rhetorical embellishment is essential to achieving the desired stylistic qualities of impassioned modern Chinese poetry. In particular, metaphor and personification, two frequently used forms of rhetoric, can enrich the emotional impact of a poem. Specifically, a metaphor is a figure of speech that describes one concept in terms of another. Within this paper, the term "metaphor" is considered in the sense of the general figure of speech 比喻 (bi yu), encompassing both metaphor in its narrower sense and similes.* Personification is a figure of speech in which a thing, an idea, or an animal is given human attributes; i.e., nonhuman objects are portrayed in such a way that we feel they have the ability to act like human beings. For example, 她笑起来像花儿一样 ('She smiles like lovely flowers'), with its connection between smiling and flowers, highlights extraordinary beauty and pureness in describing the verb 'smile'. 夜空中的星星眨着眼睛 ('Stars in the night sky squinting') serves as an example of personification: the stars are described as squinting, which is normally considered an act of humans, but here is invoked to more vividly describe twinkling stars. Rhetoric encompasses a variety of forms, including metaphor, personification, exaggeration, and parallelism. For our work, we collected more than 8,000 Chinese poems and over 50,000 Chinese song lyrics.

* The work was done when Zuohui Fu and Jie Cao were interns at Pattern Recognition Center, WeChat AI, Tencent Inc.
Based on the statistics given in Table 1, we observe that metaphor and personification are the most frequently used rhetorical styles in modern Chinese poetry and lyrics (see Section 4.1 for details about this data). Hence, we mainly focus on the generation of metaphor and personification in this work. As an example, an excerpt from the modern Chinese poem 独自 (Alone) is given in Figure 1, where the fourth sentence (highlighted in blue) invokes a metaphorical simile, while the second one (highlighted in red) contains a personification. In recent years, neural generation models have become widespread in natural language processing (NLP), e.g., for response generation in dialogue (Le et al., 2018), answer or question generation in question answering, and headline generation in news systems. At the same time, poetry generation is of growing interest and has attained high levels of quality for classical Chinese poetry. Previous research on Chinese poem composition mainly targeted traditional Chinese poems; given their mostly short sentences and metrical constraints, most attention was devoted to term selection to improve thematic consistency (Wang et al., 2016).
In contrast, modern Chinese poetry is more flexible and rich in rhetoric. Unlike sentiment-controlled or topic-based text generation methods (Ghazvininejad et al., 2016), which have been widely used in poetry generation, existing research has largely disregarded the importance of rhetoric in poetry generation. Yet, to emulate human-written modern Chinese poems, it appears necessary to consider not only the topics but also the form of expression, especially with regard to rhetoric. In this paper, we propose a novel rhetorically controlled encoder-decoder framework, inspired by the above sentiment-controlled and topic-based text generation methods, which can effectively generate poetry with metaphor and personification.
Overall, the contributions of the paper are as follows: • We present the first work to generate modern Chinese poetry while controlling for the use of metaphor and personification, which play an essential role in enhancing the aesthetics of poetry.
• We propose a novel metaphor and personification generation model with a rhetorically controlled encoder-decoder.
• We conduct extensive experiments showing that our model outperforms the state-of-the-art in both automated and human evaluations.
Related Work

Poetry Generation
Poetry generation is a challenging task in NLP. Traditional methods (Gervás, 2001; Manurung, 2004; Greene et al., 2010; He et al., 2012) relied on grammar templates and custom semantic diagrams. In recent years, deep learning-driven methods have shown significant success in poetry generation, and topic-based poetry generation systems have been introduced (Ghazvininejad et al., 2017, 2018; Yi et al., 2018b). In particular, Zhang and Lapata (2014) propose to generate Chinese quatrains with Recurrent Neural Networks (RNNs), while Wang et al. (2016) obtain improved results by relying on a planning model for Chinese poetry generation.
Recently, Memory Networks (Sukhbaatar et al., 2015) and Neural Turing Machines (Graves et al., 2014) have proven successful at certain tasks. The most relevant work for poetry generation is that of Zhang et al. (2017), which stores hundreds of human-authored poems in a static external memory to improve the generated quatrains and achieve a style transfer. The above models rely on an external memory to hold training data (i.e., external poems and articles). In contrast, Yi et al. (2018a) dynamically invoke a memory component by saving the writing history into memory.

Stylistic Language Generation
The ability to produce diverse sentences in different styles under the same topics is an important characteristic of human writing. Several works have explored style control mechanisms for text generation tasks. For example, Zhou and Wang (2018) use naturally labeled emojis for large-scale emotional response generation in dialogue. Ke et al. (2018), among others, propose a sentence function controlling mechanism to generate interrogative, imperative, or declarative responses in dialogue. For the task of poetry generation, prior work introduces unsupervised style labeling to generate stylistic poetry based on mutual information. Inspired by the above works, we regard rhetoric in poetry as a specific style and adopt a Conditional Variational Autoencoder (CVAE) model to generate rhetoric-aware poems.
CVAEs (Sohn et al., 2015; Larsen et al., 2016) extend the traditional VAE model (Kingma and Welling, 2014) with an additional conditioning label to guide the generation process. Whereas VAEs directly store latent attributes as probability distributions, CVAEs model latent variables conditioned on random variables. Recent research in dialogue generation shows that language generated by VAE models benefits from significantly greater diversity in comparison with traditional Seq2Seq models. Recently, CVAEs and adversarial training have also been explored for the task of generating classical Chinese poems.

Methodology
In this paper, our goal is to leverage metaphor and personification (known as rhetoric modes) in modern Chinese poetry generation using a dedicated rhetoric control mechanism.

Overview
Before presenting our model, we first formalize our generation task. The inputs are poetry topics specified by K user-provided keywords {w_k}_{k=1}^{K}. The desired output is a poem consisting of n lines {L_i}_{i=1}^{n}. Since we adopt a sequence-to-sequence framework and generate a poem line by line, the task can be cast as repeated text generation: the i-th line should be coherent in meaning and related to the topics, given the previous i − 1 lines L_{1:i−1} and the topic keywords w_{1:K}. In order to control the rhetoric mode, the rhetoric label r may be provided either as an input from the user, or from an automatic prediction based on the context. Hence, the task of poetry line generation can be formalized as estimating

P(L_i | L_{1:i−1}, w_{1:K}, r).

As mentioned above, incorporating rhetoric into poetic sentences requires controlling for the rhetoric mode and memorizing contextual topic information. To this end, we first propose two conditional variational autoencoder models to effectively control when to generate rhetorical sentences, and which rhetoric mode to use. The first model is the Manual Control CVAE (MCCVAE). It receives the user's input signal as a rhetoric label r to generate the current sentence in the poem, and is designed for user-controllable poetry generation tasks. The second model is the Automatic Control CVAE (ACCVAE), which automatically predicts when to apply appropriate forms of rhetoric and generates the current sentence based on contextual information.
Subsequently, to memorize pertinent topic information and generate more coherent rhetorical sentences, we propose a topic memory component to store contextual topic information. At the same time, we propose a rhetorically controlled decoder to generate appropriate rhetorical sentences. This is a mechanism to learn the latent rhetorical distribution given a context and a word, and then perform a rhetorically controlled term selection during the decoding stage. Our proposed framework will later be presented in more detail in Figure 2.

Seq2seq Baseline
Our model is based on the sequence-to-sequence (Seq2Seq) framework, which has been widely used in text generation. The encoder transforms the current input text X = {x_1, x_2, ..., x_J} into a hidden representation H = {h_1, h_2, ..., h_J}:

h_j = LSTM(h_{j−1}, e(x_j)),

where LSTM is a Long Short-Term Memory network, and e(x_j) denotes the embedding of the word x_j. The decoder first updates the hidden state S = {s_1, s_2, ..., s_T}, and then generates the next sequence Y = {y_1, y_2, ..., y_T}:

s_t = LSTM(s_{t−1}, e(y_{t−1})),
y_t ∼ P(y_t | y_{1:t−1}, X) = softmax(W s_t),

where this second LSTM does not share parameters with the encoder's network.
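To make the recurrences above concrete, the following NumPy sketch implements a single LSTM cell and the encoder pass h_j = LSTM(h_{j−1}, e(x_j)). The dimensions, random initialization, and single-cell implementation are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; the four gates are computed jointly as a (4*d,) vector."""
    d = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c_new = f * c + i * g          # new cell state
    h_new = o * np.tanh(c_new)     # new hidden state h_j
    return h_new, c_new

def encode(embeddings, W, U, b, d):
    """Run the encoder over word embeddings e(x_1)..e(x_J); return H = {h_j}."""
    h, c = np.zeros(d), np.zeros(d)
    H = []
    for e_x in embeddings:         # h_j = LSTM(h_{j-1}, e(x_j))
        h, c = lstm_step(e_x, h, c, W, U, b)
        H.append(h)
    return np.stack(H)

# Toy dimensions: embedding size 8, hidden size 4, sentence length 5.
rng = np.random.default_rng(0)
d_e, d_h, J = 8, 4, 5
W = rng.normal(size=(4 * d_h, d_e)) * 0.1
U = rng.normal(size=(4 * d_h, d_h)) * 0.1
b = np.zeros(4 * d_h)
H = encode(rng.normal(size=(J, d_e)), W, U, b, d_h)
```

The decoder side uses the same cell with its own parameters, consuming e(y_{t−1}) instead of e(x_j).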

Proposed Models
In the following, we will describe our models for rhetorically controlled generation.

Manual Control (MC) CVAE
We introduce a Conditional Variational Autoencoder (CVAE) for the task of poetry generation. Mathematically, the CVAE is trained by maximizing a variational lower bound on the conditional likelihood of Y given c:

log p(Y | c) ≥ E_{q(z|Y,c)} [log p(Y | z, c)] − KL(q(z | Y, c) ∥ p(z | c)),

where z, c, and Y are random variables, and the latent variable z is used to encode the semantics and rhetoric of the generated sentence. In our manual control model, the conditional variable that captures the input information is c = [h_X; e(r)], where e(r) is the embedding of the rhetorical variable r and h_X is the encoding of the current poem sentences X. The target Y represents the next sentence to be generated.
Then, on top of the traditional Seq2seq model, we introduce a prior network, a recognition network, and a decoder: (i) the prior network p_P(z | c) approximates p(z | c); (ii) the recognition network q_R(z | Y, c) approximates the true posterior p(z | Y, c); (iii) the decoder p_D(Y | z, c) generates the target sentence. The loss −log p(Y | c) can then be upper-bounded as:

L(θ_D, θ_P, θ_R; Y, c) = KL(q_R(z | Y, c) ∥ p_P(z | c)) − E_{q_R(z|Y,c)} [log p_D(Y | z, c)],   (5)

where θ_D, θ_P, θ_R are the parameters of the decoder, prior network, and recognition network, respectively. Intuitively, the second term maximizes the sentence generation probability after sampling from the recognition network, while the first term minimizes the distance between the prior and the recognition network. We assume that both the prior and the recognition network are multivariate Gaussians with diagonal covariance, whose mean and log variance are estimated through multilayer perceptrons (MLPs):

[μ, log σ²] = MLP_P(c),   [μ′, log σ′²] = MLP_R(Y, c).   (6)

A single LSTM layer is used to encode the current lines and obtain the h_X component of c.
The same LSTM structure is also used to encode the next line Y in the training stage. Using Eq. (6), we calculate the KL divergence between these distributions to optimize Eq. (5). Following the practice of Zhao et al. (2017), a reparameterization technique is used when sampling from the recognition network during training and from the prior network during testing.
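The Gaussian parameterization, the reparameterized sampling, and the KL term can be sketched as follows. The MLP shapes, toy dimensions, and random weights are illustrative assumptions; the KL expression is the standard closed form for two diagonal Gaussians.

```python
import numpy as np

def mlp_gaussian(x, W1, b1, W2, b2):
    """One-hidden-layer MLP emitting [mu, log_var] of a diagonal Gaussian."""
    h = np.tanh(W1 @ x + b1)
    out = W2 @ h + b2
    d = out.shape[0] // 2
    return out[:d], out[d:]        # mu, log sigma^2

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable."""
    eps = rng.standard_normal(mu.shape[0])
    return mu + np.exp(0.5 * log_var) * eps

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """Closed-form KL(q || p) between two diagonal Gaussians."""
    return 0.5 * np.sum(
        log_var_p - log_var_q
        + (np.exp(log_var_q) + (mu_q - mu_p) ** 2) / np.exp(log_var_p)
        - 1.0
    )

rng = np.random.default_rng(1)
d_c, d_z = 6, 4                    # toy sizes for c = [h_X; e(r)] and latent z
W1 = rng.normal(size=(8, d_c)) * 0.1; b1 = np.zeros(8)
W2 = rng.normal(size=(2 * d_z, 8)) * 0.1; b2 = np.zeros(2 * d_z)
c = rng.normal(size=d_c)
mu_p, log_var_p = mlp_gaussian(c, W1, b1, W2, b2)   # prior network p_P(z | c)
z = reparameterize(mu_p, log_var_p, rng)            # latent sample for decoding
```

At training time the recognition network would produce (μ′, log σ′²) from [Y; c] with its own MLP of the same shape, and the KL term of Eq. (5) is kl_diag_gaussians(μ′, log σ′², μ, log σ²).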

Automatic Control (AC) CVAE
In the ACCVAE model, we first predict the rhetorical mode of the next sentence using an MLP:

p(r | X) = softmax(MLP(h_X)),   (7)

where h_X is taken as the last hidden state of the encoder LSTM. In this case, the conditional variable is again c = [h_X; e(r)]. The loss function then augments Eq. (5) with the cross-entropy loss of the rhetoric predictor:

L′ = L(θ_D, θ_P, θ_R; Y, c) − log p(r | X).

In this paper, a two-layer MLP is used for Eq. (7).
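A minimal sketch of such a rhetoric predictor is shown below. The three-way label set mirrors the annotation scheme used for our classifier data (metaphor, personification, other); the hidden size and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict_rhetoric(h_X, W1, b1, W2, b2):
    """Two-layer MLP over the encoder's last hidden state h_X, giving a
    distribution over rhetoric modes (Eq. (7))."""
    return softmax(W2 @ np.tanh(W1 @ h_X + b1) + b2)

MODES = ["metaphor", "personification", "other"]
rng = np.random.default_rng(2)
d_h = 8                                         # toy encoder hidden size
W1 = rng.normal(size=(16, d_h)) * 0.1; b1 = np.zeros(16)
W2 = rng.normal(size=(len(MODES), 16)) * 0.1; b2 = np.zeros(len(MODES))
p_r = predict_rhetoric(rng.normal(size=d_h), W1, b1, W2, b2)
r = MODES[int(np.argmax(p_r))]                  # predicted label for the next line
```

The predicted label r is then embedded as e(r) and concatenated with h_X to form the conditional variable c.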

Topic Memory Component
As shown above, LSTMs are used to encode the lines of the poem. Since Memory Networks (Sukhbaatar et al., 2015) have demonstrated great power in capturing long temporal dependencies, we incorporate a memory component into the decoding stage. Equipped with a larger memory capacity, the model is able to retain temporally distant information from the writing history, with the memory providing a RAM-like mechanism to support model execution. In our poetry generation model, we rely on a dedicated topic memory component to memorize both the topic and the generation history, which are of great help in generating appropriately rhetorical and semantically consistent sentences. As illustrated in Figure 2, our topic memory is a matrix M ∈ R^{K′×d_h}, where each row is a memory slot of size d_h and the number of slots is K′. Before generating the i-th line L_i, the topic words w_k from the user and the input text are written into the topic memory in advance, and the memory remains unchanged during the generation of a sentence.
Memory Reading. We introduce an addressing function α = A(M, q), which calculates the probability of each memory slot being selected and invoked:

g_k = σ(M_k, q, b),
α_k = exp(g_k) / Σ_{j=1}^{K′} exp(g_j),

where σ denotes a non-linear layer, q is the query vector, b is its parameter, M is the memory to be addressed, M_k is the k-th slot (row) of M, and α_k is the k-th element of α. For the topic memory component, the query is q = [s_{t−1}; c; z], so the topic memory is read as follows:

α = A(M, q),
o_t = Σ_{k=1}^{K′} α_k M_k,

where α is the reading probability vector, s_{t−1} is the decoder hidden state at step t−1, and o_t is the memory output at the t-th step.
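The addressing function and memory read can be sketched as follows. The particular scoring network (a tanh layer followed by a dot product with a vector v) is one plausible instantiation of the non-linear layer σ, not necessarily the paper's exact choice; all dimensions are toy values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def address(M, q, W, b, v):
    """Addressing function A(M, q): score each slot M_k against query q with
    a non-linear layer, then normalize scores into reading probabilities."""
    scores = np.array(
        [v @ np.tanh(W @ np.concatenate([M_k, q]) + b) for M_k in M]
    )
    return softmax(scores)

def read_memory(M, q, W, b, v):
    """Memory output o_t = sum_k alpha_k * M_k (expectation over slots)."""
    alpha = address(M, q, W, b, v)
    return alpha, alpha @ M

rng = np.random.default_rng(3)
K_slots, d_h, d_q = 4, 6, 10       # toy sizes; q = [s_{t-1}; c; z] in the model
M = rng.normal(size=(K_slots, d_h))
W = rng.normal(size=(8, d_h + d_q)) * 0.1
b = np.zeros(8)
v = rng.normal(size=8) * 0.1
alpha, o_t = read_memory(M, rng.normal(size=d_q), W, b, v)
```

Because the read is a convex combination of slots, o_t stays in the span of the stored topic vectors, which is what lets the decoder stay anchored to the user's topics.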

Rhetorically Controlled Decoder
A general Seq2seq model tends to emit generic and meaningless sentences. In order to create poems with more meaningful and diverse rhetoric, we propose a rhetorically controlled decoder. It assumes that each word in a poem sentence has a latent type designating it as either a content word or a rhetorical word. The decoder calculates a word type distribution over the latent types given the context, and computes type-specific generation distributions over the entire vocabulary. The final probability of generating a word is a mixture of the type-specific generation distributions, where the coefficients are the type probabilities:

P(y_t | s_t, o_t, z, c) = Σ_{τ_t} P(y_t | τ_t, s_t, z, c) · P(τ_t | s_t, z, c),

where τ_t denotes the word type at time step t. This specifies that the final generation probability is a mixture of the type-specific generation probabilities P(y_t | τ_t, s_t, z, c), weighted by the type distribution P(τ_t | s_t, z, c). We refer to this decoder as a rhetorically controlled decoder. The probability distribution over word types is given by

P(τ_t | s_t, z, c) = softmax(W s_t),

where s_t is the hidden state of the decoder at time step t and W ∈ R^{k×d}, with k word types and hidden dimension d. The word type distribution predictor is trained jointly with the decoder. The type-specific generation distributions are given by

P(y_t | τ_t = content, s_t, z, c) = softmax(W_content s_t),
P(y_t | τ_t = rhetoric, s_t, z, c) = softmax(W_rhetoric s_t),

where W_content, W_rhetoric ∈ R^{|V|×d} and |V| is the size of the vocabulary. Note that the type-specific generation distributions are parameterized by separate matrices, so the distribution for each word type has its own parameters. Instead of using a single distribution, our rhetorically controlled decoder applies multiple type-specific generation distributions, which enables the model to convey more information about the potential word to be generated. Also note that all generation distributions are over the same vocabulary.
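The mixture computation can be sketched as follows, with two word types (content and rhetorical) and toy dimensions. For simplicity the sketch conditions the distributions only on s_t, whereas the model also conditions on z and c.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mixture_decode(s_t, W_type, W_content, W_rhetoric):
    """P(y_t) = sum_tau P(y_t | tau) * P(tau): a mixture of two type-specific
    softmaxes over the same vocabulary."""
    p_type = softmax(W_type @ s_t)        # P(tau | s_t): [content, rhetoric]
    p_content = softmax(W_content @ s_t)  # P(y_t | tau = content)
    p_rhetoric = softmax(W_rhetoric @ s_t)  # P(y_t | tau = rhetoric)
    return p_type[0] * p_content + p_type[1] * p_rhetoric

rng = np.random.default_rng(4)
d, V = 6, 20                              # toy hidden size and vocabulary size
W_type = rng.normal(size=(2, d)) * 0.1
W_content = rng.normal(size=(V, d)) * 0.1
W_rhetoric = rng.normal(size=(V, d)) * 0.1
p_y = mixture_decode(rng.normal(size=d), W_type, W_content, W_rhetoric)
```

Since both components are normalized distributions and the type weights sum to one, the mixture is itself a valid distribution over the vocabulary; raising P(τ = rhetoric) shifts probability mass toward the rhetorical word matrix without ever zeroing out content words.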

Overall Loss Function
The CVAE and the Seq2seq model with the rhetorically controlled decoder are trained jointly. The overall loss L is therefore a linear combination of the KL term L_KL, the cross-entropy classification loss of the rhetoric predictor L_predictorCE, the cross-entropy generation loss of the rhetorically controlled decoder L_decoderCE, and the cross-entropy loss of the word type classifier (word type distribution predictor) L_wordclassifierCE:

L = L_KL + L_decoderCE + γ · L_predictorCE + L_wordclassifierCE,

where γ is set to 0 if the Manual Control CVAE is used, and to 1 otherwise. The technique of KL cost annealing is applied to address the optimization challenge of vanishing latent variables in this encoder-decoder architecture.
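The combination can be sketched as a small helper. The linear annealing schedule is one common choice for KL cost annealing and is an assumption here, as the schedule is not specified above.

```python
def overall_loss(kl, decoder_ce, predictor_ce, word_type_ce,
                 step, anneal_steps=10000, manual_control=False):
    """Combine the four loss terms. The KL weight is linearly annealed from
    0 to 1 (assumed schedule), and gamma switches the rhetoric-predictor
    loss off for the Manual Control CVAE."""
    kl_weight = min(1.0, step / anneal_steps)   # KL cost annealing
    gamma = 0.0 if manual_control else 1.0
    return kl_weight * kl + decoder_ce + gamma * predictor_ce + word_type_ce
```

Early in training the KL weight is near zero, so the decoder first learns to reconstruct; the latent variable is pressured toward the prior only as the weight ramps up.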

Datasets and Setups
We conduct all experiments on two datasets. One is a modern Chinese poetry dataset, while the other is a modern Chinese lyrics dataset. We collected the modern Chinese poetry dataset from an online poetry website and crawled about 100,000 Chinese song lyrics from a small set of online music websites. Sentence-level rhetoric labels are required for training our model, so we built a classifier to predict the rhetoric label automatically. We sampled about 15,000 sentences from the original poetry dataset and manually annotated them with three categories: metaphor, personification, and other. This dataset was divided into a training set, a validation set, and a test set. Three classifiers, namely an LSTM, a Bi-LSTM, and a Bi-LSTM with self-attention, were trained on this dataset. The Bi-LSTM with self-attention (Yang et al., 2016) outperforms the other models and achieves the best accuracy of 0.83 on the test set. In this classifier, the word embedding size, hidden state size, and attention size are set to 128, 256, and 30, respectively, and a two-layer LSTM is used. The results for the different classes are given in Table 2. Additionally, we selected a large number of poem sentences with metaphor and personification to collect the corresponding rhetorical words. Based on word count and part-of-speech statistics, we obtained over 500 popular words associated with metaphor and personification as rhetorical words. Our statistics show that these words cover a wide range of metaphorical and anthropomorphic features.
Meanwhile, in our full model, the word embedding, rhetoric label embedding, and hidden state sizes are all set to 128. The dimensionality of the latent variable is 256, and a single-layer decoder is used. The word embeddings are initialized with word2vec vectors pre-trained on the whole corpus.

Models for Comparisons
We compare our model against previous state-of-the-art poetry generation models: • Seq2Seq: A sequence-to-sequence generation model, as has been successfully applied to text generation and neural machine translation (Vinyals and Le, 2015).
• HRED: A hierarchical encoder-decoder model for text generation (Serban et al., 2016), which employs a hierarchical RNN to model the sentences at both the sentence level and the context level.
• WM: A recent Working Memory model for poetry generation (Yi et al., 2018b).
• CVAE: A standard CVAE model without the specific decoder. We adopt the same architecture as that introduced in Zhao et al. (2017).

Evaluation Design
In order to obtain objective and realistic evaluation results, we rely on a combination of automated and human evaluation. Automated Evaluation. To measure the effectiveness of the models automatically, we adopt several metrics widely used in existing studies. BLEU scores and Perplexity are used to quantify how well the models fit the data. The Rhetoric-F1 score measures the accuracy of rhetorical control in the generated poem sentences: if the rhetoric label of a generated sentence is consistent with the ground truth, the result is counted as correct, and as incorrect otherwise. The rhetoric label of each poem sentence is predicted by the rhetoric classifier mentioned above (see Section 4.1 for details about this classifier). Distinct-1/Distinct-2 is used to evaluate the diversity of the generated poems.
Human Evaluation. Following previous work (Yi et al., 2018b), we consider four criteria for human evaluation: • Fluency: Whether the generated poem is grammatically correct and fluent.
• Coherence: Whether the generated poem is coherent with the topics and contexts.
• Meaningfulness: Whether the generated poem contains meaningful information.
• Rhetorical Aesthetics: Whether the generated rhetorical poem has some poetic and artistic beauty.
Each criterion is scored on a 5-point scale ranging from 1 to 5. To build a test set for human evaluation, we randomly select 200 sets of topic words to generate poems with the models. We invite 10 experts to provide scores according to the above criteria, and the average score for each criterion is computed.

Evaluation Results
The results of the automated evaluation are given in Table 3. Our MC model obtains a higher BLEU score and lower perplexity than other baselines on the poetry dataset, which suggests that the model is on a par with other models in generating grammatical sentences. Note that our AC model obtains higher Distinct-1 and Distinct-2 scores because it tends to generate more diverse and informative results.
In terms of rhetoric generation accuracy, our model outperforms all baselines and achieves the best Rhetoric-F1 score of 0.67 on the poetry dataset, which suggests that our model controls rhetoric generation substantially more effectively. The other baselines score low because they possess no direct way to control for rhetoric; instead, they attempt to learn it automatically from the data, but do not succeed particularly well at this. Table 4 provides the results of the human evaluation. We observe that on both datasets, our method achieves the best results in terms of the Meaningfulness and Rhetorical Aesthetics metrics. Additionally, we find that the WM model has higher scores on the Coherence metric over the two datasets, indicating that the memory component has an important effect on coherence and topical relevance. The CVAE model obtains the best results in terms of the Fluency metric, which shows that this model can generate more fluent sentences, but it lacks coherence and meaningfulness. Overall, our model generates better poems than the other baselines in terms of fluency, coherence, meaningfulness, and rhetorical aesthetics. In particular, these results show that a rhetorically controlled encoder-decoder can generate reasonable metaphor and personification in poems. Table 5 presents example poems generated by our model, which also clearly show that our model can control rhetoric-specific generation. In Case 1, our model is able to follow the topics 恋爱;脸面 (love; face) and the metaphor label when generating the sentence 你的眼神像心灵的花朵一样绽放 ('Your eyes blossom like flowers in my heart'). In Case 2, given the personification signal, our model generates the personification word 走来 ('walk over').

Case Study
As an additional case study, we also randomly select a set of topic words {青 春 Youth, 爱 情 Love, 岁月 Years} and present three five-line poems generated by Seq2Seq, WM, and our model, respectively, with the same topics and automatically controlled rhetoric. All the poems generated by the different models according to the same topic words are presented in Figures 3 and 4. The poem generated by our model is more diverse and aesthetically pleasing with its use of metaphor and personification, while the two other poems focus more on the topical relevance.

Conclusion and Future Work
In this paper, we propose a rhetorically controlled encoder-decoder for modern Chinese poetry generation. Our model utilizes a continuous latent variable to capture various rhetorical patterns that govern the expected rhetorical modes and introduces rhetoric-based mixtures for generation. Experiments show that our model outperforms state-of-the-art approaches and that our model can effectively generate poetry with convincing metaphor and personification.
In the future, we will investigate the possibility of incorporating additional forms of rhetoric, such as parallelism and exaggeration, to further enhance the model and generate more diverse poems.