Generating Classical Chinese Poems via Conditional Variational Autoencoder and Adversarial Training

It is a challenging task to automatically compose poems with not only fluent expressions but also aesthetic wording. Although much attention has been paid to this task and promising progress has been made, there exist notable gaps between automatically generated poems and those created by humans, especially in term novelty and thematic consistency. Towards filling this gap, in this paper, we propose a conditional variational autoencoder with adversarial training for classical Chinese poem generation, where the autoencoder part generates poems with novel terms and a discriminator is applied to adversarially learn their thematic consistency with their titles. Experimental results on a large poetry corpus confirm the validity and effectiveness of our model, whose automatic and human evaluation scores outperform those of existing models.


Introduction
In mastering concise, elegant wording with aesthetic rhythms in fixed patterns, the classical Chinese poem is a special cultural heritage that records personal emotions and political views, as well as documents daily and historical events. Being a fascinating art, writing poems is an attractive task for researchers in artificial intelligence (Tosa et al., 2008; Wu et al., 2009; Netzer et al., 2009; Oliveira, 2012; Yan et al., 2013, 2016a; Ghazvininejad et al., 2016, 2017; Singh et al., 2017; Xu et al., 2018), partially because poem generation and its related research could benefit other constrained natural language generation tasks. Conventionally, rule-based models (Zhou et al., 2010) and statistical machine translation (SMT) models (He et al., 2012) were proposed for this task. Recently, deep neural models have been employed to generate fluent and natural poems (Wang et al., 2016a; Yan, 2016; Zhang et al., 2017a). Although these models look promising, they are limited in many aspects; e.g., previous studies generally fail to keep thematic consistency (Wang et al., 2016c) or to improve term novelty (Zhang et al., 2017a), which are important characteristics of poems.

*Corresponding author: Rui Yan (ruiyan@pku.edu.cn)
†Work was partially done at Tencent AI Lab.
In classical Chinese poem composition, thematic consistency and term novelty are usually mutually exclusive: consistent lines may bring duplicated terms, while intriguing character choices could result in thematic diversity. On the one hand, thematic consistency is essential for poems; it is preferred that all lines concentrate on the same theme throughout a poem. Previous work mainly focused on using keywords (Wang et al., 2016c; Hopkins and Kiela, 2017) to plan a poem so that each line is generated with a specific keyword. Such a strategy is risky because the keywords are not guaranteed to be topically consistent, especially when they are generated or extracted from an inventory (Wang et al., 2016c). On the other hand, Chinese poems are generally short, with every character carefully chosen to be concise and elegant. Yet, prior poem generation models based on recurrent neural networks (RNNs) are likely to generate high-frequency characters (Zhang et al., 2017a), and the resulting poems are trivial and boring. The reason is that RNNs tend to be entrapped within local word co-occurrences and normally fail to capture global characteristics such as topic or hierarchical semantic properties (Bowman et al., 2016).
To address the aforementioned shortcomings, RNNs have been extended to autoencoders (Dai and Le, 2015) for improved sequence learning, which has proven appealing for explicitly modeling global properties such as syntax, semantics, and discourse coherence (Li et al., 2015). Moreover, boosting the autoencoder with variational inference (Kingma and Welling, 2014), known as the variational autoencoder (VAE), can generate not only consistent but also novel and fluent term sequences (Bowman et al., 2016). To generalize VAE to versatile scenarios, conditional variational autoencoders (CVAE) were proposed to supervise the generation process with certain attributes while maintaining the advantages of VAE. It has been verified in supervised dialogue generation (Serban et al., 2017; Shen et al., 2017) that CVAE can generate better responses for given dialogue contexts. Given this background, which aligns with our expectations for poem generation, it is worth applying CVAE to create poems. In the meantime, considering that modeling thematic consistency with adversarial training has proven promising in controlled text generation (Hu et al., 2017), and that models for semantic matching can be improved with an explicit discriminator (Wu et al., 2017), the same can be expected for poem generation.
In this paper, we propose a novel poem generation model (CVAE-D) that uses a CVAE to generate novel terms and a discriminator (D) to explicitly control thematic consistency through adversarial training. To the best of our knowledge, this is the first work to generate poems with the combination of CVAE and adversarial training. Experiments on a large classical Chinese poetry corpus confirm that, through encoding inputs with latent variables and explicitly measuring thematic information, the proposed model outperforms existing ones in various evaluations. Quantitative and qualitative analyses indicate that our model can generate poems with not only distinctive terms, but also themes consistent with their titles.

VAE and CVAE
In general, a VAE consists of an encoder and a decoder, which correspond to the encoding process where input x is mapped to a latent variable z, i.e., x → z, and the decoding process where the latent variable z is used to reconstruct the input x, i.e., z → x. In detail, the encoding process computes a posterior distribution over z given the input x. Similarly, the decoding process can be formulated as p_θ(x|z), representing the probability distribution of generating input x conditioned on z, where z has a regularized prior distribution p_θ(z), i.e., a standard Gaussian distribution. Herein θ represents the parameters of both the encoder and the decoder. Importantly, as presented by Kingma and Welling (2014), given large datasets and the intractable integral of the marginal likelihood p_θ(x), the true posterior p_θ(z|x) is simulated by a variational approximation q_φ(z|x) in modeling the encoding process, where φ denotes the parameters of q.

Figure 1: The overall framework of our poem generation model. Solid arrows present the generation process of each line L_i conditioned on the previous line L_{i-1} and the title T. Black dotted arrows represent the adversarial learning for thematic consistency. The red dashed arrow refers to the back-propagation from the discriminator to the CVAE.
In learning a VAE, the objective is to maximize the log-likelihood log p_θ(x) over input x. To facilitate learning, one can instead push up the variational lower bound of log p_θ(x):

L(θ, φ; x) = -KL(q_φ(z|x) ‖ p_θ(z)) + E_{q_φ(z|x)}[log p_θ(x|z)] ≤ log p_θ(x),    (1)

such that the original log p_θ(x) is also optimized.
Herein the KL-divergence term KL(·‖·) can be viewed as a regularizer that encourages the approximate posterior q_φ(z|x) to be close to the prior p_θ(z), e.g., a standard Gaussian distribution. E[·] is the reconstruction term conditioned on the approximate posterior q_φ(z|x), which reflects how well the decoding process goes. CVAE extends VAE with an extra condition c that supervises the generation process by introducing c into both the encoding and decoding distributions. The objective of CVAE is thus to maximize the reconstruction log-likelihood of the input x under the condition c. Following the derivation for VAE, the corresponding variational lower bound of log p_θ(x|c) is formulated as

L(θ, φ; x, c) = -KL(q_φ(z|x, c) ‖ p_θ(z|c)) + E_{q_φ(z|x,c)}[log p_θ(x|z, c)],    (2)

which is similar to Eq. 1 except that all terms are conditioned on c, with q_φ(z|x, c) and p_θ(z|c) referring to the conditioned approximate posterior and the conditioned prior, respectively.
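For two diagonal Gaussians, the KL-divergence regularizer between the approximate posterior and the prior has a closed form. The following sketch (illustrative only, pure Python) computes it per dimension and sums:

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ):
    the closed form of the regularizer between the approximate posterior
    q_phi(z|x, c) and the prior p_theta(z|c), summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        vq, vp = math.exp(lq), math.exp(lp)
        total += 0.5 * (lp - lq + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total
```

When both distributions coincide the divergence is zero; a unit mean shift at unit variance contributes 0.5 per dimension.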

Problem Formulation
Following the text-to-text generation paradigm (Ranzato et al., 2015; Kiddon et al., 2016; Hu et al., 2017; Ghosh et al., 2017), our task has a problem setting similar to that of conventional studies (Zhang and Lapata, 2014; Wang et al., 2016c), where a poem is generated in a line-by-line manner in which each line serves as the input for the next one, as illustrated in Figure 1. To formulate this task, we separate its input and output with the necessary notation as follows.
The INPUT of the entire model is a title, T = (e_1, e_2, ..., e_N), which serves as the theme of the target poem, where e_i refers to the i-th character's embedding and N is the length of the title. The first line L_1 is generated conditioned only on the title T; once this step is done, the model takes the previously generated line as well as the title as input at each subsequent step, until the entire poem is completed.
The overall OUTPUT is an n-line poem, formulated as (L_1, L_2, ..., L_n), where L_i = (e_i,1, e_i,2, ..., e_i,m) denotes each line in the poem, with e_i,j referring to the embedding of the character at the j-th position of the i-th line, ∀ 1 ≤ i ≤ n, 1 ≤ j ≤ m. Particularly for classical Chinese poems, there are strict patterns, which require m = 5 or m = 7, and n = 4 (the quatrain) or n = 8 (the eight-line regulated verse). Once a template is chosen, m and n are fixed. In this paper, we mainly focus on n = 4.
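The line-by-line paradigm above can be sketched as a simple loop; `generate_line` is a hypothetical model call (not part of the paper's notation) that conditions on the title and, when available, the previous line:

```python
def generate_poem(title, generate_line, n_lines=4):
    """Sketch of the line-by-line paradigm: the first line is conditioned
    on the title alone; every later line is conditioned on the title and
    the previously generated line (prev_line is None for the opening line)."""
    poem = []
    prev = None
    for _ in range(n_lines):
        line = generate_line(title, prev)
        poem.append(line)
        prev = line
    return poem
```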

The Model
As illustrated in Figure 1, our CVAE-D consists of two parts, a CVAE and a discriminator, whose details are elaborated in the following subsections.

The CVAE
The CVAE, which includes an encoder and a decoder, plays the core part in our model that generates classical Chinese poems. The encoder encodes both the title and the lines with shared parameters by a bidirectional RNN (Schuster and Paliwal, 1997) with gated recurrent units (GRU) (Chung et al., 2014). Note that we directly treat the title as the theme of each poem instead of transferring it to a few keywords. Following previous work (Kingma and Welling, 2014), we assume that the variational approximate posterior is a multivariate Gaussian with a diagonal covariance structure, q_φ(z|x, c) = N(µ, σ²I). Thus µ and σ are the key parameters to be learned, and they are computed by

[µ, log(σ²)] = W_q [x; c] + b_q,

where W_q and b_q are trainable parameters. Similarly, the prior p_θ(z|c) can be formulated as another multivariate Gaussian N(µ′, σ′²I); its parameters are then calculated by a single-layer fully-connected neural network (denoted as MLP) with the tanh(·) activation function,

[µ′, log(σ′²)] = MLP_p(c).

The decoder uses a one-layer RNN with GRU that takes [z, c] as the input to predict each line L_i. The hidden states of the GRU, (s_1, s_2, ..., s_m), are not only used to generate the reconstructed lines, but are also passed to the discriminator for learning thematic consistency.
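The posterior parameterization and sampling step can be sketched as follows (pure Python, illustrative only; the concatenation order of x and c, and the helper names, are assumptions):

```python
import math
import random

def affine(W, b, h):
    """y = W h + b with plain lists (W is a list of rows)."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
            for row, b_i in zip(W, b)]

def recognition(x, c, W_q, b_q):
    """Posterior parameters of q_phi(z|x, c) = N(mu, sigma^2 I): a single
    affine map over the concatenation [x; c] yields mu and log(sigma^2)."""
    out = affine(W_q, b_q, x + c)
    d = len(out) // 2
    return out[:d], out[d:]          # mu, log-variance

def reparameterize(mu, logvar, rnd=None):
    """z = mu + sigma * eps with eps ~ N(0, I): the reparameterization
    trick that keeps the sampling step differentiable during training."""
    rnd = rnd or random.Random(0)
    return [m + math.exp(0.5 * lv) * rnd.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]
```

The prior network MLP_p(c) follows the same pattern with a tanh activation applied to the affine output.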
The entire encoder and decoder are used throughout the training process, while only part of the encoder (objects with solid lines in Figure 2) and the decoder are applied in prediction. It is worth noting that θ and φ mentioned in §2.1 do not explicitly correspond to any particular neural network described in this section. Instead, the probabilistic process denoted by θ corresponds to the decoding and part of the encoding process, and likewise for φ, i.e., φ = {W_q, b_q}.

The Discriminator
The discriminator is introduced into our model to evaluate the thematic consistency between the input title and the generated poem lines. The loss from this discriminator is then back-propagated to the decoder of the CVAE to enhance its training. In this paper, we employ a two-step procedure. First, we compute an interaction (or matching) matrix from a generated line L_i^g and the title T, where L_i^g is the reconstructed result of L_i. Then, we utilize a convolutional neural network (CNN) to learn the matching score between L_i^g and T, where the score is interpreted as the degree of thematic consistency. Specifically, in the discriminator, we treat L_i^g and L_i as the negative and positive instances, referring to the thematically inconsistent and consistent cases, respectively.
In detail, for the first step, we use the state sequence of the decoder to represent L_i^g, i.e., L_i^g = (s_1, s_2, ..., s_m). (Following previous work using adversarial training (Goyal et al., 2016; Hu et al., 2017), we use the state sequence instead of the outputs because the discrete nature of the outputs hinders gradient calculation.) A dimension transformation is then conducted on L_i^g to align L_i^g and T:

s′_j = ReLU(W_d s_j + b_d),

where ReLU is the rectified linear unit activation function (Nair and Hinton, 2010), with trainable parameters W_d and b_d. In doing so, the dimension of each s′_j is identical to that of the character embeddings. The transformed line is then denoted as L′_i^g = (s′_1, s′_2, ..., s′_m). The interaction matrix between L′_i^g and T is then formulated as

M_g = T · (L′_i^g)ᵀ,

where M_g ∈ R^{N×m} and "·" denotes matrix multiplication.
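A minimal sketch of the projection and interaction-matrix computation (pure Python; the row-per-output-dimension layout of W_d is an assumption of this sketch):

```python
def relu_project(states, W_d, b_d):
    """Project decoder states into the character-embedding space with a
    ReLU-activated affine map, so they can interact with title embeddings."""
    return [[max(0.0, sum(w * s for w, s in zip(row, state)) + bi)
             for row, bi in zip(W_d, b_d)]
            for state in states]

def interaction_matrix(title_emb, line_states, W_d, b_d):
    """M_g[i][j] = dot(title character i, projected decoder state j),
    giving an N x m matrix of title/line interactions."""
    proj = relu_project(line_states, W_d, b_d)       # m vectors, embedding dim
    return [[sum(e * p for e, p in zip(emb, state)) for state in proj]
            for emb in title_emb]
```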
In the second step, a CNN is used to extract features from the interaction matrix. The resulting feature matrix is calculated by F = CNN(M_g). Then, we apply max-over-time pooling (Collobert et al., 2011) over F to capture the most salient information. After this operation, an MLP with one hidden layer flattens the pooled features and generates the final matching score m_g ∈ (0, 1) via a sigmoid activation function.
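The pooling-and-scoring step can be sketched as below (pure Python; for brevity a single linear scorer with assumed weights w and b stands in for the one-hidden-layer MLP):

```python
import math

def matching_score(feature_maps, w, b):
    """Max-over-time pooling over each CNN feature map, followed by a
    sigmoid scorer producing a value in (0, 1). `feature_maps` is a list
    of per-filter activation sequences."""
    pooled = [max(fm) for fm in feature_maps]   # most salient activation
    logit = sum(wi * pi for wi, pi in zip(w, pooled)) + b
    return 1.0 / (1.0 + math.exp(-logit))
```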
In addition to m_g, the matching score m_t between the positive sample L_i and T is computed by a procedure similar to the above, except for the dimension transformation, because the character embeddings of both the title T and L_i share the same dimension. Finally, following the routine of generative adversarial networks (GAN) (Goodfellow et al., 2014), the discriminator is trained to distinguish generated lines from ground-truth lines according to the matching scores m_g and m_t, with the objective

L_D = log m_t + log(1 - m_g),

which the discriminator maximizes. Note that the discriminator is only applied during the training process, where the parameters of the encoder and decoder are enhanced by the feedback of the discriminator.
(Note that, different from L_i^g, L_i is represented directly by its sequence of character embeddings, because the discriminator is only connected to the decoder and L_i does not go through it. Otherwise, if the encoder states were passed to the discriminator, the loss would be back-propagated to the encoder and disturb CVAE training accordingly.)

Training the Model
The overall objective of CVAE-D is to minimize

L_CVAE-D = L_CVAE + λ L_D

with respect to the parameters of the CVAE, where L_CVAE is the loss of the CVAE, corresponding to -L(θ, φ; x, c). In doing so, L_D is maximized with regard to the parameters of the discriminator, so that generated poems become thematically consistent and able to confuse the discriminator. Herein λ is a balancing parameter. We train the CVAE and the discriminator alternately in a two-step adversarial fashion, similar to what was done in Zhang et al. (2017c). This training strategy is repeated until L_CVAE-D converges.
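The alternating schedule can be sketched as a simple training loop; `cvae_step` and `disc_step` are hypothetical closures (not part of the paper) that each perform one parameter update and return their current loss:

```python
def train_cvae_d(cvae_step, disc_step, n_iters, lam=0.1):
    """Two-step alternating adversarial schedule: the discriminator is
    updated first, then the CVAE is updated against the combined
    objective L_CVAE + lam * L_D. Returns the combined-loss history,
    which can be monitored for convergence."""
    history = []
    for _ in range(n_iters):
        l_d = disc_step()        # update D, pushing L_D up
        l_cvae = cvae_step()     # update CVAE on the combined objective
        history.append(l_cvae + lam * l_d)
    return history
```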

Datasets
To learn our poem generation model, we collect two corpora for experiments: a collection of classical Chinese poems from the Tang dynasty (PTD) and another from the Song dynasty (PSD). Statistics of the two corpora are reported in Table 1. Note that for classical Chinese poems, the dominant genres are the quatrain and the eight-line regulated verse, with either 5 or 7 characters per line. As a result, our model targets generating poems within these two genres, especially the quatrain. All titles of poems are treated as their themes. We randomly choose 1,000 and 2,000 poems for validation and test, respectively, with the remaining poems for training.

Baselines
In addition to our CVAE-D, several highly related and strong methods are used as baselines in our experiments, including: S2S, the conventional sequence-to-sequence model (Sutskever et al., 2014), which has proven successful in neural machine translation (NMT) and other text generation tasks.
AS2S and its extensions Key-AS2S and Mem-AS2S, where AS2S is the S2S model integrated with an attention mechanism (Bahdanau et al., 2014). Key-AS2S and Mem-AS2S are AS2S with keyword planning (Wang et al., 2016c) and a memory module (Zhang et al., 2017a), respectively. In particular, they are dedicated models designed for Chinese poem generation. GAN, a basic implementation of generative adversarial networks (Goodfellow et al., 2014) for this task on top of S2S. This baseline is added to investigate the effect of introducing a discriminator into simple structures other than CVAE.

Table 2: Descriptions of the human evaluation criteria.

Criterion   | Description
Consistency | Whether a poem displays a consistent theme.
Fluency     | Whether a poem is grammatically well-formed.
Meaning     | How meaningful the content of a poem is.
Poeticness  | Whether a poem has the attributes of poetry.
Overall     | Average score of the above four criteria.
CVAE and its extension CVAE-Key, where the former is the conventional CVAE model and the latter refers to the combination of CVAE and keyword planning. The CVAE baseline is used to investigate how poem generation performs with only a CVAE, while CVAE-Key provides a comparison to our model with a different technique for thematic control.

Model Settings
All baselines and CVAE-D are trained with the following hyper-parameters. The dimension of character embeddings is set to 300 for the most frequent 10,000 characters in our vocabulary. The hidden state sizes of the GRU encoder and decoder are set to 500. All trainable parameters, e.g., W_q and W_d, are initialized from a uniform distribution over [-0.08, 0.08]. We set the mini-batch size to 80 and employ Adam (Kingma and Ba, 2014) for optimization. We utilize gradient clipping (Pascanu et al., 2013) to avoid gradient explosion, with the clipping value set to 5.
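The clipping strategy referenced above (Pascanu et al., 2013) rescales gradients by their global norm; a minimal sketch:

```python
import math

def clip_by_global_norm(grads, max_norm=5.0):
    """Global-norm gradient clipping: if the joint norm of all gradient
    values exceeds `max_norm`, rescale them so the norm equals `max_norm`;
    otherwise leave them untouched. `grads` is a list of gradient vectors."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm <= max_norm or norm == 0.0:
        return grads
    scale = max_norm / norm
    return [[g * scale for g in vec] for vec in grads]
```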
In addition to the shared hyper-parameters, we have particular settings for CVAE-D. The layer size of MLP_p is set to 400. The dimension of the latent variable z is set to 300. For the CNN used in the discriminator, the kernel size is set to (5, 5), with the stride set to 2. We follow the conventional setting (Hu et al., 2017; Creswell et al., 2017) and set the balancing parameter λ to 0.1.

Table 3: Results of automatic and human evaluations. BLEU-1 and BLEU-2 are BLEU scores on unigrams and bigrams (p < 0.01); Sim refers to the similarity score; Dist-n corresponds to the distinctness of n-grams, with n = 1 to 4; Con., Flu., Mea., Poe., and Ovr. represent consistency, fluency, meaning, poeticness, and overall, respectively.

Evaluation Metrics
To comprehensively evaluate the generated poems, we employ the following metrics: BLEU: The BLEU score (Papineni et al., 2002) is an effective metric, widely used in machine translation, for measuring word overlap between ground-truth and generated sentences. In poem generation, BLEU has also been utilized as a metric in previous studies (Zhang and Lapata, 2014; Wang et al., 2016a; Yan, 2016; Wang et al., 2016b). We follow their settings in this paper.
Similarity: For thematic consistency, it is challenging to evaluate different models automatically. We adopt the embedding-average metric to score sentence-level similarity, as applied in Wieting et al. (2015). In this paper, we accumulate the embeddings of all characters from a generated poem and those from the given title, and use the cosine between the two accumulated embeddings as the similarity score.
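The embedding-average similarity described above amounts to summing character embeddings on each side and taking a cosine; a minimal sketch:

```python
import math

def embedding_average_similarity(poem_embs, title_embs):
    """Accumulate (sum) the character embeddings of the poem and of the
    title, then return the cosine of the two accumulated vectors."""
    p = [sum(col) for col in zip(*poem_embs)]
    t = [sum(col) for col in zip(*title_embs)]
    dot = sum(a * b for a, b in zip(p, t))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
    return dot / norm
```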
Distinctness: As an important characteristic, poems use novel and unique characters to maintain their elegance and delicacy. Similar to the metric proposed for dialogue systems (Li et al., 2016), this evaluation measures character diversity by calculating the proportion of distinct [1,4]-grams in the generated poems, where the final distinctness values are normalized to [0, 100].
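The distinct-n computation can be sketched as below (pure Python), treating each line as a sequence of characters:

```python
def distinct_n(lines, n):
    """Proportion of distinct n-grams among all n-grams in the generated
    lines, normalized to [0, 100]."""
    ngrams = [tuple(line[i:i + n])
              for line in lines
              for i in range(len(line) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * len(set(ngrams)) / len(ngrams)
```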
Human Evaluation: Since writing poems is a complicated task, there always exist discrepancies between automatic metrics and human judgments. Hence, we conduct human evaluation to assess the performance of different models. In doing so, each poem is assessed by five annotators who are well educated and have expertise in Chinese poetry. The evaluation is conducted in a blind-review manner, where annotators have no information about which generation method each poem belongs to. Following previous work (He et al., 2012; Zhang and Lapata, 2014; Wang et al., 2016c; Zhang et al., 2017a), we evaluate generated poems on four criteria, namely, consistency, fluency, meaning, and poeticness. Each criterion is rated from 1 to 3, representing bad, normal, and good, respectively. The details are illustrated in Table 2.

Table 3 reports the results of both automatic and human evaluations. We analyze the results from the following aspects.

The effect of CVAE
This study investigates whether using latent variables and variational inference can improve the diversity and novelty of terms in generated poems. There are two main observations. CVAE significantly improves term novelty. As illustrated in Table 3, CVAE significantly outperforms all baselines in terms of distinctness. With diversified terms, the aesthetics scores also confirm that CVAE can generate poems that correspond to better user experiences. Although Mem-AS2S achieves a rather high distinctness score, it requires a more complicated structure for learning and generating poems. The results confirm the effectiveness of CVAE in addressing the term-duplication issue that occurs with RNNs.
CVAE cannot control the thematic consistency of generated poems. Recall that thematic consistency and term diversity are usually mutually exclusive; CVAE produces the worst result in thematic consistency, as confirmed in Table 3 by the similarity score in the automatic evaluation and the consistency score in the human evaluation.

The Influence of the Discriminator
As previously stated, introducing a discriminator with adversarial training is expected to bring a positive effect on thematic consistency. We investigate the influence of the discriminator with two groups of comparisons, i.e., CVAE-D vs. CVAE and GAN vs. S2S. The following observations are made in this investigation, which confirm that adversarial learning is an effective add-on to existing models for thematic control, without affecting other aspects.
The discriminator effectively enhances poem generation with thematic information. When the discriminator is introduced, the CVAE and S2S models become capable of generating thematically consistent poems, as illustrated by the similarity and meaning scores in Table 3. The BLEU results also confirm that the discriminator improves the overlap between generated poems and the ground truth, which serves as a thematically consistent reference.
The extra discriminator does not affect base models on irrelevant merits. For any base model, e.g., S2S or CVAE, adding a discriminator is expected to help thematic consistency while limiting inferior effects on other evaluations. This is confirmed by the results; e.g., in distinctness, CVAE-D and GAN are comparable to CVAE and S2S, respectively.

The Performance of CVAE-D
Overall, the CVAE-D model substantially outperforms all other models on all metrics. Especially for term novelty and thematic consistency, CVAE-D strikes an extraordinary balance between them, with observable improvements on both sides. This balance is mainly attributable to the proposed framework that seamlessly integrates the CVAE and the discriminator. Beyond the automatic and human evaluation scores, this is also supported by the training losses of KL(q_φ(z|x, c) ‖ p_θ(z|c)) and L_D shown in Figure 4, where 1) the KL-divergence of CVAE-D has a trend analogous to that of CVAE, indicating that the CVAE part in CVAE-D is trained as well as an independent CVAE; and 2) the discriminator captures the distinction in thematic consistency between the generated lines and the ground-truth lines at a very early stage of training.

Qualitative Analysis
In addition to evaluating CVAE-D with quantitative results, we also conduct case studies to illustrate its superiority. Figure 5 gives an example of a CVAE-D generated poem, which well demonstrates the capability of our model. The entire poem elegantly expresses a strong theme of "missing my love". It is clearly shown that the chosen characters, such as 庭 (yard), 枝 (branch), 花 (flower), and 红 (red), match the given title to a certain extent, with none used repetitively. To further investigate how different models perform on thematic consistency, we visualize the correspondence between the generated poems (the first two lines) and the given title with heatmaps in Figure 6, where Figure 6(a) and Figure 6(b) illustrate the results yielded by CVAE and CVAE-D, respectively. The overall color in Figure 6(a) is clearly lighter than that in Figure 6(b), which may indicate that most of the characters generated by CVAE are not generated with thematic attention over the given title. In contrast, CVAE-D presents darker grid colors for all related characters, which further reveals its effectiveness in improving the thematic consistency of a poem with respect to its title. We also observe inferior cases generated by our model. A notable pattern is that some fine-grained attributes, e.g., sentiment and emotion, are not well aligned across lines, where some lines may deliver a different mood from others. Since our model does not explicitly control such attributes, one potential solution is to introduce additional features to model such information, which requires a special design to adjust the current model. We also notice a few extraordinarily bad cases whose basic characteristics, such as wording and fluency, are unacceptable.
This phenomenon is observed randomly with no clear pattern, which could be explained by the complexity of the model and the fragile nature of adversarial training (Goodfellow et al., 2014). Careful parameter settings and considerate module assembly could mitigate this problem, leading to potential future work on designing more robust frameworks.

Related Work
Deep Generative Models. This work can be seen as an extension of research on deep generative models (Salakhutdinov and Hinton, 2009), where most previous work, including VAE and CVAE, focused on image generation (Sohn et al., 2015; Yan et al., 2016b). Since GAN (Goodfellow et al., 2014) is also a successful generative model, some studies have tried to integrate VAE and GAN (Larsen et al., 2016). In natural language processing, many recent deep generative models have been applied to dialogue systems (Serban et al., 2017; Shen et al., 2017) and text generation (Hu et al., 2017; Yu et al., 2017; Zhang et al., 2017b; Guo et al., 2018). To the best of our knowledge, this work is the first to integrate CVAE and adversarial training with a discriminator for text generation, especially in a particular text genre, poetry.
Automatic Poem Generation. In terms of methodology, previous approaches can be roughly classified into three categories: 1) rule- and template-based methods (Tosa et al., 2008; Wu et al., 2009; Netzer et al., 2009; Zhou et al., 2010; Oliveira, 2012; Yan et al., 2013); 2) SMT approaches (Jiang and Zhou, 2008; Greene et al., 2010; He et al., 2012); and 3) deep neural models (Zhang and Lapata, 2014; Wang et al., 2016b; Yan, 2016). Compared to rule-based and SMT models, neural models are able to learn more complicated representations and generate smoother poems, and most recent studies follow this paradigm. For example, Wang et al. (2016c) proposed a modified encoder-decoder model with keyword planning; Zhang et al. (2017a) adopted memory-augmented RNNs to dynamically choose each term from the RNN output or a reserved inventory. To improve thematic consistency, CVAE has also been combined with keyword planning. Compared to these approaches, ours offers an alternative way for poem generation that produces novel terms and consistent themes via an integrated framework, without requiring specially designed modules or post-processing steps.

Conclusions
In this paper, we proposed an effective approach that integrates CVAE and adversarial training for classical Chinese poem generation. Specifically, we used a CVAE to generate each line of a poem with novel and diverse terms. A discriminator was then applied with adversarial training to explicitly control thematic consistency. Experiments conducted on a large Chinese poetry corpus illustrated that, through the proposed architecture with the CVAE and the discriminator, substantial improvements were observed in our generated poems over those from existing models. Further qualitative studies on given examples and brief error analyses also confirmed the validity and effectiveness of our proposed approach.