Automatic Poetry Generation from Prosaic Text

In the last few years, a number of successful approaches have emerged that are able to adequately model various aspects of natural language. In particular, language models based on neural networks have improved the state of the art with regard to predictive language modeling, while topic models are successful at capturing clear-cut semantic dimensions. In this paper, we explore how these approaches can be adapted and combined to model the linguistic and literary aspects needed for poetry generation. The system is exclusively trained on standard, non-poetic text, and its output is constrained in order to confer a poetic character on the generated verse. The framework is applied to the generation of poems in both English and French, and is evaluated for both languages. Even though it only uses standard, non-poetic text as input, the system yields state-of-the-art results for poetry generation.


Introduction
Automatic poetry generation is a challenging task for a computational system. For a poem to be meaningful, both linguistic and literary aspects need to be taken into account. First of all, a poetry generation system needs to properly model language phenomena, such as syntactic well-formedness and topical coherence. Furthermore, the system needs to incorporate various constraints (such as form and rhyme) that are related to a particular poetic genre. And finally, the system needs to exhibit a certain amount of literary creativity, which makes the poem interesting and worthwhile to read.
In recent years, a number of fruitful NLP approaches have emerged that are able to adequately model various aspects of natural language. In particular, neural network language models have improved the state of the art in language modeling, while topic models are successful at capturing clear-cut semantic dimensions. In this paper, we explore how these approaches can be adapted and combined in order to model both the linguistic and literary aspects that are required for poetry generation. More specifically, we make use of recurrent neural networks in an encoder-decoder configuration. The encoder first constructs a representation of an entire sentence by sequentially incorporating each word of the sentence into a fixed-size hidden state vector. The final representation is then given to the decoder, which emits a sequence of words according to a probability distribution derived from the hidden state of the input sentence. By training the network to predict the next sentence with the current sentence as input, the network learns to generate plain text with a certain discourse coherence. By modifying the probability distribution yielded by the decoder, we enforce the incorporation of poetic constraints, such that the network can be exploited for the generation of poetic verse. It is important to note that the poetry system is not trained on poetic texts; rather, the system is trained on a corpus of standard, prosaic texts extracted from the web, and it is the constraints applied to the network's probability distribution that confer a poetic character on the generated verse.
The rest of this article is structured as follows. In section 2, we present an overview of related work on automatic poetry generation. Section 3 describes the different components of our model. In section 4, we present an extensive human evaluation of our model, as well as a number of examples generated by the system. Section 5, then, concludes and discusses some future research directions.

Related work
Early computational implementations that go beyond mere mechanical creativity have often relied on rule-based or template-based methods. One of the first examples is the ASPERA system (Gervás, 2001) for Spanish, which relies on a complex knowledge base, a set of rules, and case-based reasoning. Other approaches include Manurung et al. (2012), which combines rule-based generation with genetic algorithms, Gonçalo Oliveira (2012)'s PoeTryMe generation system, which relies on chart generation and various optimization strategies, and Veale (2013), which exploits metaphorical expressions using a pattern-based approach.
Whereas poetry generation with rule-based and template-based models has an inherent tendency to be somewhat rigid in structure, advances in statistical methods for language generation have opened up new avenues for a more varied and heterogeneous approach to creative language generation. Greene et al. (2010), for example, use an n-gram language model in combination with a rhythmic model implemented with finite-state transducers. And more recently, recurrent neural networks (RNNs) have been exploited for poetry generation; Zhang and Lapata (2014) use an encoder-decoder RNN for Chinese poetry generation, in which one RNN builds up a hidden representation of the current line in a poem, and another RNN predicts the next line word by word, based on the hidden representation of the current line. The system is trained on a corpus of Chinese poems. Yan (2016) tries to improve upon the encoder-decoder approach by incorporating a method of iterative improvement: the network constructs a candidate poem in each iteration, and the representation of the former iteration is used in the creation of the next one. And Wang et al. (2016) extend the method using an attention mechanism. Ghazvininejad et al. (2016) combine RNNs (for syntactic fluency) with distributional similarity (for the modeling of semantic coherence) and finite state automata (for imposing literary constraints such as meter and rhyme). Their system, Hafez, is able to produce well-formed poems with a reasonable degree of semantic coherence, based on a user-defined topic. Hopkins and Kiela (2017) focus on rhythmic verse; they combine an RNN, trained on a phonetic representation of poems, with a cascade of weighted finite state transducers. Lau et al. (2018) present a joint neural network model for the generation of sonnets, called Deep-speare, that incorporates the training of rhyme and rhythm into the neural network; the network learns iambic stress patterns from data, while rhyming word pairs are separated from non-rhyming ones using a margin-based loss. And a number of recent papers extend neural poetry generation for Chinese with various improvements, such as unsupervised style disentanglement (Yang et al., 2018), reinforcement learning (Yi et al., 2018), and rhetorical control (Liu et al., 2019).
Note that all existing statistical models are trained on or otherwise make use of a corpus of poetry; to our knowledge, our system is the first to generate poetry with a model that is exclusively trained on a generic corpus, so that the poetic character of the output is not learned from poetic training data but comes from the constraints imposed on the model. Secondly, we make use of a latent semantic model in order to model topical coherence, which is also novel.

Neural architecture
The core of the poetry system is a neural network architecture, trained to predict the next sentence S_{i+1} given the current sentence S_i. The architecture is made up of gated recurrent units (GRUs; Cho et al., 2014) that are linked together in an encoder-decoder setup. The encoder sequentially reads in each word w_1, ..., w_N of sentence S_i (each represented by its word embedding x) such that, at each time step t, a hidden state ĥ_t is computed based on the current word's embedding x_t and the previous time step's hidden state ĥ_{t−1}. For each time step, the hidden state ĥ_t is computed according to the following equations:

z_t = σ(W_z x_t + U_z ĥ_{t−1})    (1)
r_t = σ(W_r x_t + U_r ĥ_{t−1})    (2)
h̄_t = tanh(W x_t + U (r_t ⊙ ĥ_{t−1}))    (3)
ĥ_t = (1 − z_t) ⊙ ĥ_{t−1} + z_t ⊙ h̄_t    (4)

where σ is the sigmoid function, z_t and r_t are the update and reset gates, h̄_t is the candidate hidden state, and ⊙ denotes pointwise multiplication. The final representation ĥ_N is then given to the decoder, which generates the next sentence S_{i+1} word by word; at each time step, the decoder computes its hidden state based on the embedding of the previously emitted word (i.e., the word generated in the previous time step) and the previous time step's hidden state h_{t−1} (the first hidden state of the decoder is initialized by ĥ_N and the first word is a symbolic start token). The computations for each time step h_t of the decoder are equal to the ones used in the encoder (equations 1 to 4).
In order to fully exploit the entire sequence of representations yielded by the encoder, we augment the base architecture with an attention mechanism, known as general attention (Luong et al., 2015). The attention mechanism allows the decoder to consult the entire set of hidden states computed by the encoder; at each time step, i.e. for the generation of each word in sentence S_{i+1}, the decoder determines which words in sentence S_i are relevant, and accordingly selects a linear combination of the entire set of hidden states. In order to do so, we first compute an attention vector a_t, which attributes a weight to each hidden state ĥ_i yielded by the encoder, based on the decoder's current hidden state h_t, according to equation 5:

a_t(i) = exp(score(h_t, ĥ_i)) / Σ_{i′} exp(score(h_t, ĥ_{i′}))    (5)

where

score(h_t, ĥ_i) = h_t⊤ W_a ĥ_i    (6)

The next step is to compute a global context vector c_t, which is a weighted average (based on attention vector a_t) of all of the encoder's hidden states:

c_t = Σ_i a_t(i) ĥ_i    (7)

The resulting context vector is then combined with the original decoder hidden state in order to compute a new, attention-enhanced hidden state h̃_t:

h̃_t = tanh(W_c [c_t ; h_t])    (8)
where [· ; ·] represents vector concatenation. Finally, this resulting hidden state h̃_t is transformed into a probability distribution p(w_t | w_{<t}, S_i) over the entire vocabulary using a softmax layer:

p(w_t | w_{<t}, S_i) = softmax(W_s h̃_t)    (9)

where W_s is the output projection matrix.
As an objective function, the sum of the log-probabilities of the next sentence is optimized, conditioned on the hidden state representation of the current sentence.
At inference time, for the generation of a verse, each word is then sampled randomly according to the output probability distribution. Crucially, the decoder is trained to predict the next sentence in reverse, such that the last word of the verse is the first one that is generated. This reverse operation is important for an effective incorporation of rhyme, as will be explained in the next section. A graphical representation of the architecture, which includes the constraints discussed below, is given in Figure 1.
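To make the architecture concrete, the following is a minimal PyTorch sketch of a GRU encoder-decoder with general attention along the lines of equations 1 to 9. It is an illustrative reconstruction, not the authors' implementation (which relies on OpenNMT and additionally shares encoder, decoder, and output embeddings); the class and variable names are ours, and the default sizes simply mirror those reported in the implementational details below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqAttention(nn.Module):
    """Minimal GRU encoder-decoder with general (Luong) attention."""

    def __init__(self, vocab_size, emb_size=512, hidden_size=2048, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.encoder = nn.GRU(emb_size, hidden_size, num_layers, batch_first=True)
        self.decoder = nn.GRU(emb_size, hidden_size, num_layers, batch_first=True)
        self.W_a = nn.Linear(hidden_size, hidden_size, bias=False)      # score, eq. 6
        self.W_c = nn.Linear(2 * hidden_size, hidden_size, bias=False)  # attention mix, eq. 8
        self.W_s = nn.Linear(hidden_size, vocab_size)                   # output projection, eq. 9

    def forward(self, src, tgt):
        # src: current sentence S_i; tgt: next sentence S_{i+1}, fed to the decoder in reverse
        enc_states, enc_final = self.encoder(self.embedding(src))       # (B, N, H)
        dec_states, _ = self.decoder(self.embedding(tgt), enc_final)    # (B, M, H)
        scores = dec_states @ self.W_a(enc_states).transpose(1, 2)      # (B, M, N), eqs. 5-6
        attn = F.softmax(scores, dim=-1)
        context = attn @ enc_states                                     # (B, M, H), eq. 7
        h_tilde = torch.tanh(self.W_c(torch.cat([context, dec_states], dim=-1)))  # eq. 8
        return F.log_softmax(self.W_s(h_tilde), dim=-1)                 # log p(w_t | w_<t, S_i)
```

Training minimizes the negative log-likelihood of the reversed next sentence given the current one; at inference time, words are sampled from the output distribution, last word first.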

Poetic constraints as a priori distributions
As the neural architecture described above is trained on generic text, its output will in no way resemble poetic verse. In order to endow the generated output with a certain poetic character, we modify the neural network's output probability distribution through the application of a prior probability distribution that constrains the standard output distribution and boosts the probability of words that are a good fit within the defined constraints. We consider two kinds of constraints: a rhyme constraint and a topical constraint.

Rhyme constraint
In order to adequately model the rhyme constraint, we make use of a phonetic representation of words, extracted from the online dictionary Wiktionary (www.wiktionary.org). The next step then consists in creating a probability distribution for a particular rhyme sound that we want the verse to adhere to:

p_rhyme(w) = 1/Z if w ∈ R, and ε/Z otherwise    (10)

where R is the set of words that contain the required rhyme sound, ε is a small value close to zero, used for numerical stability, and Z is a normalization factor in order to ensure a probability distribution. We can now use p_rhyme(w) as a prior probability distribution in order to reweight the neural network's standard output probability distribution, according to Equation 11, each time the rhyme scheme demands it:

p̃(w | w_{<t}, S_i) ∝ p(w | w_{<t}, S_i) ⊙ p_rhyme(w)    (11)

where ⊙ represents pointwise multiplication.² As we noted before, each verse is generated in reverse; the reweighting of rhyme words is applied at the first step of the decoding process, and the rhyme word is generated first. This prevents the generation of an ill-chosen rhyme word that does not fit well with the rest of the verse.
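As an illustration, a small numerical sketch of the two distributions above; the set of rhyme words is assumed to have been extracted beforehand from the Wiktionary phonetic transcriptions, and the vocabulary and probabilities are toy values.

```python
import numpy as np

def rhyme_prior(vocab, rhyme_words, eps=1e-8):
    """Equation 10: probability mass on words containing the required rhyme sound."""
    p = np.array([1.0 if w in rhyme_words else eps for w in vocab])
    return p / p.sum()

def reweight(p_model, p_prior):
    """Equation 11: pointwise product of the model distribution and a prior, renormalized."""
    p = p_model * p_prior
    return p / p.sum()

# toy example: force the first generated (i.e. final) word of the verse to rhyme in -ight
vocab = ["night", "light", "tree", "sky", "bright"]
p_model = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
p_constrained = reweight(p_model, rhyme_prior(vocab, {"night", "light", "bright"}))
```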

Topical constraint
For the modeling of topical coherence, we make use of a latent semantic model based on non-negative matrix factorization (NMF; Lee and Seung, 2001). Previous research has shown that non-negative factorization methods are able to induce clear-cut, interpretable topical dimensions (Murphy et al., 2012). As input to the method, we construct a frequency matrix A, which captures co-occurrence frequencies of vocabulary words and context words.³ This matrix is then factorized into two non-negative matrices W and H,

A_{i×j} ≈ W_{i×k} H_{k×j}    (12)

where k is much smaller than i, j, so that both instances and features are expressed in terms of a few components. Non-negative matrix factorization enforces the constraint that all three matrices must be non-negative, so all elements must be greater than or equal to zero. Using the minimization of the Kullback-Leibler divergence as an objective function, we want to find the matrices W and H for which the divergence between A and WH (the multiplication of W and H) is the smallest. The factorization is carried out through the iterative application of the update rules in 13 and 14:

H_{aj} ← H_{aj} · (Σ_i W_{ia} A_{ij} / (WH)_{ij}) / (Σ_i W_{ia})    (13)

W_{ia} ← W_{ia} · (Σ_j H_{aj} A_{ij} / (WH)_{ij}) / (Σ_j H_{aj})    (14)

Matrices W and H are randomly initialized, and the rules in 13 and 14 are iteratively applied, alternating between them. In each iteration, each vector is adequately normalized, so that all dimension values sum to 1.
² Such a multiplicative combination of probability distributions is also known as a Product of Experts (Hinton, 2002).
³ The raw frequencies are weighted using pointwise mutual information (Turney and Pantel, 2010).
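The following NumPy sketch shows the corresponding multiplicative updates (equations 13 and 14); the initialization, number of iterations, and the final column normalization of W into p(w|k) are illustrative choices rather than the exact training regime used in the paper.

```python
import numpy as np

def nmf_kl(A, k, n_iter=200, eps=1e-10):
    """NMF under Kullback-Leibler divergence with the multiplicative updates
    of Lee and Seung (2001), cf. equations 13 and 14."""
    n_rows, n_cols = A.shape
    rng = np.random.default_rng(0)
    W = rng.random((n_rows, k))
    H = rng.random((k, n_cols))
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (A / WH)) / (W.sum(axis=0)[:, None] + eps)   # eq. 13
        WH = W @ H + eps
        W *= ((A / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)   # eq. 14
    W /= W.sum(axis=0, keepdims=True) + eps   # columns of W interpreted as p(w|k)
    return W, H
```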
Tables 2 and 3 present a number of example dimensions induced by the model, for both English and French. The factorization that comes out of the NMF model can be interpreted probabilistically (Gaussier and Goutte, 2005; Ding et al., 2008): matrix W can be considered as p(w|k), i.e. the probability of a word given a latent dimension k. In order to constrain the network's output to a certain topic, it would be straightforward to simply use p(w|k) as another prior probability distribution applied to each output. Initial experiments, however, indicated that such a blind modification of the output probability distribution for every word of the output sequence is detrimental to syntactic fluency. In order to combine syntactic fluency with topical consistency, we therefore condition the weighting of the output probability distribution on the entropy of that distribution: when the output distribution's entropy is low, the neural network is certain of its choice for the next word in order to generate a well-formed sentence, so we do not change it. On the other hand, when the entropy is high, we modify the distribution by using the topical distribution p(w|k) for a particular latent dimension as prior probability distribution (analogous to Equation 11) in order to inject the desired topic. The entropy threshold, above which the modified distribution is used, is set experimentally.
Note that the rhyme constraint and the topical constraint can straightforwardly be combined in order to generate a topical rhyme word, through pointwise multiplication of the three relevant distributions, and subsequent normalization in order to ensure a probability distribution.
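A sketch of this entropy-gated combination is given below; p_topic corresponds to a column of W (i.e. p(w|k)), p_rhyme to the prior of equation 10, and the threshold value of 2.70 is the one reported in the implementational details.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return float(-np.sum(p * np.log(p + eps)))

def constrained_distribution(p_model, p_topic, p_rhyme=None, threshold=2.70):
    """Apply the topical prior only when the network is uncertain (high entropy);
    the rhyme prior, when given (i.e. at the first decoding step), is always applied."""
    p = p_model.copy()
    if entropy(p) > threshold:
        p = p * p_topic
    if p_rhyme is not None:
        p = p * p_rhyme
    return p / p.sum()
```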

A global optimization framework
The generation of a verse is embedded within a global optimization framework. There are two reasons to integrate the generation of a verse within an optimization procedure. First of all, the generation of a verse is a sampling process, which is subject to chance. The optimization framework allows us to choose the best sample according to the constraints presented above. Secondly, the optimization allows us to define a number of additional criteria that assist in the selection of the best verse. For each final verse, the model generates a considerable number of candidates; each candidate verse is then scored according to the following criteria:
• the log-probability score of the generated verse, according to the encoder-decoder architecture (section 3.1);
• compliance with the rhyme constraint (section 3.2.1); additionally, the extraction of the preceding group of consonants (cf. Table 1) allows us to give a higher score to rhyme words with disparate preceding consonant groups, which elicits more interesting rhymes;
• compliance with the topical constraint (section 3.2.2); the score is modeled as the sum of the probabilities of all words for the defined dimension;
• the optimal number of syllables, modeled as a Gaussian distribution with mean µ and standard deviation σ;⁴
• the log-probability score of a standard n-gram model.
The score for each criterion is normalized to the interval [0, 1] using min-max normalization, and the harmonic mean of all scores is taken as the final score for each candidate.⁵ After generation of a predefined number of candidates, we keep the candidate with the highest score, and append it to the poem.

⁴ We also experimented with rhythmic constraints based on meter and stress, but initial experiments indicated that the system had a tendency to output very rigid verse. Simple syllable counting tends to yield more interesting variation.
⁵ The harmonic mean of n scores s_1, ..., s_n is computed as n / (1/s_1 + ... + 1/s_n); we choose this measure in order to balance the different scores.
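A minimal sketch of the selection step is given below; the individual criteria (encoder-decoder log-probability, rhyme and topic compliance, syllable count, n-gram score) are assumed to be available as scoring functions, and only the min-max normalization and harmonic-mean combination are shown.

```python
import numpy as np

def minmax(scores):
    """Normalize the raw scores of one criterion to the interval [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.5 for s in scores]

def harmonic_mean(values, eps=1e-12):
    return len(values) / sum(1.0 / (v + eps) for v in values)

def select_best(candidates, criteria):
    """candidates: list of candidate verses; criteria: list of functions verse -> raw score.
    Each criterion is normalized across candidates, and the candidate with the
    highest harmonic mean of normalized scores is kept."""
    normalized = [minmax([criterion(v) for v in candidates]) for criterion in criteria]
    totals = [harmonic_mean([col[i] for col in normalized]) for i in range(len(candidates))]
    return candidates[int(np.argmax(totals))]
```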
Results and evaluation

Implementational details
We train two different models for the generation of poetry in both English and French. The neural architecture is trained on a large corpus of generic web texts, constructed on the basis of the CommonCrawl corpus (commoncrawl.org). In order to filter out noise and retain clean, orderly training data, we apply the following filtering steps:
• we only keep sentences written in the relevant language;
• we only keep sentences of up to 20 words;
• we only keep sentences that contain at least one function word from a predefined list; the idea again is to filter out noisy sentences, and only keep well-formed, grammatical ones; we create a list of about 10 highly frequent function words, specific to each language;
• of all the sentences that remain after these filtering steps, we only keep the ones that appear successively within a document.
Using the filtering steps laid out above, we construct a training corpus of 500 million words for each language.We employ a vocabulary of 15K words (those with highest frequency throughout the corpus); less frequent words are replaced by an <unk> token, the probability of which is set to zero during generation.
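A minimal sketch of these filtering heuristics is given below; language identification is assumed to have been applied beforehand, the requirement that retained sentences appear successively within a document is handled at the document level and omitted here, and the function word lists are hypothetical examples rather than the exact lists used.

```python
def keep_sentence(sentence, function_words, max_len=20):
    """Sentence-level filtering heuristics: length cap and function word check."""
    tokens = sentence.split()
    if len(tokens) > max_len:
        return False
    if not any(t.lower() in function_words for t in tokens):
        return False
    return True

# hypothetical function word lists (about 10 highly frequent words per language)
EN_FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "is", "that", "it", "for"}
FR_FUNCTION_WORDS = {"le", "la", "les", "de", "et", "un", "une", "que", "est", "dans"}
```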
Both encoder and decoder are made up of two GRU layers with a hidden state of size 2048, and the word embeddings are of size 512. Encoder, decoder, and output embeddings are all shared (Press and Wolf, 2017). Model parameters are optimized using stochastic gradient descent with an initial learning rate of 0.2, which is divided by 4 when the loss no longer improves on a held-out validation set. We use a batch size of 64, and we apply gradient clipping. The neural architecture has been implemented using PyTorch (Paszke et al., 2017), with substantial reliance on the OpenNMT module (Klein et al., 2017). For the application of the topical constraint, we use an entropy threshold of 2.70.
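For illustration, a bare-bones training step with the hyperparameters above; the Seq2SeqAttention class refers to the earlier sketch, and the gradient clipping threshold, padding index, and omission of the learning rate schedule are simplifying assumptions.

```python
import torch

model = Seq2SeqAttention(vocab_size=15000)      # cf. the sketch in section 3.1
optimizer = torch.optim.SGD(model.parameters(), lr=0.2)
criterion = torch.nn.NLLLoss(ignore_index=0)    # assuming index 0 is the padding token

def train_step(src, tgt_in, tgt_out, clip=5.0):
    """One SGD step on a batch: src is S_i; tgt_in / tgt_out are the shifted,
    reversed versions of S_{i+1} (decoder input and gold output)."""
    optimizer.zero_grad()
    log_probs = model(src, tgt_in)               # (batch, len, vocab)
    loss = criterion(log_probs.transpose(1, 2), tgt_out)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
    return loss.item()
```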
The n-gram model is a standard Kneser-Ney smoothed trigram model implemented using KenLM (Heafield, 2011), and the NMF model is factorized to 100 dimensions. Both the n-gram model and the NMF model are trained on a large, 10 billion word corpus, likewise constructed from web texts, without any filtering steps except for language identification. For syllable length, we use µ = 12 and σ = 2.
We generate about 2000 candidates for each verse, according to a fixed rhyme scheme (ABAB CDCD). Note that no human selection whatsoever has been applied to the poems used in the evaluation; all poems have been generated in a single run, without cherry-picking the best examples. Four representative examples of poems generated by the system are given in Figure 2.
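To show how the rhyme scheme drives generation, the following is a hypothetical outline of the per-poem loop; generate_candidates, select_best, and rhyme_sound_for stand in for the components described in the previous sections and are assumptions of this sketch, not part of the released implementation.

```python
RHYME_SCHEME = "ABABCDCD"

def generate_poem(generate_candidates, select_best, rhyme_sound_for,
                  topic_dim, n_candidates=2000):
    """For each verse, sample candidates under the rhyme sound dictated by the
    scheme and keep the best one; the first occurrence of a rhyme letter is
    unconstrained and fixes the rhyme sound for later occurrences."""
    poem, rhyme_sounds = [], {}
    for letter in RHYME_SCHEME:
        sound = rhyme_sounds.get(letter)          # None for the first occurrence
        candidates = generate_candidates(poem, sound, topic_dim, n_candidates)
        best = select_best(candidates)
        if sound is None:
            rhyme_sounds[letter] = rhyme_sound_for(best.split()[-1])
        poem.append(best)
    return poem
```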

Evaluation procedure
Quantitatively evaluating creativity is far from straightforward, and this is no less true for creative artefacts that are automatically generated. Automatic evaluation measures that compute the overlap of system output with gold reference texts (such as BLEU or ROUGE), and which might be used for the evaluation of standard generation tasks, are of little use when it comes to creative language generation. The majority of research into creative language generation therefore makes use of some form of human evaluation, even though one needs to keep in mind that the evaluation of textual creativity is an inherently subjective task, especially with regard to poetic value. For a discussion of the subject, see Gonçalo Oliveira (2017).
We adopt the evaluation framework of Zhang and Lapata (2014), in which human annotators are asked to evaluate poems on a five-point scale with regard to a number of characteristics, viz.
• fluency: is the poem grammatical and syntactically well-formed?
• coherence: is the poem thematically structured?
• meaningfulness: does the poem convey a meaningful message to the reader?
• poeticness: does the text display the features of a poem?
Additionally, we ask annotators to judge if the poem is written by a human or a computer.
At the moment it seems almost impossible
Yet life is neither good nor evil
The divine mind and soul is immortal
In other words, the soul is never ill
So far, it has barely lost its youthful look
But no man is ever too young for the rest
He thought deeply, and yet his heart shook
At that moment he seemed utterly possessed

Malgré mon enthousiasme, le chagrin s'allonge
Le bonheur est toujours superbe
Toi, tu es un merveilleux songe
Je te vois rêver de bonheur dans l'herbe
Tu trouveras le bonheur de tes rêves
Je t'aime comme tout le monde
Je t'aime mon amour, je me lève
Je ressens pour toi une joie profonde

The moon represents unity and brotherhood
The earth stands in awe and disbelief
Other planets orbit the earth as they should
The universe is infinite and brief
The sky has been so bright and beautiful so far
See the moon shining through the cosmic flame
See the stars in the depths of the earth you are
The planet the planet we can all see the same

Rien ne prouve qu'il s'indigne
Dans le cas contraire, ce n'est pas grave
Si la vérité est fausse, c'est très mauvais signe
Il est vrai que les gens le savent
Et cela est faux, mais qu'importe
En fait, le mensonge, c'est l'effroi
La négation de l'homme en quelque sorte
Le tort n'est pas de penser cela, il est magistrat

Figure 2: Four representative examples of poems generated by the system; the English poems are respectively generated using dimensions 13 and 28 (cf. Table 2); the French poems are generated using dimensions 1 and 25 (cf. Table 3).

In total, we evaluate four different sets of poems, yielded by different model instantiations. The different sets of poems considered during evaluation are:
• rnn: poems generated by the neural architecture defined in section 3.1, without any added constraints;
• rhyme: poems generated by the neural architecture, augmented with the rhyme constraint;
• nmf_rand: poems generated by the neural architecture, augmented with both the rhyme constraint and the topical constraint, where one of the automatically induced NMF dimensions is selected randomly;
• nmf_spec: poems generated by the neural architecture, augmented with both the rhyme constraint and the topical constraint, where one of the automatically induced NMF dimensions is specified manually.
For a proper comparison of our system, we also include:
• random: poems yielded by a baseline model where, for each verse, we select a random sentence (that contains between 7 and 15 words) from a large corpus; the idea is that the lines selected by the baseline model should be fairly fluent (as they come from an actual corpus), but lacking in coherence (due to their random selection);
• human: poems written by human poets; the scores on this set of poems function as an upper bound.

Results for English
For English, 22 annotators evaluated 40 poems in total (5 poems for each of the different sets considered in the evaluation; each poem was evaluated by at least 4 annotators). The annotators consist of native speakers of English, as well as master students in English linguistics and literature. For the human set, we select five poems by well-established English poets that follow the same rhyme scheme as the generated ones. For nmf_spec, we select dimension 13 of Table 2.

The results of the evaluation for English are presented in the upper part of Table 4. First of all, note that all our model instantiations score better than the random baseline model, even with regard to grammatical fluency. The good scores on fluency for the constrained models indicate that the applied constraints do not disrupt the grammaticality of the generated verse (rhyme is significantly different with p < 0.05; nmf_rand and nmf_spec with p < 0.01; recall that the baseline consists of actual sentences from a corpus; significance testing is carried out using a two-tailed permutation test). Secondly, we note that the rhyme constraint seems to improve poeticness (though not significantly), while the topical constraint seems to improve both coherence (p < 0.01 for nmf_spec) and meaningfulness (not significantly). Interestingly, a large proportion of the poems produced by the rhyme model are labeled as human, even though the other scores are fairly low. The score for poeticness is considerably higher (p < 0.01) for nmf_spec (with a manually specified theme selected for its poeticness) than for nmf_rand (with a randomly selected topic, which will often be more mundane). And the best scores on all criteria are obtained with the nmf_spec model, for which more than half of the poems are judged to be written by a human; moreover, the difference between nmf_spec and human poetry is not significant. Finally, our poetry generation compares favourably to previous work: nmf_spec scores markedly and significantly better than Deep-speare (which does not differ significantly from the random baseline), and it also attains better scores than Hafez on all four criteria (though not significantly so).

Results for French
The setup of the French evaluation is analogous to the English one: 22 annotators (the same number as for English) evaluated a total of 30 poems (5 poems for each of the six sets considered in the evaluation; each poem was evaluated by at least 4 annotators). The annotators are all native speakers of French. For the human poems, we select five poems with the same rhyme scheme as the generated ones, among the highest ranked ones on short-edition.com, a website with submissions by amateur poets. For nmf_spec, we select dimension 1 of Table 3. The results for French are presented in the lower part of Table 4.
Generally speaking, we see that the results for French confirm those for English. First of all, all model instantiations obtain better scores than the random baseline model, even with regard to fluency (p < 0.01), again confirming that the application of the rhyme constraint and the topical constraint is not detrimental to the grammaticality of the verse. Secondly, the rhyme constraint significantly improves the score for poeticness (p < 0.05 compared to rnn), while the topical constraint improves both coherence (p < 0.05) and meaningfulness (p < 0.01). Contrary to the English results, only a small proportion of poems from the rhyme model are thought to be human. We do again see that the score for poeticness is considerably higher (p < 0.01) for nmf_spec than for nmf_rand, which seems to indicate that the topic of a poem is an important factor in people's judgements on poeticness. Finally, we again see that the best scores on all criteria are obtained with nmf_spec, for which almost half of the poems are judged to be written by a human.

Conclusion
We presented a system for automatic poetry generation that is trained exclusively on standard, non-poetic text. The system uses a recurrent neural encoder-decoder architecture in order to generate candidate verses, incorporating poetic and topical constraints by modifying the output probability distribution of the neural network. The best verse is then selected for inclusion in the poem, using a global optimization framework. We trained the system on both English and French, and carried out a human evaluation for both languages. The results indicate that the system is able to generate credible poetry that scores well with regard to fluency and coherence, as well as meaningfulness and poeticness. Compared to previous systems, our model achieves state-of-the-art performance, even though it is trained on standard, non-poetic text. In our best setup, about half of the generated poems are judged to be written by a human.
We conclude with a number of future research avenues. First of all, we would like to experiment with different neural network architectures. Specifically, we believe hierarchical approaches (Serban et al., 2017) as well as the Transformer network (Vaswani et al., 2017) would be particularly suitable for poetry generation. Secondly, we would like to incorporate further poetic devices, especially those based on meaning. Gripping poetry often relies on figurative language use, such as symbolism and metaphor. A successful incorporation of such devices would mean a significant step towards truly inspired poetry generation. And finally, we would like to adapt the model for automatic poetry translation, as we feel that the constraint-based approach lends itself perfectly to a poetry translation model that is able to adhere to an original poem in both form and meaning.
In order to facilitate reproduction of the results and encourage further research, the poetry generation system is made available as open source software. The current version can be downloaded at https://github.com/timvdc/poetry.
Figure 1: Graphical representation of the poetry generation model. The encoder encodes the current verse, and the final representation is given to the decoder, which predicts the next verse word by word in reverse. The attention mechanism is represented for the first time step. The rhyme prior is applied to the first time step, and the topic prior is optionally applied to all time steps, mediated by the entropy threshold of the network's output distribution.

Table 1: A number of rhyme examples extracted from Wiktionary, for both English and French.

Table 2: Three example dimensions from the NMF model for English (4 words with highest probability)

Table 3: Three example dimensions from the NMF model for French (4 words with highest probability)


Table 4: Results of the human evaluation (mean score of all annotators) for English and French; values in bold indicate best performance of all generation models