Deep-speare: A joint neural model of poetic language, meter and rhyme

In this paper, we propose a joint architecture that captures language, rhyme and meter for sonnet modelling. We assess the quality of generated poems using crowd and expert judgements. The stress and rhyme models perform very well, as generated poems are largely indistinguishable from human-written poems. Expert evaluation, however, reveals that a vanilla language model captures meter implicitly, and that machine-generated poems still underperform in terms of readability and emotion. Our research shows the importance of expert evaluation for poetry generation, and that future research should look beyond rhyme/meter and focus on poetic language.


Introduction
With the recent surge of interest in deep learning, one question that is being asked across a number of fronts is: can deep learning techniques be harnessed for creative purposes? Creative applications where such research exists include the composition of music (Humphrey et al., 2013; Sturm et al., 2016; Choi et al., 2016), the design of sculptures (Lehman et al., 2016), and automatic choreography (Crnkovic-Friis and Crnkovic-Friis, 2016). In this paper, we focus on a creative textual task: automatic poetry composition.
A distinguishing feature of poetry is its aesthetic forms, e.g. rhyme and rhythm/meter. In this work, we treat the task of poem generation as a constrained language modelling task, such that lines of a given poem rhyme, and each line follows a canonical meter and has a fixed number of stresses. Specifically, we focus on sonnets and generate quatrains in iambic pentameter (e.g. see Figure 1), based on an unsupervised model of language, rhyme and meter trained on a novel corpus of sonnets.
Figure 1 (quatrain text): Shall I compare thee to a summer's day? / Thou art more lovely and more temperate: / Rough winds do shake the darling buds of May, / And summer's lease hath all too short a date:
Our findings are as follows:
• our proposed stress and rhyme models work very well, generating sonnet quatrains with stress and rhyme patterns that are indistinguishable from human-written poems and rated highly by an expert;
• a vanilla language model trained over our sonnet corpus, surprisingly, captures meter implicitly at human-level performance;
• while crowd workers rate the poems generated by our best model as nearly indistinguishable from published poems by humans, an expert annotator found the machine-generated poems to lack readability and emotion, and our best model to be only comparable to a vanilla language model on these dimensions;
• most work on poetry generation focuses on meter (Greene et al., 2010; Ghazvininejad et al., 2016; Hopkins and Kiela, 2017); our results suggest that future research should look beyond meter and focus on improving readability.
In this work, we develop a new annotation framework for the evaluation of machine-generated poems, and release both a novel dataset of sonnets and the full source code associated with this research.

Related Work
Early poetry generation systems were generally rule-based, and based on rhyming/TTS dictionaries and syllable counting (Gervás, 2000; Wu et al., 2009; Netzer et al., 2009; Colton et al., 2012; Toivanen et al., 2013). The earliest attempt at using statistical modelling for poetry generation was Greene et al. (2010), based on a language model paired with a stress model. Neural networks have dominated recent research. Zhang and Lapata (2014) use a combination of convolutional and recurrent networks for modelling Chinese poetry, which Wang et al. (2016) later simplified by incorporating an attention mechanism and training at the character level. For English poetry, Ghazvininejad et al. (2016) introduced a finite-state acceptor to explicitly model rhythm in conjunction with a recurrent neural language model for generation. Hopkins and Kiela (2017) improve rhythm modelling with a cascade of weighted state transducers, and demonstrate the use of a character-level language model for English poetry. A critical difference in our work is that we jointly model both poetry content and forms, and unlike previous work, which uses dictionaries (Ghazvininejad et al., 2016) or heuristics (Greene et al., 2010) for rhyme, we learn it automatically.

Sonnet Structure and Dataset
The sonnet is a poem type popularised by Shakespeare, made up of 14 lines structured as 3 quatrains (4 lines) and a couplet (2 lines); an example quatrain is presented in Figure 1. It follows a number of aesthetic forms, of which two are particularly salient: stress and rhyme.
A sonnet line obeys an alternating stress pattern, called the iambic pentameter, e.g.:
$S^- \; S^+ \; S^- \; S^+ \; S^- \; S^+ \; S^- \; S^+ \; S^- \; S^+$
Shall I compare thee to a summer's day?
where $S^-$ and $S^+$ denote unstressed and stressed syllables, respectively.
A sonnet also rhymes, with a typical rhyming scheme being ABAB CDCD EFEF GG. There are a number of variants, however, mostly seen in the quatrains; e.g. AABB or ABBA are also common.
We build our sonnet dataset from the latest image of Project Gutenberg. We first create a (generic) poetry document collection using the GutenTag tool (Brooke et al., 2015), based on its inbuilt poetry classifier and rule-based structural tagging of individual poems.
Given the poems, we use word and character statistics derived from Shakespeare's 154 sonnets to filter out all non-sonnet poems (to form the "BACKGROUND" dataset), leaving the sonnet corpus ("SONNET").⁵ Based on a small-scale manual analysis of SONNET, we find that the approach is sufficient for extracting sonnets with high precision. BACKGROUND serves as a large corpus (34M words) for pre-training word embeddings, and SONNET is further partitioned into training, development and testing sets. Statistics of SONNET are given in Table 1.⁶
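As an illustration of the statistical filter, the following minimal sketch applies the thresholds quoted in the footnote on sonnet selection (mean words per line, mean characters per line, per-line word and character bounds, and a minimum letter ratio). The function names `line_stats` and `looks_like_sonnet` are hypothetical, and two details are assumptions: characters are counted including spaces and punctuation, and the letter ratio is taken as alphabetic characters divided by all characters in the line.

```python
def line_stats(line):
    """Return (word count, character count, letter ratio) for one line.

    Assumption: characters include spaces and punctuation, and the letter
    ratio is alphabetic characters divided by all characters.
    """
    n_chars = len(line)
    n_letters = sum(ch.isalpha() for ch in line)
    return len(line.split()), n_chars, n_letters / n_chars


def looks_like_sonnet(lines):
    """Check a poem (list of lines) against the footnoted thresholds."""
    stats = [line_stats(l.strip()) for l in lines if l.strip()]
    if not stats:
        return False
    words = [s[0] for s in stats]
    chars = [s[1] for s in stats]
    ratios = [s[2] for s in stats]
    mean_words = sum(words) / len(words)
    mean_chars = sum(chars) / len(chars)
    return (
        8.0 <= mean_words <= 11.5          # mean words per line
        and 40 <= mean_chars <= 51.0       # mean characters per line
        and 6 <= min(words) and max(words) <= 15
        and 32 <= min(chars) and max(chars) <= 60
        and min(ratios) >= 0.59            # minimum letter ratio per line
    )


quatrain = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
    "And summer's lease hath all too short a date:",
]
is_sonnet_like = looks_like_sonnet(quatrain)
```

A filter of this kind only inspects surface statistics, which is why the text pairs it with a manual precision check.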

Architecture
We propose modelling both content and forms jointly with a neural architecture, composed of 3 components: (1) a language model; (2) a pentameter model for capturing iambic pentameter; and (3) a rhyme model for learning rhyming words.
Given a sonnet line, the language model uses standard categorical cross-entropy to predict the next word, and the pentameter model is similarly trained to learn the alternating iambic stress patterns.⁷ The rhyme model, on the other hand, uses a margin-based loss to separate rhyming word pairs from non-rhyming word pairs in a quatrain. For generation we use the language model to generate one word at a time, while applying the pentameter and rhyme models to enforce meter and rhyme (see Generation Procedure).
⁵ The following constraints were used to select sonnets: 8.0 ≤ mean words per line ≤ 11.5; 40 ≤ mean characters per line ≤ 51.0; min/max number of words per line of 6/15; min/max number of characters per line of 32/60; and min letter ratio per line ≥ 0.59.
⁶ The sonnets in our collection are largely in Modern English, with possibly a small number of poems in Early Modern English. The potentially mixed-language dialect data might add noise to our system, and given more data it would be worthwhile to include time period as a factor in the model.
⁷ There are a number of variations in addition to the standard pattern (Greene et al., 2010), but our model uses only the standard pattern as it is the dominant one.

Language Model
The language model is a variant of an LSTM encoder-decoder model with attention (Bahdanau et al., 2015), where the encoder encodes the preceding context (i.e. all sonnet lines before the current line) and the decoder decodes one word at a time for the current line, while attending to the preceding context.
In the encoder, we embed context words $z_i$ using embedding matrix $W_{wrd}$ to yield $w_i$, and feed them to a biLSTM to produce a sequence of encoder hidden states $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$. Next we apply a selective mechanism (Zhou et al., 2017) to each $h_i$. By defining the representation of the whole context $\bar{h} = [\overrightarrow{h}_C; \overleftarrow{h}_1]$ (where $C$ is the number of words in the context), the selective mechanism filters the hidden states $h_i$ using $\bar{h}$ as follows:
$$h'_i = h_i \odot \sigma(W_a h_i + U_a \bar{h} + b_a)$$
where $\odot$ denotes element-wise product. Hereinafter $W$, $U$ and $b$ are used to refer to model parameters. The intuition behind this procedure is to selectively filter less useful elements from the context words.
In the decoder, we embed words $x_t$ in the current line using the encoder-shared embedding matrix ($W_{wrd}$) to produce $w_t$. In addition to the word embeddings, we also embed the characters of a word using embedding matrix $W_{chr}$ to produce $c_{t,i}$, and feed them to a bidirectional (character-level) LSTM:
$$u_{t,i} = [\overrightarrow{u}_{t,i}; \overleftarrow{u}_{t,i}], \quad \overrightarrow{u}_{t,i} = \mathrm{LSTM}_f(c_{t,i}, \overrightarrow{u}_{t,i-1}), \quad \overleftarrow{u}_{t,i} = \mathrm{LSTM}_b(c_{t,i}, \overleftarrow{u}_{t,i+1}) \quad (1)$$
We represent the character encoding of a word by concatenating the last forward and first backward states:
$$u_t = [\overrightarrow{u}_{t,L}; \overleftarrow{u}_{t,1}] \quad (2)$$
where $L$ is the length of the word. We incorporate character encodings because they provide orthographic information, improve representations of unknown words, and are shared with the pentameter model (Section 4.2).¹⁰ The rationale for sharing the parameters is that we see word stress and language model information as complementary.
Given the word embedding $w_t$ and character encoding $u_t$, we concatenate them together and feed them to a unidirectional (word-level) LSTM to produce the decoding states:
$$s_t = \mathrm{LSTM}([w_t; u_t], s_{t-1})$$
We attend $s_t$ to the (filtered) encoder hidden states $h'_i$ and compute their weighted sum:
$$h^*_t = \sum_i a_{t,i} h'_i$$
where $a_{t,i}$ is the attention weight assigned to $h'_i$ given $s_t$. To combine $s_t$ and $h^*_t$, we use a gating unit similar to a GRU (Cho et al., 2014; Chung et al., 2014), which yields a combined state $s'_t$. We then feed $s'_t$ to a linear layer with softmax activation to produce the vocabulary distribution (i.e. $\mathrm{softmax}(W_{out} s'_t + b_{out})$), and optimise the model with standard categorical cross-entropy loss. We use dropout as regularisation (Srivastava et al., 2014), and apply it to the encoder/decoder LSTM outputs and word embedding lookup. The same regularisation method is used for the pentameter and rhyme models.
As our sonnet data is relatively small for training a neural language model (367K words; see Table 1), we pre-train word embeddings and reduce parameters further by introducing weight-sharing between output matrix $W_{out}$ and embedding matrix $W_{wrd}$ via a projection matrix $W_{prj}$ (Inan et al., 2016; Paulus et al., 2017; Press and Wolf, 2017).

Pentameter Model
This component is designed to capture the alternating iambic stress pattern. Given a sonnet line, the pentameter model learns to attend to the appropriate characters to predict the 10 binary stress symbols sequentially. As punctuation is not pronounced, we preprocess each sonnet line to remove all punctuation, leaving only spaces and letters. Like the language model, the pentameter model is fashioned as an encoder-decoder network.
¹⁰ We initially shared the character encodings with the rhyme model as well, but found sub-par performance for the rhyme model. This is perhaps unsurprising, as rhyme and stress are qualitatively very different aspects of forms.
In the encoder, we embed the characters using the shared embedding matrix $W_{chr}$ and feed them to the shared bidirectional character-level LSTM (Equation (1)) to produce the character encodings for the sentence, $u_j = [\overrightarrow{u}_j; \overleftarrow{u}_j]$. In the decoder, it attends to the characters to predict the stresses sequentially with an LSTM:
$$g_t = \mathrm{LSTM}(u^*_{t-1}, g_{t-1})$$
where $u^*_{t-1}$ is the weighted sum of character encodings from the previous time step, produced by an attention network which we describe next, and $g_t$ is fed to a linear layer with softmax activation to compute the stress distribution.
The attention network is designed to focus on stress-producing characters, whose positions are monotonically increasing (as stress is predicted sequentially). We first compute $\mu_t$, the mean position of focus at time step $t$ among the $M$ characters of the sonnet line. Given $\mu_t$, we can compute the (unnormalised) probability for each character position $j$:
$$p^t_j = \exp\left(\frac{-(j - \mu_t)^2}{2T^2}\right)$$
where standard deviation $T$ is a hyper-parameter.
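We incorporate this position information when computing $u^*_t$: the character encodings are first weighted by position, $u'_j = p^t_j u_j$; attention scores $d^t_j$ are then computed from $u'_j$ and the decoder state $g_t$; and finally
$$f^t = \mathrm{softmax}(d^t + \log p^t), \qquad u^*_t = \sum_j f^t_j u_j$$
Intuitively, the attention network incorporates the position information at two points, when computing: (1) $d^t_j$ by weighting the character encodings; and (2) $f^t$ by adding the position log probabilities. This may appear excessive, but preliminary experiments found that this formulation produces the best performance.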
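The position-aware attention can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: positions are 0-indexed, the position weight is taken to be Gaussian-shaped with standard deviation `T`, the raw scores `scores` stand in for the learned attention scores, and the small constant inside the logarithm is added only to avoid `log(0)`. The helper names are hypothetical.

```python
import math


def position_probs(mu, num_chars, T):
    """Unnormalised Gaussian-shaped weight for each character position j."""
    return [math.exp(-((j - mu) ** 2) / (2 * T * T)) for j in range(num_chars)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, mu, T):
    """Attention weights: softmax of (score + log position weight)."""
    p = position_probs(mu, len(scores), T)
    # The epsilon is an assumption to keep the logarithm finite.
    return softmax([d + math.log(pj + 1e-12) for d, pj in zip(scores, p)])


# With uninformative scores, attention peaks at the mean position mu.
weights = attend([0.0] * 9, mu=4.0, T=1.0)
peak = max(range(len(weights)), key=weights.__getitem__)
```

Adding the log position weights inside the softmax biases attention towards characters near the mean position without preventing the learned scores from overriding it.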
In a typical encoder-decoder model, the attended encoder vector $u^*_t$ would be combined with the decoder state $g_t$ to compute the output probability distribution. Doing so, however, would result in a zero-loss model as it will quickly learn that it can simply ignore $u^*_t$ to predict the alternating stresses based on $g_t$. For this reason we use only $u^*_t$ to compute the stress probability:
$$P(S^-) = \sigma(W_e u^*_t + b_e)$$
which gives the loss $L_{ent} = \sum_t -\log P(S_t)$ for the whole sequence, where $S_t$ is the target stress at time step $t$.
We find the decoder still has the tendency to attend to the same characters, despite the incorporation of position information.To regularise the model further, we introduce two loss penalties: repeat and coverage loss.
The repeat loss $L_{rep}$ penalises the model when it attends to previously attended characters (See et al., 2017). By keeping a sum of attention weights over all previous time steps, we penalise the model when it focuses on characters that have non-zero history weights.
The repeat loss discourages the model from focussing on the same characters, but does not assure that the appropriate characters receive attention. Observing that stresses are aligned with the vowels of a syllable, we therefore penalise the model when vowels are ignored:
$$L_{cov} = \sum_{j \in V} \max\left(0, \; C - \sum_t f^t_j\right)$$
where $V$ is a set of positions containing vowel characters, and $C$ is a hyper-parameter that defines the minimum attention threshold that avoids penalty.
To summarise, the pentameter model is optimised with the following loss:
$$L_{pm} = L_{ent} + \alpha L_{rep} + \beta L_{cov} \quad (3)$$
where $\alpha$ and $\beta$ are hyper-parameters for weighting the additional loss terms.
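The two penalties and their combination can be illustrated with a small sketch over an attention matrix `attn[t][j]` (stress step `t`, character position `j`). The exact functional forms are assumptions made for illustration: the repeat penalty follows the coverage-style overlap of See et al. (2017), i.e. the minimum of the current weight and the accumulated history, and the vowel penalty is a hinge on the total attention a vowel position receives. All function names are hypothetical.

```python
def repeat_loss(attn):
    """Penalise attention on characters that already received attention.

    Assumed form: sum over steps and positions of min(current, history).
    """
    history = [0.0] * len(attn[0])
    loss = 0.0
    for step in attn:
        loss += sum(min(f, h) for f, h in zip(step, history))
        history = [h + f for h, f in zip(history, step)]
    return loss


def coverage_loss(attn, vowel_positions, C):
    """Penalise vowel positions whose total attention falls below C."""
    totals = [sum(step[j] for step in attn) for j in range(len(attn[0]))]
    return sum(max(0.0, C - totals[j]) for j in vowel_positions)


def pentameter_loss(l_ent, attn, vowel_positions, C, alpha, beta):
    """Weighted sum of the cross-entropy term and the two penalties."""
    return (l_ent
            + alpha * repeat_loss(attn)
            + beta * coverage_loss(attn, vowel_positions, C))


# Two steps that both attend to position 0 and ignore the vowel at position 1.
stuck = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
# Two steps that move on to a new character each time.
moving = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

In this toy case the `stuck` pattern is penalised by both terms, while the `moving` pattern incurs neither penalty.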

Rhyme Model
Two reasons motivate us to learn rhyme in an unsupervised manner: (1) we intend to extend the current model to poetry in other languages (which may not have pronunciation dictionaries); and (2) the language in our SONNET data is not Modern English, and so contemporary dictionaries may not accurately reflect the rhyme of the data.
Exploiting the fact that rhyme exists in a quatrain, we feed sentence-ending word pairs of a quatrain as input to the rhyme model and train it to learn how to separate rhyming word pairs from non-rhyming ones. Note that the model does not assume any particular rhyming scheme; it works as long as quatrains have rhyme.
A training example consists of a number of word pairs, generated by pairing one target word with 3 other reference words in the quatrain, i.e. $\{(x_t, x_r), (x_t, x_{r+1}), (x_t, x_{r+2})\}$, where $x_t$ is the target word and $x_{r+i}$ are the reference words. We assume that in these 3 pairs there should be one rhyming and 2 non-rhyming pairs. From preliminary experiments we found that we can improve the model by introducing additional non-rhyming or negative reference words. Negative reference words are sampled uniformly at random from the vocabulary, and the number of additional negative words is a hyper-parameter.
For each word $x$ in the word pairs we embed the characters using the shared embedding matrix $W_{chr}$ and feed them to an LSTM to produce the character states $u_j$. Unlike the language and pentameter models, we use a unidirectional forward LSTM here (as rhyme is largely determined by the final characters), and the LSTM parameters are not shared. We represent the encoding of the whole word by taking the last state $u = u_L$, where $L$ is the character length of the word.
Given the character encodings, we use a margin-based loss to optimise the model:
$$Q = \{\cos(u_t, u_r), \cos(u_t, u_{r+1}), \ldots\}, \qquad L_{rm} = \max(0, \; \delta - \mathrm{top}(Q, 1) + \mathrm{top}(Q, 2))$$
where $\mathrm{top}(Q, k)$ returns the $k$-th largest element in $Q$, and $\delta$ is a margin hyper-parameter.
Intuitively, the model is trained to learn a sufficient margin (defined by $\delta$) that separates the best pair from all others, with the second-best being used to quantify all others. This is the justification used in the multi-class SVM literature for a similar objective (Wang and Xue, 2014).
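A minimal sketch of this margin objective, assuming the word encodings are plain non-zero vectors and using hypothetical helper names, is given below. It computes the cosine similarity between the target encoding and each reference encoding, then applies a hinge on the gap between the best and second-best scores.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rhyme_margin_loss(target, references, delta):
    """Hinge loss on the margin between the best and second-best pair."""
    q = sorted((cosine(target, r) for r in references), reverse=True)
    return max(0.0, delta - q[0] + q[1])


target = [1.0, 0.0]
# One reference matches the target; the other two are orthogonal to it.
separated = rhyme_margin_loss(target, [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]], 0.5)
# Two references match equally well, so the margin is violated.
ambiguous = rhyme_margin_loss(target, [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], 0.5)
```

Only the top two scores enter the loss, so additional negative reference words affect training only when one of them ranks first or second.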
With this network we can estimate whether two words rhyme by computing the cosine similarity score during generation, and resample words as necessary to enforce rhyme.

Generation Procedure
We focus on quatrain generation in this work, and so the aim is to generate 4 lines of poetry. During generation we feed the hidden state from the previous time step to the language model's decoder to compute the vocabulary distribution for the current time step. Words are sampled using a temperature between 0.6 and 0.8, and they are resampled if any of the following is generated: (1) the UNK token; (2) non-stopwords that were generated before; (3) any generated words with a frequency ≥ 2; (4) the preceding 3 words; and (5) a number of symbols including parentheses, single and double quotes. The first sonnet line is generated without using any preceding context.
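The word-level sampling loop can be sketched as follows. This is an illustration under assumptions rather than the released code: the vocabulary distribution is a small dictionary, temperature is applied by raising each probability to the power `1 / temperature`, the set `banned` is assumed to be built by the caller from the five resampling rules, and `max_tries` is a hypothetical safety cap.

```python
import random


def sample_with_temperature(probs, temperature, rng):
    """Sample a word after sharpening the distribution with a temperature."""
    words = list(probs)
    # p ** (1 / T) is equivalent to dividing the logits by T before softmax.
    weights = [probs[w] ** (1.0 / temperature) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def sample_word(probs, banned, temperature=0.7, rng=random, max_tries=100):
    """Resample until the drawn word is not in the banned set."""
    for _ in range(max_tries):
        word = sample_with_temperature(probs, temperature, rng)
        if word not in banned:
            return word
    return None  # no acceptable word found within the cap


rng = random.Random(0)
word = sample_word({"UNK": 0.9, "rose": 0.1}, banned={"UNK"},
                   rng=rng, max_tries=1000)
```

Temperatures below 1 concentrate probability mass on likely words, which trades diversity for fluency.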
We next describe how to incorporate the pentameter model for generation. Given a sonnet line, the pentameter model computes a loss $L_{pm}$ (Equation (3)) that indicates how well the line conforms to the iambic pentameter. We first generate 10 candidate lines (all initialised with the same hidden state), and then sample one line from the candidate lines based on the pentameter loss values ($L_{pm}$). We convert the losses into probabilities by taking the softmax, and a sentence is sampled with temperature = 0.1.
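The candidate selection step can be sketched as below. One detail is an assumption: the losses are negated before the softmax so that lines with lower pentameter loss receive higher probability. The function names are hypothetical.

```python
import math
import random


def line_probabilities(losses, temperature=0.1):
    """Softmax over negated losses, sharpened by a low temperature."""
    scaled = [-loss / temperature for loss in losses]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_line(candidates, losses, temperature=0.1, rng=random):
    """Sample one candidate line according to its pentameter loss."""
    weights = line_probabilities(losses, temperature)
    return rng.choices(candidates, weights=weights, k=1)[0]


probs = line_probabilities([0.2, 1.2, 2.2])
chosen = pick_line(["line a", "line b"], [0.0, 100.0], rng=random.Random(1))
```

With a temperature of 0.1, a loss gap of 1.0 already makes the better line tens of thousands of times more likely, so the procedure is close to picking the minimum-loss candidate while retaining some randomness among near ties.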
To enforce rhyme, we randomly select one of the rhyming schemes (AABB, ABAB or ABBA) and resample sentence-ending words as necessary. Given a pair of words, the rhyme model produces a cosine similarity score that estimates how well the two words rhyme. We resample the second word of a rhyming pair (e.g. when generating the second A in AABB) until it produces a cosine similarity ≥ 0.9. We also resample the second word of a non-rhyming pair (e.g. when generating the first B in AABB) by requiring a cosine similarity ≤ 0.7.¹⁸
When generating in the forward direction we can never be sure that any particular word is the last word of a line, which creates a problem for resampling to produce good rhymes. This problem is resolved in our model by reversing the direction of the language model, i.e. generating the last word of each line first. We apply this inversion trick at the word level (character order of a word is not modified) and only to the language model; the pentameter model receives the original word order as input.
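The rhyme resampling rule can be sketched as a small loop. The callables `propose` (draws a candidate line-ending word) and `similarity` (stands in for the rhyme model's cosine score) are hypothetical placeholders, and the cap of 1000 steps follows the footnote on resampling.

```python
def accept_line_end(score, should_rhyme, rhyme_min=0.9, non_rhyme_max=0.7):
    """Accept high similarity for rhyming pairs, low for non-rhyming pairs."""
    return score >= rhyme_min if should_rhyme else score <= non_rhyme_max


def resample_line_end(propose, similarity, partner, should_rhyme,
                      max_steps=1000):
    """Draw line-ending words until one satisfies the rhyme constraint."""
    for _ in range(max_steps):
        word = propose()
        if accept_line_end(similarity(word, partner), should_rhyme):
            return word
    return None  # caller is expected to restart generation from scratch


def toy_similarity(a, b):
    """Toy stand-in for the rhyme model: compare the last two letters."""
    return 1.0 if a[-2:] == b[-2:] else 0.0


rhyming_words = iter(["cat", "may"])
rhymed = resample_line_end(lambda: next(rhyming_words), toy_similarity,
                           "day", should_rhyme=True)
other_words = iter(["may", "cat"])
unrhymed = resample_line_end(lambda: next(other_words), toy_similarity,
                             "day", should_rhyme=False)
```

Because the language model is run in reverse, the line-ending word is the first word generated for each line, so this check can be applied before the rest of the line is produced.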

Experiments
We assess our sonnet model in two ways: (1) component evaluation of the language, pentameter and rhyme models; and (2) poetry generation evaluation, by crowd workers and an English literature expert. A sample of machine-generated sonnets is included in the supplementary material.
We tune the hyper-parameters of the model over the development data (optimal configuration in the supplementary material).Word embeddings are initialised with pre-trained skip-gram embeddings (Mikolov et al., 2013a,b) on the BACKGROUND dataset, and are updated during training.For optimisers, we use Adagrad (Duchi et al., 2011) for the language model, and Adam (Kingma and Ba, 2014) for the pentameter and rhyme models.We truncate backpropagation through time after 2 sonnet lines, and train using 30 epochs, resetting the network weights to the weights from the previous epoch whenever development loss worsens.

Language Model
We use standard perplexity for evaluating the language model. In terms of model variants, we have:¹⁹
• LM: Vanilla LSTM language model;
• LM*: LSTM language model that incorporates character encodings (Equation (2));
• LM**: LSTM language model that incorporates both character encodings and preceding context;
• LM**-C: Similar to LM**, but preceding context is encoded using convolutional networks, inspired by the poetry model of Zhang and Lapata (2014);
• LM**+PM+RM: the full model, with joint training of the language, pentameter and rhyme models.
¹⁸ Maximum number of resampling steps is capped at 1000. If the threshold is exceeded the model is reset to generate from scratch again.
¹⁹ All models use the same (applicable) hyper-parameter configurations.
Table 2: Component evaluation for the language model ("Ppl" = perplexity), pentameter model ("Stress Acc"), and rhyme model ("Rhyme F1"). Each number is an average across 10 runs.
Perplexity on the test partition is detailed in Table 2. Encouragingly, we see that the incorporation of character encodings and preceding context improves performance substantially, reducing perplexity by almost 10 points from LM to LM**. The inferior performance of LM**-C compared to LM** demonstrates that our approach of processing context with recurrent networks with selective encoding is more effective than convolutional networks. The full model LM**+PM+RM, which learns stress and rhyme patterns simultaneously, also appears to improve the language model slightly.

Pentameter Model
To assess the pentameter model, we use the attention weights to predict stress patterns for words in the test data, and compare them against stress patterns in the CMU pronunciation dictionary. Words that have no coverage or have non-alternating patterns given by the dictionary are discarded. We use accuracy as the metric, and a predicted stress pattern is judged to be correct if it matches any of the dictionary stress patterns.
To extract a stress pattern for a word from the model, we iterate through the pentameter (10 time steps), and append the appropriate stress (e.g. 1st time step = $S^-$) to the word if any of its characters receives an attention ≥ 0.20.
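The extraction procedure can be sketched as follows, assuming a punctuation-free line and an attention matrix `attn[t][j]` over its characters (spaces included). The function name is hypothetical, stresses are written as "-" and "+" for unstressed and stressed, and the toy example uses 3 time steps instead of 10 for brevity.

```python
def word_stress_patterns(line, attn, threshold=0.20):
    """Assign stress symbols to words from character-level attention."""
    # Locate the character span [start, end) of each word in the line.
    spans, start = [], None
    for j, ch in enumerate(line + " "):
        if ch != " " and start is None:
            start = j
        elif ch == " " and start is not None:
            spans.append((start, j))
            start = None
    patterns = ["" for _ in spans]
    for t, step in enumerate(attn):
        # Iambic pattern: even time steps are unstressed, odd are stressed.
        symbol = "-" if t % 2 == 0 else "+"
        for k, (a, b) in enumerate(spans):
            if any(w >= threshold for w in step[a:b]):
                patterns[k] += symbol
    return [(line[a:b], p) for (a, b), p in zip(spans, patterns)]


line = "to a day"
attn = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # step 1 attends to "o"
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # step 2 attends to "a"
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # step 3 attends to "a" in "day"
]
result = word_stress_patterns(line, attn)
```

The resulting per-word patterns are what would then be compared against the dictionary entries.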
For the baseline (Stress-BL) we use the pretrained weighted finite state transducer (WFST) provided by Hopkins and Kiela (2017). The WFST maps a sequence of words to a sequence of stresses by assuming each word has 1-5 stresses and the full word sequence produces iambic pentameter. It is trained using the EM algorithm on a sonnet corpus developed by the authors.
We present stress accuracy in Table 2. LM**+PM+RM performs competitively, and informal inspection reveals that a number of mistakes are due to dictionary errors. To understand the predicted stresses qualitatively, we display attention heatmaps for the first quatrain of Shakespeare's Sonnet 18 in Figure 3. The y-axis represents the ten stresses of the iambic pentameter, and [...]
[...] the pairwise rhyme strength between two words. The model's objective is to maximise poem likelihood over all possible rhyme scheme assignments under the latent variables φ and θ. We train this model (Rhyme-EM) on our data and use the learnt θ to decide whether two words rhyme.
Table 2 details the rhyming results. The rhyme model performs very strongly at F1 > 0.90, well above both baselines. Rhyme-EM performs poorly because it operates at the word level (i.e. it ignores character/orthographic information) and hence does not generalise well to unseen words and word pairs.
To better understand the errors qualitatively, we present a list of word pairs with their predicted cosine similarity in Table 3. Examples on the left side are rhyming word pairs as determined by the CMU dictionary; right are non-rhyming pairs. Looking at the rhyming word pairs (left), it appears that these words tend not to share any word-ending characters. For the non-rhyming pairs, we spot several CMU errors: (sire, ire) and (queen, been) clearly rhyme.

Crowdworker Evaluation
Following Hopkins and Kiela (2017), we present a pair of quatrains (one machine-generated and one human-written, in random order) to crowd workers on CrowdFlower, and ask them to guess which is the human-written poem. Generation quality is estimated by computing the accuracy of workers at correctly identifying the human-written poem (lower values indicate better results for the model).
We generate 50 quatrains each for LM, LM** and LM**+PM+RM (150 in total), and as a control, generate 30 quatrains with LM trained for one epoch. An equal number of human-written quatrains was sampled from the training partition. A HIT contained 5 pairs of poems (of which one is a control), and workers were paid $0.05 for each HIT. Workers who failed to identify the human-written poem in the control pair reliably (minimum accuracy = 70%) were removed by CrowdFlower automatically, and they were restricted to do a maximum of 3 HITs. To dissuade workers from using search engines to identify real poems, we presented the quatrains as images.
Accuracy is presented in Table 4. We see a steady decrease in accuracy (= improvement in model quality) from LM to LM** to LM**+PM+RM, indicating that each model generates quatrains that are less distinguishable from human-written ones. Based on the suspicion that workers were using rhyme to judge the poems, we tested a second model, LM**+RM, which is the full model without the pentameter component. We found identical accuracy (0.532), confirming our suspicion that crowd workers depend on only rhyme in their judgements. These observations demonstrate that meter is largely ignored by lay persons in poetry evaluation.

Expert Judgement
To better understand the qualitative aspects of our generated quatrains, we asked an English literature expert (a Professor of English literature at a major English-speaking university; the last author of this paper) to directly rate 4 aspects: meter, rhyme, readability and emotion (i.e. amount of emotion the poem evokes). All are rated on an ordinal scale from 1 to 5 (1 = worst; 5 = best). In total, 120 quatrains were annotated, 30 each for LM, LM**, LM**+PM+RM, and human-written poems (Human). The expert was blind to the source of each poem. The mean and standard deviation of the ratings are presented in Table 5.
We found that our full model has the highest ratings for both rhyme and meter, even higher than human poets. This might seem surprising, but in fact it is well established that real poets regularly break rules of form to create other effects (Adams, 1997). Despite excellent form, the output of our model can easily be distinguished from human-written poetry due to its lower emotional impact and readability. In particular, there is evidence here that our focus on form actually hurts the readability of the resulting poems, relative even to the simpler language models. Another surprise is how well simple language models do in terms of their grasp of meter: in this expert evaluation, we see only marginal benefit as we increase the sophistication of the model. Taken as a whole, this evaluation suggests that future research should look beyond forms, towards the substance of good poetry.

Conclusion
We propose a joint model of language, meter and rhyme that captures language and form for modelling sonnets. We provide quantitative analyses for each component, and assess the quality of generated poems using judgements from crowdworkers and a literature expert. Our research reveals that a vanilla LSTM language model captures meter implicitly, and our proposed rhyme model performs exceptionally well. Machine-generated poems, however, still underperform in terms of readability and emotion.

Figure 2: Architecture of the language, pentameter and rhyme models.Colours denote shared weights.

Figure 3: Character attention weights for the first quatrain of Shakespeare's Sonnet 18.

Table 5: Expert mean and standard deviation ratings on several aspects of the generated quatrains.