Sounds Wilde. Phonetically Extended Embeddings for Author-Stylized Poetry Generation

This paper addresses author-stylized text generation. Using a language model with extended phonetic and semantic embeddings for poetry generation, we show that phonetic information contributes to the overall model performance to a degree comparable with the information on the target author. Phonetic information is shown to be important for both English and Russian. Human readers tend to attribute machine-generated texts to the target author.


Introduction
Generative models for natural language, and for poetry specifically, have been discussed for more than fifty years (Wheatley, 1965). Lamb et al. (2017) provide a detailed overview of generative poetry techniques. This paper addresses author-stylized poetry (Tikhonov and Yamshchikov, 2018a) and shows the importance of phonetics for author-stylized poetry generation.
The structure of a poem can vary across languages, from the highly specific and distinct structures of Chinese poems (Zhang et al., 2017) to poems where formal structure hardly exists, e.g. American poetry of the twentieth century (say, the lyrics of Charles Bukowski) or so-called white poems in Russian. The structure and standards of poems can depend on various factors that are primarily phonetic in their nature. In the broadest sense, rhymes in a classical western sonnet, the structure of a Persian ruba'i, a sequence of tones in a Chinese quatrain, or the structure within rap bars could be expressed as a set of phonetic rules based on a certain understanding of expressiveness and euphony shared across a given culture or, sometimes, an artistic community. Some cultures and styles also have particular semantic limitations or 'standards', for example, the 'centrality' of certain topics in classical Japanese poetry, see (Maynard, 1994). We do not attempt to address high-level semantic structure; however, one can add some kind of pseudo-semantic rules to the model discussed further, say via a mechanism in line with (Ghazvininejad et al., 2016) or (Yi et al., 2017).
The importance of phonetics in poetical texts was broadly discussed among Russian futurist poets, see (Kruchenykh, 1923). Several Russian linguistic circles and art groups (particularly OPOJAZ) in the first quarter of the 20th century actively discussed the concept of abstruse language, see (Shklovsky, 1919), stressing also that the form of a poem, and especially its acoustic structure, is the number one priority for the future of literature. In a recent paper, Blasi et al. (2016) challenged the broadly accepted idea that sound and meaning are not interdependent: unrelated languages very often use (or avoid) the same sounds for specific referents.
In (He et al., 2016) and (Ghannay et al., 2016) it was shown that acoustic word embeddings improve algorithm performance on a number of NLP tasks. In line with these ideas, one of the key features of the model that we discuss below is its concatenated embedding, which contains information on the phonetics of every word, preprocessed by a bi-directional Long Short-Term Memory (LSTM) network, alongside its vectorized semantic representation.
In (Tikhonov and Yamshchikov, 2018a) a model was proposed for generating texts that resemble the writing style of a particular author within the training dataset. In this paper we quantify the stylistic similarity of the generated texts and show the importance of extending the word embeddings with phonetic information for the overall performance of the model. The proposed model might also be applicable to prose, but the diverse phonetic structure of poetry discussed above makes poetry better suited for the purposes of this paper. Also, since one would like to incorporate human judgement of the generated text and measure how well a human reader can attribute generated text to the target author, poetry seems preferable to prose for its stylistic expressiveness.
The contribution of this paper is three-fold: (1) we propose an LSTM with extended phonetic and semantic embeddings and quantify the quality of the obtained stylized poems both subjectively through a survey and objectively with BLEU metrics; (2) we show that phonetic information plays a key role in author-stylized poetry generation; (3) we demonstrate that the proposed approach works in a multilingual setting, providing examples in English and in Russian.

Related work
In (Lipton et al., 2015), (Kiddon et al., 2016), (Lebret et al., 2016), (Radford et al., 2017), (Tang et al., 2016), (Hu et al., 2017) a number of RNN-based generative or generative adversarial models for controlled text generation are developed. These papers took the content and semantics of the output into consideration, yet did not work with the style of the generated texts. Other work focused on speaker consistency in a dialogue. In (Sutskever et al., 2011) and in (Graves, 2013) it is demonstrated that a character-based recurrent neural network with gated connections or an LSTM network, respectively, can generate texts that resemble news or Wikipedia articles. Chinese classical poetry, due to its diverse and deeply studied structure, is addressed in (He et al., 2012), (Yi et al., 2017), (Yan, 2016), (Yan et al., 2016), and (Zhang et al., 2017). In (Ghazvininejad et al., 2016) an algorithm generates a poem in line with a user-defined topic; in (Potash et al., 2015) stylized rap lyrics are generated with an LSTM trained on a rap poetry corpus.
There are diverse understandings of literary style, which has lately become obvious due to the growing attention to the problems of automated style transfer. For a brief overview of the state of the art in the style transfer problem see (Tikhonov and Yamshchikov, 2018b). Style is sometimes regarded as the sentiment of a text (see (Shen et al., 2017) or (Li et al., 2018)), its politeness (Sennrich et al., 2016), or the style of the time (see (Hughes et al., 2012)). In (Fu et al., 2017) the authors generalize these ideas, measuring the success of a particular style aspect with a specifically trained classifier. In (Guu et al., 2017) it is shown that an existing human-written source used to control the saliency of the output can significantly improve the quality of the resulting texts. Generative models, on the other hand, often do not have such input and have to generate stylized texts from scratch, as in (Ficler and Goldberg, 2017).

Model
We use an LSTM-based language model that predicts the word $w_{n+1}$ based on the previous inputs $w_1, \ldots, w_n$ and some other parameters of the modeled sequence. A schematic picture of the model is shown in Figure 1. Document information projections are highlighted with blue arrows; each is obtained by multiplying the document embedding by a projection matrix whose dimensionality depends on the target dimensionality of the projection. The LSTM has a 1152-dimensional input and a 512-dimensional state. Figure 2 shows the concatenated word representation of the model. The representation includes a 512-dimensional projection of the concatenated author and document embeddings at every step and two 128-dimensional vectors corresponding to the final states of two bidirectional LSTMs. The first LSTM works with a character representation of the word and the second one uses phonemes of the International Phonetic Alphabet, employing a heuristic to transcribe words into phonemes. A somewhat similar idea, but with convolutional neural networks rather than LSTMs, was proposed in (Jozefowicz et al., 2016).
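The concatenation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The 512-, 128-, 128- and 1152-dimensional widths come from the text; the width of the semantic word vector is not stated there, so `WORD_DIM` is an assumption obtained by subtraction, and the function name `build_input` is hypothetical.

```python
from typing import List

# Widths stated in the text.
DOC_PROJ_DIM = 512     # projection of the concatenated author and document embeddings
CHAR_DIM = 128         # final state of the character-level bidirectional LSTM
PHONEME_DIM = 128      # final state of the phoneme-level bidirectional LSTM
LSTM_INPUT_DIM = 1152  # input width of the main LSTM

# ASSUMPTION: the text does not give the width of the semantic word vector.
# Here it is taken to be whatever remains of the 1152-dimensional input.
WORD_DIM = LSTM_INPUT_DIM - DOC_PROJ_DIM - CHAR_DIM - PHONEME_DIM


def build_input(word_vec: List[float], doc_proj: List[float],
                char_state: List[float], phoneme_state: List[float]) -> List[float]:
    """Concatenate the four parts into one input vector for the main LSTM."""
    parts = [
        ("word_vec", word_vec, WORD_DIM),
        ("doc_proj", doc_proj, DOC_PROJ_DIM),
        ("char_state", char_state, CHAR_DIM),
        ("phoneme_state", phoneme_state, PHONEME_DIM),
    ]
    for name, vec, expected in parts:
        if len(vec) != expected:
            raise ValueError(f"{name} has width {len(vec)}, expected {expected}")
    # Plain list concatenation stands in for tensor concatenation.
    return word_vec + doc_proj + char_state + phoneme_state


# Usage with placeholder values (real inputs would come from trained networks).
x = build_input([0.0] * WORD_DIM, [0.0] * DOC_PROJ_DIM,
                [0.0] * CHAR_DIM, [0.0] * PHONEME_DIM)
```

Under this assumption the phonetic and character parts together occupy 256 of the 1152 input dimensions, so dropping them (as in the ablation discussed later) changes only the input width and leaves the rest of the architecture untouched.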

Datasets
Two proprietary datasets of English and Russian poetry were used for training. All punctuation was deleted and every character was converted to lower case. No other preprocessing was performed. The dataset sizes can be found in
The model produces results of comparable quality for both languages, so to keep this paper short we further address generated poems in English only and provide the experimental results for Russian in the Appendix. We want to emphasize that we do not see any excessive difficulties in implementing the proposed model for other languages for which one can form a training corpus and provide a phonetically transcribed vocabulary.
Table 3 shows some examples of generated stylized poetry. The model captures syntactic characteristics of the author (note the double negation in the first and the last line of Neuro-Marley) along with the vocabulary ('burden', 'darkness', 'fears' could be subjectively associated with the gothic lyrics of Poe, whereas 'sunshine', 'fun', 'fighting every rule' could be associated with positive yet rebellious reggae music). Author-specific vocabulary can technically imply specific phonetics that characterizes a given author; however, this implication is not self-evident and, generally speaking, does not have to hold. As we demonstrate later, phonetics does indeed contribute significantly to author stylization.
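The preprocessing described at the start of this section (punctuation removed, text lower-cased) can be sketched in Python as follows. This is a minimal illustration and not the authors' code. Replacing each punctuation mark with a space is an assumption, chosen because it matches generated samples such as "don t you know"; only ASCII punctuation is handled, and the function name `preprocess` is hypothetical.

```python
import string


def preprocess(text: str) -> str:
    """Lower-case the text and strip ASCII punctuation, keeping line breaks."""
    # ASSUMPTION: every punctuation mark becomes a space ("don't" -> "don t").
    table = str.maketrans({ch: " " for ch in string.punctuation})
    cleaned = text.lower().translate(table)
    # Collapse repeated spaces within each line; line breaks are preserved
    # because line boundaries matter for poetry.
    return "\n".join(" ".join(line.split()) for line in cleaned.splitlines())


sample = "Don't you know,\nYou ain't no fool!"
print(preprocess(sample))
```

A corpus with typographic quotes or other non-ASCII punctuation would need an extended character set.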

Experiments
In (Tikhonov and Yamshchikov, 2018a) a detailed description of the experiments is provided, along with a new metric for automated estimation of stylization quality: sample cross entropy. In this submission we specifically address the results that deal with the phonetics of the generated texts.

BLEU
BLEU is a metric that estimates the correspondence between a machine's output and that of a human; it is usually mentioned in the context of machine translation. We propose adapting it to the task of stylized text generation in the following way: a random starting line is sampled from the human-written poems and initializes the generation. The generative model 'finishes' the poem by generating the three ending lines of the quatrain. One then calculates BLEU between the three actual lines that finish the human-written quatrain starting with the given first line and the lines generated by the model when initialized with the same human-written line. Table 4 shows BLEU calculated on the validation dataset for the plain vanilla LSTM, for the LSTM with author information support but without the bidirectional LSTMs for phonemes and characters included in the embeddings, and for the full model. The uniform random and weighted random models give baselines for comparison.
Table 4 shows that extended phonetic embeddings play a significant role in the overall quality of the generated stylized output. It is important to mention that phonetics is an implicit characteristic of an author and of the training dataset (in line with the definition of style in (Tikhonov and Yamshchikov, 2018b)): humans do not have qualitative insights into the phonetics of Wilde or Cobain, yet this information turns out to be important for style attribution.

Neuro-Poe | Neuro-Marley
her beautiful eyes were bright | don t you know you ain t no fool
this day is a burden of tears | you r gonna make some fun
the darkness of the night | but she s fighting every rule
our dreams of hope and fears | ain t no sunshine when she s gone
Table 3: Examples of the generated stylized quatrains. The punctuation is omitted since it was omitted in the training dataset.
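The BLEU-based evaluation described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' evaluation code. The paper does not specify the n-gram order, smoothing, or whether scores are aggregated per quatrain or over the whole corpus, so the sketch uses unsmoothed single-reference BLEU with uniform weights up to 4-grams, computed per quatrain. The names `bleu`, `quatrain_bleu` and the callable `generate_ending` are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Unsmoothed BLEU against a single reference, uniform n-gram weights."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clipped counts: an n-gram is credited at most as often as it
        # occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for candidates that are not longer than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(sum(log_precisions) / max_n)


def quatrain_bleu(human_quatrain: List[str],
                  generate_ending: Callable[[str], List[str]]) -> float:
    """Seed the generator with the first human line, score its three lines."""
    first_line, human_ending = human_quatrain[0], human_quatrain[1:]
    generated_ending = generate_ending(first_line)  # hypothetical model call
    candidate = " ".join(generated_ending).split()
    reference = " ".join(human_ending).split()
    return bleu(candidate, reference)
```

Averaging `quatrain_bleu` over many sampled starting lines would give one number per model variant, which is the kind of comparison Table 4 reports. Since unsmoothed 4-gram BLEU is often zero on short texts, a practical evaluation might prefer corpus-level aggregation or a smoothed variant.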

Survey data
Two quatrains from each of William Shakespeare, Lewis Carroll, Bob Marley and the band MUSE were sampled. They were accompanied by two quatrains per author generated by the model conditioned on each of these four authors. One hundred and forty fluent English speakers were asked to read all 16 quatrains in randomized order and choose one of five options for each quatrain, i.e. that the author of the verse is William Shakespeare, Lewis Carroll, Bob Marley, MUSE or an Artificial Neural Network. The summary of the obtained results is shown in Table 5. Analogous results for Russian can be seen in Table 8 in the Appendix, along with a more detailed description of the methodology. It is important to note that the generated pieces used in the tests were human-filtered for mistakes, such as those demonstrated in Table 6, whereas the automated metrics mentioned above were estimated on the whole sample of generated texts without any human filtering.
Looking at Table 5 one can see that the model has achieved good results in author stylization. Indeed, the participants recognized Shakespeare in more than 46% of cases (more than twice as often as under random choice, which corresponds to 20% with five options) and did slightly worse in their recognition of Bob Marley (40% of cases) and MUSE (39% of cases, still about twice the random-choice rate). This shows that the human-written quatrains were indeed recognizable and that the participants were fluent enough in the target language to attribute the given texts to the correct author. At the same time, people were 'tricked' into believing that the text generated by the model was actually written by the target author in 37% of cases for Neuro-Shakespeare, 47% for Neuro-Marley, and 34% for Neuro-MUSE. Lewis Carroll turned out to be less recognizable and was recognized in the survey in only 20% of cases (which corresponds to a purely random guess). The subjective underperformance of the model on this author can therefore be explained by the difficulty the participants experienced in determining his authorship.
Combining the results in Table 4 with the results of the survey shown in Table 5, one could conclude that the phonetic structure of lyrics has an impact on the correct author attribution of the stylized content. This impact is usually not acknowledged by a human reader but is highlighted by the proposed experiment.
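The comparison with random choice used above is simple arithmetic: with five answer options, a purely random guess picks any given author 20% of the time. The short Python sketch below makes the ratios explicit. The shares are the rounded percentages quoted in the text (Shakespeare is quoted as "more than 46%", so 0.46 is a lower bound), and the names `CHANCE` and `lift` are illustrative.

```python
N_OPTIONS = 5           # four human authors plus "Artificial Neural Network"
CHANCE = 1 / N_OPTIONS  # probability of picking a given option at random

# Rounded shares quoted in the text (not the exact Table 5 values).
recognized_human = {"Shakespeare": 0.46, "Marley": 0.40,
                    "MUSE": 0.39, "Carroll": 0.20}
attributed_generated = {"Neuro-Shakespeare": 0.37, "Neuro-Marley": 0.47,
                        "Neuro-MUSE": 0.34}


def lift(share: float, chance: float = CHANCE) -> float:
    """How many times more often than chance an option was chosen."""
    return share / chance


for name, share in {**recognized_human, **attributed_generated}.items():
    print(f"{name}: {share:.0%} of answers, {lift(share):.2f}x chance")
```

A lift close to 1, as for Lewis Carroll, means the respondents did no better than guessing, which is why the model's results for that author are hard to interpret.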

Conclusion
In this paper we have addressed the problem of author-stylized text generation. It has been shown that extending word embeddings with phonetic information has an impact on the BLEU of the generative model comparable to that of the information on the authors of the text. It was also shown that, when faced with an author with a recognizable style (an author who is recognized approximately two times more frequently than at random), humans mistake the output of the proposed generative model for the work of the target author as often as they correctly attribute original texts to the author in question. The experiments were carried out in English and in Russian, and there are no obvious obstacles to applying the same approach to other languages.

Table 5: Results of a survey with 140 respondents. Shares of each of the 5 different answers given by people when reading an excerpt of a poetic text by the author stated in the first column. The two biggest values in each row are marked with * and ** and a bold typeface.

A Examples of output
Reading the raw output, we could see several types of recurring characteristic errors that are typical for LSTM-based text generation. They can be broadly classified into several different types:
• If the target author is underrepresented in the training dataset, the model tends to make more mistakes, mostly syntactic ones;
• Since generation is done word by word, the model can deviate significantly when sampling a low-frequency word;
• Pronouns tend to cluster together, possibly due to the problem of anaphora in the training dataset;
• The line can end abruptly; this problem also seems to occur more frequently for authors that are underrepresented in the training dataset.

Neuro-Lenov: nonsense | do many a fair honour best of make or lose | о о о о и о
Table 6: Examples of several recurring types of mistakes that occur within generated lyrics.

Table 7 lists some subjectively cherry-picked, especially successful examples of the system outputs for both English and Russian. Since text is generated line by line and verses are obtained through random rhyme or rhythm filtering, several types of serendipitous events occur. They can be broadly classified into four different types:
• Wording of the verse that fits the style of the target author;
• Pseudo-plot that is perceived by the reader due to a coincidental cross-reference between two lines;
• Pseudo-metaphor that is perceived by the reader due to a coincidental cross-reference between two lines;
• Sentiment and emotional ambience that correspond to the subjective perception of the target author.

B Survey design
The surveys were designed identically for English and Russian. We recruited the respondents via social media; the only prerequisite was fluency in the target language. Respondents were asked to determine the authorship of 16 different 4-line verses. The human-written verses were chosen randomly from the data for the given author. The generated verses were obtained through line-by-line automated rhyme and rhythm heuristic filtering. Since LSTMs are not perfect at text generation and tend to have the clear problems illustrated in Table 6, we additionally filtered the generated texts, keeping only the verses that do not contain the obvious mistakes described above. Each of the 16 questions consisted of a text (in lower case with punctuation stripped) and multiple-choice options listing five authors, namely four human authors and an artificial neural network. Respondents were informed that they were to distinguish human-written and machine-written texts. The correct answers were not shown to the respondents. Table 5 shows the results of the survey for English texts and Table 8 for Russian ones. Higher values in every row correspond to the options that were more popular among the respondents when they were presented with a text written by the author listed in the first column of the table.

Serendipity | English | Russian
peculiar wording | Neuro-Shakespeare: a sense i may not comprehend / of whom i had not to defend | Neuro-Pushkin: во славу вакха или тьмы / мы гордо пировали
apophenic plot | Neuro-Marley: oh lord i know how long i d burn / take it and push it it s your turn | Neuro-Esenin: ты под солнцем стоишь и в порфире / как в шелку беззаботно горишь
apophenic metaphor | Neuro-Carroll: your laugh is bright with eyes that gleam / that might have seen a sudden dream | Neuro-Zemfira: ветер в голове / с красной тенью шепчется
peculiar sentiment | Neuro-Muse: i am the man of this universe / i remember i still am a curse | Neuro-Letov: только в ушах отражается даль / только белая смерть превращается в ад

Table 8: Results of a survey with 178 respondents. Shares of each of the 5 different answers given by people when reading an excerpt of a poetic text by the author stated in the first column. The two biggest values in each row are marked with * and ** and a bold typeface.