Generation of Hip-Hop Lyrics with Hierarchical Modeling and Conditional Templates

This paper addresses Hip-Hop lyric generation with conditional Neural Language Models. We develop a simple yet effective mechanism to extract and apply conditional templates from text snippets, and show—on the basis of a large-scale crowd-sourced manual evaluation—that these templates significantly improve the quality and realism of the generated snippets. Importantly, the proposed approach enables end-to-end training, targeting formal properties of text such as rhythm and rhyme, which are central characteristics of rap texts. Additionally, we explore how generating text at different scales (e.g. character-level or word-level) affects the quality of the output. We find that a hybrid form—a hierarchical model that aims to integrate Language Modeling at both word and character-level scales—yields significant improvements in text quality, yet surprisingly, cannot exploit conditional templates to their fullest extent. Our findings highlight that text generation models based on Recurrent Neural Networks (RNN) are sensitive to the modeling scale and call for further research on the observed differences in effectiveness of the conditioning mechanism at different scales.


Introduction
Neural network approaches to text generation have recently proliferated, partly due to substantial progress in Language Modeling. Being essentially a generative model, a Language Model (LM) is fit by definition to drive natural language generation systems. LMs based on neural architectures, such as RNNs, ConvNets or self-attentive models such as Transformers, provide better fits to the underlying data distributions of the training material (currently holding the state of the art on common benchmarks) and are also assumed to produce more realistic text than their count-based counterparts (Karpathy, 2016).
The end-to-end nature of such models and the ability to leverage out-of-domain data through pretraining have broadened the range of domains in which text generation systems are being developed and applied. In particular, interest has emerged or increased in the generation of artistic and semi-structured text such as poetry (Zhang and Lapata, 2014; Yan, 2016), literature (Manjavacas et al., 2017), song lyrics (Watanabe et al., 2018) or cooking recipes (Kiddon et al., 2016). In the present paper, we focus on generating Hip-Hop lyrics, a genre known for its relatively liberal formal properties (e.g. rhythm and rhyme) and topic specificity.
While full algorithmic modeling of the high-level creative process of song composition remains a challenge, we seek to improve the quality of the generated text by focusing on formal text properties. Typically, generating text with formal structure is done by applying constraints over the LM output distribution. By contrast, in this paper, we follow an end-to-end approach to generate text snippets that directly match the required structure. Our proposal makes use of templates based on sentence-level conditions (Ficler and Goldberg, 2017) that allow us to enforce rhyme and verse structure as it naturally occurs in training data. Our focus on Hip-Hop allows us to crowd-source an extensive collection of authenticity judgments through an online pseudo-Turing serious game. The evaluation shows the effectiveness of the approach, which brings human guessing performance down to chance level.
Finally, while architectural improvements in Neural LMs target both character and word-level modeling, the application of LMs to artistic text generation has mostly focused on the word level. This situation is more likely the result of words being a central component of the creative process (e.g. topic, style and concepts are best modeled at the word level; Ghazvininejad et al., 2016; Yan, 2016) than of a lack of generation capabilities of character-level LMs (Karpathy et al., 2015). Moreover, despite a few exceptions (Jagfeld et al., 2018), comparisons of LM-based generation systems at different scales are not common. To fill this gap, we explore the effects of modeling scale on text generation quality, comparing character-level, syllable-level and hierarchical LMs (HLM), as well as the interplay between modeling scale and the proposed conditional template approach.
More specifically, we make the following contributions. (i) We introduce a simple approach to template-based generation, suitable for genres with a loose formal structure such as poetry or Hip-Hop. Crucially, this approach does not require search or constrained decoding to generate formally correct output. (ii) We present a comparison of unconstrained language generation with LMs at different scales and provide empirical evidence that hierarchical modeling produces more realistic output than both character-level and word-level modeling. (iii) We find that the success of the conditioning mechanism is dependent on the LM scale and the type of condition. In particular, we find that the gains from hierarchical modeling do not compound with the benefits obtained from the conditioning mechanism, which calls for further research on the matter.

Related Work
Much research has been devoted to poetry generation systems, and the field has reached a considerable degree of maturity (see Gonçalo Oliveira, 2017). A variety of approaches based on LMs have been proposed, including both Markov Models (Barbieri et al., 2012) and RNNs (Zhang and Lapata, 2014; Ghazvininejad et al., 2016; Yan, 2016; Hopkins and Kiela, 2017; Lau et al., 2018). In the literature on Hip-Hop lyric generation, besides an RNN-based LM (Potash et al., 2015), researchers have explored retrieval-based approaches where a Support Vector Machine is trained to select the continuing sentence based on formal properties of the text and global semantic coherence (Malmi et al., 2016). Moreover, various strategies have been proposed to generate text that matches specific verse structures. For example, Zhang and Lapata (2014) follow a generate-and-select approach that discards non-rhyming lines, and Ghazvininejad et al. (2016)

Dataset
The training data for this study was derived from the Original Hip-Hop (Rap) Lyrics Archive (OHHLA), an online archive that has documented Hip-Hop since 1992 and offers a large collection of Hip-Hop lyrics. A total of 64,542 songs were collected. The database contains almost exclusively English songs, although code-switching is common. The final corpus is the result of the following pre-processing steps. First, each text was tokenized using the Ucto tokenizer (Van Gompel et al., 2012). Second, all words were segmented into syllables using an in-house LSTM-based syllabifier trained on the CMU Pronouncing Dictionary (Lenzo, 2007). The syllabifier's segmentation accuracy is well over 99% on both a held-out development set and a test set, supporting confidence in its application to the lyric data. Finally, we applied the G2P toolkit to extract phonological representations of words and the corresponding stress patterns, which are exploited during training. The syllabified corpus consists of 43,531,133 syllables comprising 89,337 syllable types. A summary of overall corpus statistics is shown in Table 1 and Table 2.

LM-based Text Generation
We generate text by sampling from an LM implemented on top of LSTM networks (Hochreiter and Schmidhuber, 1997) and trained to predict the next symbol $y_t$ in the sequence given the history, using the following definition:

$$P(y_1, \ldots, y_n) = \prod_{t=1}^{n} P(y_t \mid y_1, \ldots, y_{t-1}) \quad (1)$$

Regardless of the details of specific architectures, sampling is done in the following way. Let $x_k$ be the activation for the $k$-th vocabulary symbol at the penultimate layer (i.e. before the output softmax layer). At any given step, we sample from the multinomial distribution defined by:

$$p_k = \frac{\exp(x_k / T)}{\sum_{j=1}^{|V|} \exp(x_j / T)} \quad (2)$$

where $V$ refers to the vocabulary and $T$ is a temperature parameter controlling the model's confidence. We leave other sampling approaches such as Top-K Sampling (Fan et al., 2018) and Nucleus Sampling (Holtzman et al., 2019) for future work.
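The sampling step can be illustrated with a short sketch. It assumes the usual temperature-scaled softmax over the pre-softmax activations; the function names, the toy vocabulary and the toy scores are hypothetical and are not taken from the authors' implementation.

```python
import math
import random


def temperature_softmax(scores, temperature=1.0):
    """Map pre-softmax scores x_k to exp(x_k / T) / sum_j exp(x_j / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in scores]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_symbol(vocab, scores, temperature=1.0, rng=random):
    """Draw one symbol from the temperature-scaled distribution."""
    probs = temperature_softmax(scores, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


vocab = ["a", "b", "c"]    # toy vocabulary (illustrative only)
scores = [2.0, 1.0, 0.1]   # toy activations (illustrative only)
sharp = temperature_softmax(scores, temperature=0.5)
flat = temperature_softmax(scores, temperature=2.0)
```

A low temperature concentrates probability mass on the highest-scoring symbol, while a high temperature flattens the distribution and yields more varied but riskier samples.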

Hierarchical Language Model
LMs are typically trained at the character or word level. Character-level modeling has the advantages of (i) reducing the vocabulary size and (ii) increasing the number of training examples available during model fitting, but it incurs the cost of enlarging the number of steps needed to account for a given dependency between any two input words. Arguably, a hybrid approach that models language at the character level but also incorporates word-level information flow should provide a way out of such a trade-off (Karpathy et al., 2015). Therefore, in the present paper, we compare text generation at three levels: character-level, syllable-level and a hierarchical LM (HLM) that integrates both levels. Note that we consider syllable-level instead of word-level modeling for two reasons. (i) Similar to sub-word models, such as those induced through Byte-Pair-Encoding (Sennrich et al., 2016) or SentencePiece (Kudo and Richardson, 2018), syllable-level segmented input helps limit the exploding vocabulary size of noisy corpora. (ii) Syllables play a more central role than words in a particularly rhythmic genre like Hip-Hop in which, moreover, a tendency towards monosyllabic words reduces the vocabulary differences with respect to word-level modeling.
As mentioned above, the key idea behind the HLM is to allow different layers to specialize in modeling the information flow at different scales. To achieve that, the HLM uses the chain rule of probability to decompose the probability of a sentence into the product of the probabilities of its words (exactly as in the word-level LM) and, furthermore, decomposes the probability of each word into the product of the probabilities of its characters:

$$P(w_1, \ldots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \ldots, w_{t-1}) = \prod_{t=1}^{n} \prod_{i=1}^{|w_t|} P(c_t^{i} \mid c_t^{1}, \ldots, c_t^{i-1}, w_1, \ldots, w_{t-1}) \quad (3)$$

We implement the HLM with LSTM layers at different scales. First, a bidirectional $\mathrm{LSTM}_{inp}$ takes the sequence of character embeddings of the current word $w_t$ and produces word-level features by concatenating the final activations of the forward and backward passes. Second, $\mathrm{LSTM}_{word}$ takes the word-level feature vector $\mathbf{w}_t$ and the recurrent state to generate sentence-level features $s_t = \mathrm{LSTM}_{word}(\mathbf{w}_t, s_{t-1})$. Finally, $\mathrm{LSTM}_{out}$ computes the vector of scores $x$ for the next character $c_{t+1}^{i+1}$ of the target word $w_{t+1}$ using the previously decoded character embedding $c_{t+1}^{i}$, $s_t$ and the recurrent state $h_{t+1}^{i}$:

$$x = W \cdot \mathrm{LSTM}_{out}(c_{t+1}^{i}, s_t, h_{t+1}^{i}) \quad (4)$$

where $W$ is a matrix that maps the LSTM output to the vocabulary space.

HLM is a specific case of the Hierarchical Multi-scale LSTM by Chung et al. (2016), with the differences that HLM uses a fixed segmentation at syllable boundaries instead of implicitly learning a segmentation model, and that HLM only considers bottom-up information passing across layers. Interestingly, despite the simplification, HLM achieves similar results on the Penn Treebank benchmark corpus (see Table 3).
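The word-then-character factorization described above can be made concrete with a minimal sketch. It only illustrates the bookkeeping of the chain rule, not the network itself: the nested list of probabilities stands in for the per-character outputs of a trained model, and both the function name and the toy numbers are hypothetical.

```python
import math


def sentence_log_prob(char_probs_per_word):
    """Sum log-probabilities over words and, within each word, over characters.

    char_probs_per_word[t][i] is assumed to hold the conditional probability of
    character i of word t given the earlier characters of that word and all
    previous words.
    """
    total = 0.0
    for word_char_probs in char_probs_per_word:
        # log P(word | previous words) is the sum of its character log-probabilities.
        total += sum(math.log(p) for p in word_char_probs)
    return total


# Toy conditional probabilities for a two-word sentence (illustrative only).
toy = [[0.5, 0.4], [0.9, 0.8, 0.5]]
log_p = sentence_log_prob(toy)
```

Working in log space avoids numerical underflow, which would otherwise occur quickly when multiplying many character-level probabilities.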

Conditional Templates
Recent research has shown the effectiveness of a conditioning mechanism for controlled text generation (Ficler and Goldberg, 2017), which uses specific sentence-level information during training (e.g. tense, mood, sentiment or formal/informal style) to bias the generation towards text that reflects such conditions. Such conditional information is encoded using condition embeddings and is fed into the LMs through vector concatenation. More formally, let $c$ be a given condition with $N_c$ assignments. For each $c$ we allocate an embedding matrix $C_c \in \mathbb{R}^{N_c \times d}$. During training, each model input embedding is concatenated with a vector of condition embeddings $\mathbf{c} = [c_1; \ldots; c_m]$ representing the conditional information corresponding to the input sentence. We apply this conditioning mechanism, in the form of conditional templates, to the task of generating Hip-Hop lyrics. The idea behind conditional templates is to leverage the training material to bias the generation towards more realistic output. Consider the task of generating a verse consisting of $m$ lines of Hip-Hop. In such a case, we sample a verse of $m$ lines from the training corpus and apply the corresponding conditions to a conditionally trained LM. The next question is what sentence-level information can be easily extracted and used to improve the quality and realism of the output. In the case of Hip-Hop, we focus on two formal characteristics that most typically represent Hip-Hop lyrics: rhythm and rhyme (Condit-Schultz, 2017).
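The concatenation of input and condition embeddings can be sketched as follows. This is a plain-Python illustration under stated assumptions: the random tables stand in for learned embedding matrices, and the table sizes, condition names and helper functions are hypothetical rather than taken from the released code.

```python
import random


def make_embedding_table(num_rows, dim, rng):
    """Random table standing in for a learned embedding matrix of shape (num_rows, dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def conditioned_input(symbol_vec, condition_tables, assignments):
    """Concatenate one input embedding with one embedding per condition."""
    vec = list(symbol_vec)
    for name, table in condition_tables.items():
        vec.extend(table[assignments[name]])
    return vec


rng = random.Random(0)
dim = 4  # toy dimensionality (illustrative only)
symbol_table = make_embedding_table(10, dim, rng)
condition_tables = {
    "length": make_embedding_table(4, dim, rng),  # toy number of length buckets
    "rhyme": make_embedding_table(6, dim, rng),   # toy number of phonological endings
}

# The same line-level conditions are attached to every symbol of the line.
line_conditions = {"length": 2, "rhyme": 5}
inputs = [
    conditioned_input(symbol_table[s], condition_tables, line_conditions)
    for s in [3, 1, 7]
]
```

Because the condition vector is repeated at every step, the network can consult the target length and rhyme at any point in the line, not only at its start.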
Rhythm in Hip-Hop is characterized by a strict alignment between beat and stress, with high correspondence between syntactic units and measures and a relatively stable ratio of syllables per beat (Adams, 2009). In order to approximate this stylistic feature,[5] we condition our LMs on a measure of verse length. In particular, we count the number of syllables of each line in the verse and bucket them according to the following ranges: < 10, 10-15, 15-20 and > 20.
[5] The approximation lies in the fact that we ignore stress patterns in the template source. Initially, we experimented with conditioning on line-level stress patterns extracted through a cluster analysis but found the approach inconclusive. The difficulty stems from the fact that Hip-Hop artists commonly shift the word stress in order to align it to the underlying beat, and such misalignment cannot be recovered from text alone.
Rhyme: Hip-Hop employs liberal rhyme patterns in terms of placement (e.g., off-beat, syncopated rhyme, etc.) and often relies on imperfect matches (e.g. slant rhyme; Adams, 2009). To mimic such rhyming style, we condition LMs on phonological endings, which we define, in alignment with a loose notion of rhyme, as the syllabic nucleus of the last stressed syllable followed by the syllabic nuclei of any following syllables. For example, the rhyme-based condition corresponding to the line 'unite around the corner' is AO1-ER0, i.e. the ARPABET representations of the syllabic nuclei of the stressed 'cor-' and the following '-ner'.[6] Such a representation is then shared with other rhyming words such as 'daughter' or 'offer'. A successfully trained conditional LM can thus generate rhymes when the same phonological condition is passed to the networks for two consecutive lines. Similarly, templates from the corpus contain rhyming schemes and patterns (such as AABB, ABAB, etc.) that the conditional models can exploit for a more realistic effect. Table 4 shows example generations from conditional models at all three considered scales.
[6] We focus on generating rhymes in verse-final position, which represents the most abundant type. We extract a total of 430 such phonological endings in our corpus, of which only 270 involve an actual rhyme in the training corpus. Interestingly, however, we observed that the conditional models generalize so as to generate rhymes on phonological endings that have not been seen during training.

Model Training Details
We implement all models in PyTorch (Paszke et al., 2017), with the following parametrizations. Input and condition embedding layers have a dimensionality of 100. Non-hierarchical models have 2 hidden LSTM layers with 640 units per layer. By definition, the HLM has 2 LSTM layers in addition to the bidirectional LSTM layer that computes character-level word embeddings. For replication purposes, our implementation is available online.[7] We trained all models with a cross-entropy objective targeted at predicting the next symbol in the sequence. Parameter optimization was done using the Adam optimizer (Kingma and Ba, 2015) with default hyperparameters. Models are regularized using dropout (Srivastava et al., 2014) on the input embeddings, variational dropout (Gal and Ghahramani, 2016) on hidden recurrent layers, and the default L2 penalty on model parameters. Finally, we stop training based on an early-stopping criterion computed after each epoch on held-out data. Table 5 shows the total number of model parameters and the development perplexity per configuration.
[7] Code can be found at the following URL: https://www.github.com/emanjavacas/hierarchical-lm.

Evaluation
Our first evaluation concerns the quality of the Hip-Hop snippets generated by each of the six architectures (three modeling scales, each with a conditioned variant). We focus on the effectiveness of the conditional template approach and hierarchical modeling.
Evaluating artistic text generation poses additional challenges, mostly due to the absence of reference text against which a model output can be compared. While some authors rely on questionnaires addressing poetic properties of interest (e.g., "poeticness", "grammaticality", "meaningfulness") for evaluation (Das and Gambäck, 2014), we instead turn to a Turing-like setup that allowed us to crowdsource a large-scale pool of user authenticity judgments. In order to encourage user participation, we implemented a serious game where participants were shown Hip-Hop samples of 3 to 4 lines and were asked to guess within 15 seconds whether the displayed text was generated or real.[8] Participants were motivated by being shown feedback immediately after each answer. Furthermore, the game entered a "sudden-death" phase after the first ten guesses, in which a wrong answer would end the game. Finally, a leader-board was kept visible, showing the scores of the ten best performing participants. The resulting dataset underlying the present evaluation comprises 3,620 guesses by 670 participants.
In order to leverage the collected evaluations, we model guessing behavior using a Logistic Regression model (implemented in brms, Bürkner and others, 2017), taking into account user-specific variability through the inclusion of varying intercepts (i.e. for each participant, we use a unique intercept parameter). Our evaluation strategy contrasts with similar approaches in the literature (e.g. Netzer et al., 2009), which typically only provide raw empirical, single point estimates. Regularized estimates obtained from a varying intercepts model provide more accurate estimates for individual user intercepts, enabling predictions about future behavior that are less prone to both over- and underfitting (cf. McElreath, 2015). Additionally, generation scale, conditioning and their interaction are modeled as fixed effects.
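One way to write down a varying-intercepts model of this kind is sketched below. The notation is an assumption made for illustration only: $y_{ij}$ indicates whether participant $j$ guessed snippet $i$ correctly, $\alpha_j$ is the participant-specific intercept, $s(i)$ and $c(i)$ denote the modeling scale and conditioning status of the model behind snippet $i$, and the $\beta$ terms are the corresponding fixed effects. The priors actually used are not stated in the text and are therefore left out.

```latex
\begin{align}
y_{ij} &\sim \mathrm{Bernoulli}(p_{ij}) \\
\mathrm{logit}(p_{ij}) &= \alpha + \alpha_{j}
  + \beta^{\mathrm{scale}}_{s(i)}
  + \beta^{\mathrm{cond}}_{c(i)}
  + \beta^{\mathrm{scale} \times \mathrm{cond}}_{s(i),\, c(i)} \\
\alpha_{j} &\sim \mathrm{Normal}(0, \sigma_{\alpha})
\end{align}
```

Under such a formulation, the shared scale $\sigma_{\alpha}$ pulls the intercepts of participants with few guesses towards the population mean, which is the regularizing effect referred to above.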
As shown in Figure 1, hierarchical modeling outperforms both character and syllable-level models in the unconditioned setup, with the median guessing accuracy dropping to 54.6%. Moreover, conditional templates push guessing accuracy further down for all models, with HLM and syllable-level achieving a median accuracy of 51.9% and 49.4%, respectively. Interestingly, the effect of the conditional templates differs across models. The smallest effect is observed for the hierarchical model (decrease of 2.6 points), followed by the character-level model (decrease of 6.7 points), while the effect on the syllable-level model corresponds to a decrease of 13.4 points. The relatively high impact of conditioning on syllable-level generation contrasts sharply with the much smaller improvement on both the character-level model and the HLM.
At first sight, this result seems to suggest that conditioning is more effective at higher modeling levels, perhaps hinting at optimization incompatibilities between sentence-level conditioning and character-level training objectives. However, in order to better understand this result, we pursue the following two questions. First, why are conditional templates much more effective at a higher-level scale (i.e., syllable-level)? Second, what textual properties characterize text generated by different models, and which of them can explain the better performance of the hierarchical model in particular?
[8] Generated text was sampled at random from one of the six models.

Modeling scale and conditioning
To better understand the divergent effectiveness of conditioning at different scales, we investigate to what extent different models succeed at generating text that matches the conditions required by the template. Note that the benefits of a conditioned model might not be restricted to its ability to fulfill the target template conditions; for example, rhyme and rhythm information result in a better fit to the data, as shown in Table 5. However, successfully replicating formal structures seen in the training data ensures a level of realism by definition and can thus be interpreted as at least a partial explanation for the observed differences in performance.
In order to quantify the ability of models to successfully generate the requested templates, we generate a dataset of lines exploring the space of possible templates. For each of the 2150 combinations (430 rhyming conditions by 5 length buckets), we sample 1000 lines. We then syllabify each of the lines using the same pre-processing pipeline described in Section 3. Subsequently, rhyme generation accuracy (Acc) can be quantified by the proportion of generated lines with the expected phonological ending. Moreover, we also quantify rhyme diversity (H), i.e. the entropy of the distribution of successfully generated rhyme words. Finally, in order to quantify the ability to meet the target verse length, we compute the average difference in syllables between the generated verse length and the corpus-level average length per bucket (Diff).
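The three measures can be sketched in a few lines, together with the extraction of a phonological ending from an ARPABET transcription. Several choices here are assumptions rather than details given in the text: "stressed" is taken to mean primary stress (digit 1), the entropy uses the natural logarithm, Diff is computed as a signed difference, and the transcriptions are written by hand following CMU Pronouncing Dictionary conventions. All function names are hypothetical.

```python
import math
from collections import Counter


def phonological_ending(phones):
    """Nucleus of the last stressed syllable plus all later nuclei, joined by '-'."""
    # In ARPABET, vowels (syllabic nuclei) carry a stress digit: 0, 1 or 2.
    nuclei = [p for p in phones if p[-1].isdigit()]
    if not nuclei:
        return ""
    # Assumption: "stressed" means primary stress; otherwise use the last nucleus.
    stressed = [i for i, p in enumerate(nuclei) if p.endswith("1")]
    start = stressed[-1] if stressed else len(nuclei) - 1
    return "-".join(nuclei[start:])


def rhyme_accuracy(endings, target):
    """Acc: proportion of generated lines whose ending matches the condition."""
    return sum(e == target for e in endings) / len(endings)


def rhyme_entropy(rhyme_words):
    """H: entropy (in nats) of the distribution of successfully generated rhyme words."""
    counts = Counter(rhyme_words)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def length_diff(generated_lengths, corpus_mean_length):
    """Diff: mean signed difference between generated and corpus-average length."""
    return sum(n - corpus_mean_length for n in generated_lengths) / len(generated_lengths)


# Final words of four toy generated lines with hand-written transcriptions.
final_words = [
    ("corner", ["K", "AO1", "R", "N", "ER0"]),
    ("daughter", ["D", "AO1", "T", "ER0"]),
    ("offer", ["AO1", "F", "ER0"]),
    ("money", ["M", "AH1", "N", "IY0"]),
]
target = "AO1-ER0"
endings = [phonological_ending(phones) for _, phones in final_words]
acc = rhyme_accuracy(endings, target)
hits = [word for (word, _), e in zip(final_words, endings) if e == target]
diversity = rhyme_entropy(hits)
diff = length_diff([8, 9, 7], corpus_mean_length=9.0)
```

A negative Diff in this sketch means that the generated lines are on average shorter than the corpus lines of the same bucket.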
The results in Table 6 show that the syllable-level model is, in fact, the most accurate and diverse at generating rhyme by a large margin. The HLM achieves higher accuracy than the character-level model but similar diversity. Overall, rhyme diversity is notably lower in generated text than in real text (H = 1.669), a result that is in line with expectations. In terms of rhythm, we observe a different picture: the character-level model generates lines much closer in length to the observed data than both the HLM and the syllable-level model. Of the latter two, the HLM improves over the syllable-level model, but both tend to produce shorter lines.
These results seem to suggest that character-level RNNs excel at modeling surface-level information responsible for estimating the current number of processed symbols, and can thus very accurately replicate the verse lengths observed in the training data. Moreover, it seems that the syllable-level model can derive a more substantial improvement from the conditional templates because rhyming patterns have an arguably more prominent impact on the perceived realism; however, we leave an analysis of perceived realism for future work. Finally, it appears that in terms of conditioning, our hierarchical model does not succeed in exploiting the best of both worlds.

Modeling scale and text quality
We now turn to characterizing the effect of modeling text at different scales and, more specifically, the textual properties that single out hierarchically generated text. To approach this question, we use the unconditioned output from the main experiment and conduct a feature analysis across the following linguistic levels:
Prosody: We quantify rhyme as the proportion of rhyming lines in the snippet, assonance as the proportion of the most frequent stressed syllable nucleus over the total number of syllables, and alliteration as the proportion of consecutive words with equal consonant onset.
Morphology: We approximate the morphological complexity of the text with the average word length in syllables.
Lexical level: Lexical diversity is measured using entropy based on overall word distributions. Moreover, we also quantify the proportion of consecutive word repetitions, which represent a common artifact in LM-based text generation (Holtzman et al., 2019).
Syntax: We approximate syntactic complexity with the mean tree depth of the dependency parse trees of the snippet lines. Parse trees are extracted using the dependency parser provided by AllenNLP, based on Dozat and Manning (2016) and trained on the PTB3.0 corpus.
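Some of the surface features listed above can be sketched in plain Python. The exact normalizations are not specified in the text, so the denominators used here (number of adjacent word pairs, number of words) are assumptions, as are the function names and the toy snippet; the onsets and syllabifications are supplied by hand instead of coming from the pre-processing pipeline.

```python
import math
from collections import Counter


def repetition_rate(words):
    """Proportion of adjacent word pairs in which the same word is repeated."""
    if len(words) < 2:
        return 0.0
    pairs = list(zip(words, words[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)


def alliteration_rate(onsets):
    """Proportion of adjacent words that share the same non-empty consonant onset."""
    if len(onsets) < 2:
        return 0.0
    pairs = list(zip(onsets, onsets[1:]))
    return sum(a == b and a != "" for a, b in pairs) / len(pairs)


def lexical_entropy(words):
    """Entropy (in nats) of the word distribution of a snippet."""
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def mean_word_length(syllabified_words):
    """Average number of syllables per word."""
    return sum(len(w) for w in syllabified_words) / len(syllabified_words)


# Toy snippet with hand-assigned onsets and syllables (illustrative only).
words = ["money", "money", "makes", "the", "world"]
onsets = ["m", "m", "m", "dh", "w"]
syllables = [["mo", "ney"], ["mo", "ney"], ["makes"], ["the"], ["world"]]

features = {
    "repetition": repetition_rate(words),
    "alliteration": alliteration_rate(onsets),
    "lexical_entropy": lexical_entropy(words),
    "word_length": mean_word_length(syllables),
}
```

A dictionary of this kind, computed per snippet, is the sort of feature vector that a classifier can then use to tell the generating models apart.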
Based on these features, we fit a Random Forest to classify the model underlying the corresponding text snippet. We use the machine learning library scikit-learn (Pedregosa et al., 2011) for the implementation and extract feature importance scores following the feature permutation approach detailed in Parr et al. (2018). Furthermore, in order to extract feature-class associations (i.e., which class each feature is most predictive of), odds-ratios are computed based on a linear model taking character-level as the reference class. The resulting Random Forest achieves 91.7% out-of-bag accuracy, which indicates that the feature set has sufficiently large coverage. Figure 2 ranks features by importance scores. As we can see, word length is by far the strongest predictor. The feature is most strongly associated with the HLM and slightly less with syllable-level modeling. Following word length, we encounter syntax (mean tree depth) and lexical diversity, which again are mostly associated with the HLM, with odds-ratios in favour of the HLM amounting to 3.8 and 2.4, respectively. Furthermore, prosodic features, in particular assonance, play a role in distinguishing character-level output from the other models. Finally, word repetition and rhyme density show near-zero importance scores.
Based on the present feature analysis, it can be inferred that one of the main advantages of hierarchical modeling relates to increased lexical diversity, which is further boosted by the ability to generate longer and more morphologically complex words. On the other hand, character-level modeling seems to be better characterized by surface-level prosodic features (in particular assonance). This analysis connects with the interpretation advanced in Section 5.1, in which character-level modeling was shown to provide an accurate replication of a related surface-level textual property: rhythm as captured by verse length.

Discussion & Conclusion
Based on a large-scale evaluation involving hundreds of participants and authenticity judgments, we have shown that the modeling scale influences the quality of generated Hip-Hop lyrics. A feature analysis shows that hierarchically generated text displays morphologically and syntactically more complex output as well as higher lexical diversity. All such properties may help explain the better scores achieved by the hierarchical model in the absence of conditioning.
Furthermore, the proposed end-to-end approach to enforcing formal structure in texts has similarly proved effective. It reduces the human guessing accuracy for all models and is particularly effective in the case of syllable-level modeling. Moreover, our analysis of the interplay between modeling scale and conditioning showed that syllable-level modeling displays a much greater ability to exploit rhyme templates than the other, lower-scale modeling variants. This advantage can help to explain the more pronounced effect of conditional templates on syllable-level modeling, considering that rhyme patterns arguably contribute more strongly to the realism of a generated snippet. And yet, character-level modeling scored much better than the other models at generating the requested verse lengths and was shown by the feature analysis to be positively associated with prosodic features such as assonance. Both results thus seem to suggest that character-level modeling has an edge at capturing surface-level information. The overall lower performance of the character-level model implies, however, that such an advantage does not translate into improved realism as perceived by the participants.
Finally, the evaluation shows that despite the already mentioned advantages of hierarchical modeling, the effects of conditional templates did not compound in this case. This result is somewhat discouraging, since the primary motivation of hierarchical modeling is to overcome deficiencies of both character and word-level modeling. The analysis in Section 5.1 shows that our implementation of the conditioned HLM scores in between the other two models. Future research might be able to overcome this drawback by carefully designing adaptive mechanisms that let the model decide to which layer in the hierarchy a particular type of sentence-level conditional embedding should be fed.

Future Work
Our study suggests several directions for future work. The most urgent issue, briefly touched upon above, concerns an investigation of improved hierarchical architectures that can exploit conditioning information better than either character-level or word-level models in isolation. Moreover, the positive results obtained for the hierarchical model in isolation encourage scaling up the modeling hierarchy, investigating the inclusion of higher scales such as the stanza and document level. Furthermore, while we have only considered templates covering formal aspects of the text, the same approach can be extended to include content features such as keywords or stanza-level topic information. Finally, in this study, we have restricted ourselves to relatively short snippets of text, but future work should move on to an evaluation of more substantial text portions.