Rigid Formats Controlled Text Generation

Neural text generation has made tremendous progress in various tasks. A common characteristic of most of these tasks is that the generated text is not restricted to a rigid format. However, we may confront some special text paradigms such as Lyrics (assuming the music score is given), Sonnet, SongCi (classical Chinese poetry of the Song dynasty), etc. The typical characteristics of these texts are threefold: (1) They must comply fully with the rigid predefined formats. (2) They must obey some rhyming schemes. (3) Although they are restricted to some formats, the sentence integrity must be guaranteed. To the best of our knowledge, text generation based on predefined rigid formats has not been well investigated. Therefore, we propose a simple and elegant framework named SongNet to tackle this problem. The backbone of the framework is a Transformer-based auto-regressive language model. Sets of symbols are tailor-designed to improve the modeling performance, especially on format, rhyme, and sentence integrity. We improve the attention mechanism to impel the model to capture some future information on the format. A pre-training and fine-tuning framework is designed to further improve the generation quality. Extensive experiments conducted on two collected corpora demonstrate that our proposed framework generates significantly better results in terms of both automatic metrics and human evaluation.


Introduction
Recent years have seen tremendous progress in natural language generation, driven largely by neural network models such as Recurrent Neural Network (RNN) or Convolutional Neural Network (CNN) based sequence-to-sequence (seq2seq) frameworks (Bahdanau et al., 2014; Gehring et al., 2017), Transformer and its variants (Vaswani et al., 2017), and pre-trained auto-regressive language models such as XLNet and GPT2 (Radford et al., 2019). Performance has improved significantly in many tasks such as machine translation (Vaswani et al., 2017), dialogue systems (Vinyals and Le, 2015; Shang et al., 2015; Li, 2020), text summarization (Rush et al., 2015; Li et al., 2017; See et al., 2017), story telling (Fan et al., 2018; See et al., 2019), and poetry writing (Zhang and Lapata, 2014; Lau et al., 2018; Liao et al., 2019).

Figure 1: Examples of text with rigid formats (rows: Lyrics, SongCi, Sonnet). In lyrics, the syllables of the lyric words must align with the tones of the notation. In SongCi and Sonnet, there are strict rhyming schemes, and the rhyming words are labeled in red color and italic font. The Sonnet row shows: "Let me not to the marriage of true minds / Admit impediments, love is not love / Which alters when it alteration finds / Or bends with the remover to remove."

Code: http://github.com/lipiji/SongNet
Generally, most of the above-mentioned tasks can be regarded as free text generation, which means that there are no constraints on the format and structure, such as the number of words or the rhyming rules. Note that dialogue generation and story telling are almost open-ended, as long as the generated content is relevant to the conditional input text. Although there are format constraints on poetry text, the proposed models treat the format as a kind of latent information and let the model capture it implicitly during training (Liao et al., 2019). A model trained on a five-character quatrain corpus cannot generate seven-character verses. Moreover, it is impossible to make these models generate satisfying results for arbitrary, newly defined formats.
In practice, we may confront special text paradigms such as Lyrics (assuming the music score is given), Sonnet (say, Shakespeare's Sonnets (Shakespeare, 2000)), and SongCi (Ci is a type of lyric poetry in the tradition of Classical Chinese poetry,² and SongCi is the Ci created during the Song dynasty); some examples are illustrated in Figure 1. The typical characteristics of these texts are threefold: (1) The text must comply fully with the predefined rigid format. Once the music score is composed, the lyricist must fill in lyrics that strictly tally with the scheme in the notation. Take the part of the song "Edelweiss" shown in the first row of Figure 1 as an example: the syllables of the lyric words must align with the tones of the notation. The second row of Figure 1 depicts the content of a SongCi created based on the CiPai "Bu Suan Zi". Given the CiPai, the number of characters and the syntactical structure of the content are also defined (e.g., the number of characters of each clause: 5, 5. 7, 5. 5, 5. 7, 5.). (2) The arrangement of the content must obey the defined rhyming scheme. For example, all the final words (words in red color and italic font) of the SongCi content in Figure 1 rhyme (the spelling of each word is "zhu", "yu", "du", and "gu"). The example in the third row of Figure 1 is the first four lines of Shakespeare's "Sonnet 116" (Shakespeare, 2000). Usually, the rhyming scheme of Shakespeare's Sonnets is "ABAB CDCD EFEF GG".³ In the example, the rhyming words in scheme "ABAB" are "minds", "love", "finds", and "remove". (3) Even though the format is rigid, sentence integrity must always be guaranteed. An incomplete sentence such as "love is not the" is inappropriate.
To the best of our knowledge, text generation under predefined rigid format constraints has not been well investigated yet. In this work, we propose a simple and elegant framework named SongNet to address this challenging problem. The backbone of the framework is a Transformer-based auto-regressive language model. Considering the three characteristics mentioned above, we introduce sets of tailor-designed indicating symbols to improve the modeling performance, especially the robustness of format, rhyme, and sentence integrity. We improve the attention mechanism to impel the model to capture future information on the format, which further enhances sentence integrity. Inspired by BERT (Devlin et al., 2019) and GPT (Radford et al., 2018, 2019), a pre-training and fine-tuning framework is designed to further improve the generation quality. To verify the performance of our framework, we collect two corpora, SongCi and Sonnet, in Chinese and English respectively. Extensive experiments on the collected datasets demonstrate that our proposed framework can generate satisfying results in terms of both the tailor-designed automatic metrics, including format accuracy, rhyming accuracy, and sentence integrity, and the human evaluation results on relevance, fluency, and style.

² http://en.wikipedia.org/wiki/Ci_(poetry)
³ http://en.wikipedia.org/wiki/Shakespeare%27s_sonnets
In summary, our contributions are as follows:
• We propose to tackle a new challenging task: rigid formats controlled text generation. A pre-training and fine-tuning framework named SongNet is designed to address the problem.
• Sets of symbols are tailor-designed to improve the modeling performance. We improve the attention mechanism to impel the model to capture the future information to further enhance the sentence integrity.
• To verify the performance of our framework SongNet, we collect two corpora, SongCi and Sonnet, in Chinese and English respectively. We design several automatic evaluation metrics and human evaluation metrics to conduct the performance evaluation.
• Extensive experiments conducted on two collected corpora demonstrate that our proposed framework generates significantly better results given arbitrary formats, including the cold-start formats or even the formats newly defined by ourselves.

Task Definition
The task of rigid formats controlled text generation is defined as follows:

Input: a rigid format $C \in \mathcal{C}$:

$$C = \{c_0\ c_1\ c_2\ c_3\ \text{,}\ \ c_0\ c_1\ c_2\ c_3\ c_4\ c_5\ \text{.}\}$$

where $\mathcal{C}$ is the set of all possible formats. Note that we can define arbitrary new formats not restricted to the ones pre-defined in the corpus, thus $|\mathcal{C}| \to \infty$. Format token $c_i$ denotes a place-holder symbol of $C$ which needs to be translated into a real word token. Format $C$ contains 10 words plus two extra punctuation characters "," and ".".

Output: a natural language sentence $Y \in \mathcal{Y}$ which tallies with the defined format $C$:

$$Y = \text{love is not love, bends with the remover to remove.}$$
where the example sentences are extracted from Shakespeare's Sonnets (Shakespeare, 2000). From the result Y we can observe that the word count is 10, which is consistent with the format C. The punctuation characters "," and "." are also correct. Thus, we regard it as a result with 100% format accuracy. Also, since the two clauses are complete, it obtains a good sentence integrity score. If C is defined on the literary genres of SongCi or Sonnet, which have rhyming constraints, the rhyming performance should be evaluated as well. Recall that C can be arbitrary and flexible; thus we can build a new format C′ from the generated result Y by masking part of its content, say C′ = {c0 c1 c2 love, c0 c1 c2 c3 c4 remove.}, and we may obtain better results by re-generating based on C′. We name this operation polishing.
Finally, the target of this problem is to find a mapping function G to conduct the rigid formats controlled text generation:

$$Y = G(C)$$

Framework Description

Overview
As shown in Figure 2, the backbone of our framework is a Transformer-based auto-regressive language model. The input can be the whole token sequences of samples from SongCi or Sonnet. We tailor-design several sets of indicating symbols to improve accuracy on format, rhyme, and sentence integrity. Specifically, symbols C = {c_i} are introduced for format and rhyme modeling; intra-position symbols P = {p_i} represent the local positions of the tokens within each sentence, aiming to improve rhyming performance and sentence integrity; and segment symbols S = {s_i} identify the sentence borders to further improve sentence quality. The attention mechanism is improved to impel the model to capture future format information such as the sentence ending markers. Similar to BERT (Devlin et al., 2019) and GPT (Radford et al., 2018, 2019), a pre-training and fine-tuning paradigm is utilized to boost the performance of the original models.

Details
We use two sentences (as shown in Figure 1), "love is not love, ..., bends with the remover to remove", extracted from Shakespeare's Sonnets (Shakespeare, 2000) as an example to describe the details of our framework SongNet. Since our basic model is a Transformer-based auto-regressive language model, during training the input is "<bos> love is not love , </s> ... bends with the remover to remove . </s>", and the corresponding output is a left-shifted version of the input (tokenized, and we ignore "..." for convenience and clarity):

love is not love , </s> bends with the remover to remove . </s> <eos>

where </s> denotes the clause or sentence separator, and <eos> is the ending marker of the whole sequence. The target of our framework is formats controlled text generation. Therefore, the indicating symbols for format and rhyme as well as sentence integrity are designed based on the target output sequence.

Format and Rhyme Symbols:

$$C = \{c_0, c_0, c_0, c_2, c_1, \text{</s>}, c_0, c_0, c_0, c_0, c_0, c_2, c_1, \text{</s>}, \text{<eos>}\}$$

where we use {c_0} to represent general tokens, {c_1} to depict punctuation characters, and {c_2} to represent the rhyming tokens "love" and "remove". </s> and <eos> are kept.

Intra-Position Symbols:

$$P = \{p_4, p_3, p_2, p_1, p_0, \text{</s>}, p_6, p_5, p_4, p_3, p_2, p_1, p_0, \text{</s>}, \text{<eos>}\}$$

{p_i} denote the local positions of tokens within the same clause or sentence. Note that we arrange the position symbol indices in descending order. The aim is to improve sentence integrity by impelling the symbols to capture the sentence dynamics, precisely, the sense of when to end a sequence. For example, {p_0} usually denotes punctuation characters, thus {p_1} should be the ending words of sentences.

Segment Symbols:

$$S = \{s_0, s_0, s_0, s_0, s_0, \text{</s>}, s_1, s_1, s_1, s_1, s_1, s_1, s_1, \text{</s>}, \text{<eos>}\}$$

where s_i is the symbol index for sentence i. The purpose is to enhance the interactions between different sentences in different positions by defining sentence index features. During training, all the symbols as well as the input tokens are fed into the Transformer-based language model.
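As an illustration of how the three symbol sequences line up with the target output, the following is a minimal Python sketch. The function name `build_symbols`, the rule that the token just before the final punctuation is the rhyming word, and the choice to keep the separator token itself at separator positions of P and S are assumptions made for this example, not details confirmed by the released code.

```python
SEP, EOS = "</s>", "<eos>"


def build_symbols(clauses):
    """Build format/rhyme (C), intra-position (P) and segment (S) symbols.

    `clauses` is a list of token lists. Each clause is assumed to end with
    one punctuation token preceded by its rhyming word (a simplification).
    """
    C, P, S = [], [], []
    for i, clause in enumerate(clauses):
        n = len(clause)
        for j in range(n):
            if j == n - 1:
                C.append("c1")          # final punctuation character
            elif j == n - 2:
                C.append("c2")          # word in the rhyming position
            else:
                C.append("c0")          # general token
            P.append(f"p{n - 1 - j}")   # descending order: punctuation gets p0
            S.append(f"s{i}")           # index of the clause
        # Assumption: separator positions carry the separator token itself.
        C.append(SEP)
        P.append(SEP)
        S.append(SEP)
    C.append(EOS)
    P.append(EOS)
    S.append(EOS)
    return C, P, S


clauses = [
    ["love", "is", "not", "love", ","],
    ["bends", "with", "the", "remover", "to", "remove", "."],
]
for name, seq in zip("CPS", build_symbols(clauses)):
    print(name, seq)
```

Because P counts down towards p0, the model sees at every step how many tokens remain before the clause must close, which is the intuition the text gives for the descending order.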
In contrast to Transformer (Vaswani et al., 2017), BERT (Devlin et al., 2019), and GPT2 (Radford et al., 2019), we modify the traditional attention strategies slightly to fit our problem.
Specifically, for the input, we first obtain the representations by summing all the embeddings of the input tokens and symbols, as shown in the red solid box of Figure 2:

$$H^0_t = E_{w_t} + E_{c_t} + E_{p_t} + E_{s_t} + E_{g_t} \tag{6}$$

where the superscript 0 is the layer index and t is the state index. $E_*$ is the embedding vector for input $*$. $w_t$ is the real token at position t. c, p, and s are the three pre-defined symbols. g is the global position index, the same as the position symbols used in Transformer (Vaswani et al., 2017). Moreover, the state at time t needs to know some future information to grasp the global sequence dynamics. For example, the model may want to know whether it should close the decoding process by generating the last word and a punctuation character to end the sentence. To represent this global dynamic information, we introduce another variable $F^0$ by summing only the pre-defined symbols, as shown in the blue dashed box of Figure 2:

$$F^0_t = E_{c_t} + E_{p_t} + E_{s_t} \tag{7}$$

After processing the input, two blocks of attention mechanisms are introduced to conduct the feature learning procedure. The first block is a masking multi-head self-attention component, and the second block is named global multi-head attention.
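Masking Multi-Head Self-Attention:

$$
\begin{aligned}
Q^0,\ K^0,\ V^0 &= H^0 W^Q,\ H^0 W^K,\ H^0 W^V \\
\tilde{C}^1_t &= \mathrm{LN}\big(\text{SLF-ATT}(Q^0_t, K^0_{\le t}, V^0_{\le t}) + H^0_t\big) \\
C^1_t &= \mathrm{LN}\big(\mathrm{FFN}(\tilde{C}^1_t) + \tilde{C}^1_t\big)
\end{aligned} \tag{8}
$$

where SLF-ATT(·), LN(·), and FFN(·) represent the self-attention mechanism, layer normalization, and feed-forward network respectively, and $\tilde{C}^1_t$ is the intermediate state after the attention sub-layer. Note that we only use the states whose indices are ≤ t as the attention context. After obtaining $C^1_t$ from Equation (8), we feed it into the second attention block to capture the global dynamic information from $F^0$.

Global Multi-Head Attention:

$$
\begin{aligned}
Q^1 &= C^1 W^Q, \qquad K^1,\ V^1 = F^0 W^K,\ F^0 W^V \\
\tilde{H}^1_t &= \mathrm{LN}\big(\text{GLOBAL-ATT}(Q^1_t, K^1, V^1) + C^1_t\big) \\
H^1_t &= \mathrm{LN}\big(\mathrm{FFN}(\tilde{H}^1_t) + \tilde{H}^1_t\big)
\end{aligned} \tag{9}
$$

We can observe that all the context information from $F^0$ is considered. This is the reason why we name it "global attention" and why the input real token information $E_{w_t}$ is NOT considered. This completes the computation of the first model layer. We iteratively apply these two attention blocks over all L model layers to obtain the final representations $H^L$. Note that H is updated layer by layer, whereas the global variable $F^0$ is fixed. Finally, the training objective is to minimize the negative log-likelihood over the whole sequence:

$$\mathcal{L}^{nll} = -\sum_{t=1}^{n} \log P(y_t \mid y_{<t}) \tag{10}$$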
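As an illustration of the two attention blocks described above, the following is a simplified, pure-Python sketch of how their contexts differ. It uses a single head with identity projections and omits layer normalization, the feed-forward network, and the residual connections; all function names are hypothetical. It only contrasts the two contexts: the self-attention block lets state t see states with index ≤ t, while the global block lets state t see every position of the symbol-only states F.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (single head)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights


def masked_self_attention(H):
    """State t attends only to states with index <= t."""
    return [attend(H[t], H[: t + 1], H[: t + 1])[0] for t in range(len(H))]


def global_attention(C, F):
    """State t attends to all positions of F, including future ones."""
    return [attend(C[t], F, F)[0] for t in range(len(C))]


# Toy states: H mixes token and symbol information, F holds symbols only.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
F = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

C1 = masked_self_attention(H)
H1 = global_attention(C1, F)
print(C1[0])   # depends on H[0] only
print(H1[0])   # depends on every position of F
```

In this toy setting, changing the last row of H leaves the first two rows of `C1` untouched, whereas changing the last row of F changes `H1[0]`; this is the sense in which the second block exposes future format information to earlier states.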

Pre-training and Fine-tuning
Although our framework can be trained purely on the training set of the target corpus, the scale of such a corpus is usually limited. For example, there are only about 150 samples in the corpus of Shakespeare's Sonnets (Shakespeare, 2000). Therefore, we also design a pre-training and fine-tuning framework to further improve the generation quality. Recall that in the task definition in Section 2 we claimed that our model is capable of refining and polishing. To achieve this goal, we adapt the masking strategy used in BERT (Devlin et al., 2019) to our framework according to our definitions. Specifically, we randomly select part (say 20%) of the original content and keep it unchanged when building the format symbols C. For example, we obtain a new symbol set C′ for the example sentences: C′ = {c0, c0, c0, love, c1, </s>, bends, c0, c0, c0, c0, remove, c1, </s>, <eos>}, where "love", "bends" and "remove" are kept in the format C′. After the pre-training stage, we can conduct the fine-tuning procedure directly on the target corpus without adjusting the model structure.
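The masking strategy can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `partially_revealed_format`, the per-position independent sampling, and the decision to let any non-special position (including punctuation) be revealed are illustrative choices, since the text only specifies that roughly 20% of the original content is kept.

```python
import random

SPECIAL = {"</s>", "<eos>"}


def partially_revealed_format(tokens, symbols, keep_ratio=0.2, seed=None):
    """Replace a random subset of format symbols by the original tokens.

    `tokens` and `symbols` are aligned lists over the target sequence.
    Special separator and ending markers are always left as they are.
    """
    rng = random.Random(seed)
    revealed = []
    for token, symbol in zip(tokens, symbols):
        if symbol in SPECIAL:
            revealed.append(symbol)
        elif rng.random() < keep_ratio:
            revealed.append(token)    # keep the real word in the format
        else:
            revealed.append(symbol)   # keep the place-holder symbol
    return revealed


tokens = ["love", "is", "not", "love", ",", "</s>",
          "bends", "with", "the", "remover", "to", "remove", ".", "</s>", "<eos>"]
symbols = ["c0", "c0", "c0", "c2", "c1", "</s>",
           "c0", "c0", "c0", "c0", "c0", "c2", "c1", "</s>", "<eos>"]
print(partially_revealed_format(tokens, symbols, keep_ratio=0.2, seed=13))
```

The same routine could serve the polishing operation at inference time: a user fixes the words to keep and leaves the remaining positions as place-holders.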

Generation
We can assign any format and rhyming symbols C to control the generation. Given C, we obtain P and S automatically. The model then generates iteratively, starting from the special token <bos>, until it meets the ending marker <eos>. Both the beam-search algorithm (Koehn, 2004) and the truncated top-k sampling method (Fan et al., 2018; Radford et al., 2019) are utilized to conduct the decoding.
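For readers unfamiliar with truncated top-k sampling, a minimal Python sketch of one decoding step follows. The function name `top_k_sample` and the toy logits are illustrative; a real decoder would obtain the logits from the language model at every step.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring candidates."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights restricted to the truncated candidate set.
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
print(top_k_sample(logits, k=1))                    # k = 1 is greedy decoding
print(top_k_sample(logits, k=2, rng=random.Random(0)))
```

With k = 1 the procedure reduces to greedy decoding, and larger k trades quality for diversity.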

Settings
The parameter size of our model is fixed in both the pre-training stage and the fine-tuning stage. The number of layers is L = 12, and the hidden size is 768. We employ 12 heads in both the masking multi-head self-attention block and the global attention block. The Adam (Kingma and Ba, 2014) optimization method with the Noam learning-rate decay strategy and 10,000 warmup steps is employed to conduct the pre-training.
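The Noam schedule mentioned above can be written as a one-line function. The sketch below assumes the original formulation of Vaswani et al. (2017), scaled by the hidden size of 768; the exact scaling factor used for SongNet is not stated in the text, so the absolute values are illustrative only.

```python
def noam_lr(step, d_model=768, warmup=10000):
    """Noam schedule: linear warmup, then inverse square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


for step in (1, 5000, 10000, 20000, 100000):
    print(step, noam_lr(step))
```

The learning rate rises linearly for the first 10,000 steps, peaks at the warmup boundary, and then decays proportionally to the inverse square root of the step number.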

Datasets
We conduct all the experiments on two collected corpora of different literary genres: SongCi and Sonnet, in Chinese and English respectively. The statistics are shown in Table 3. Sonnet is small since we only use the samples from Shakespeare's Sonnets (Shakespeare, 2000). Since SongCi and Sonnet are in different languages, we conduct the pre-training procedure on two large-scale corpora in the corresponding languages. For Chinese, we collect Chinese Wikipedia (1700M characters) and a merged Chinese News corpus (9200M characters) from the Internet. We did not perform word segmentation on the Chinese datasets, which means that we use characters to build the vocabulary, whose size is 27681. For English, same as BERT, we employ English Wikipedia (2400M words) and BooksCorpus (980M words) (Zhu et al., 2015) to conduct the pre-training. We did not apply BPE (Sennrich et al., 2015) to this corpus, considering the format controlling purpose. We keep the 50,000 most frequent words to build the vocabulary.

Evaluation Metrics
Besides PPL and Distinct (Li et al., 2016), we also tailor-design several metrics for our task to evaluate format, rhyme, and sentence integrity.

Format: Assume that there are m sentences defined in the format $C = \{C^s_1, C^s_2, \ldots, C^s_m\}$, and that the generated result Y contains n sentences $Y = \{Y^s_1, Y^s_2, \ldots, Y^s_n\}$. Without loss of generality, we align C and Y from the beginning and judge the format quality according to the following rules: (1) the length difference satisfies $\big|\,|C^s_i| - |Y^s_i|\,\big| \le \delta$; (2) the punctuation characters must be the same. For SongCi, we let δ = 0 and rule (2) must hold. For Sonnet, we relax the condition by letting δ = 1 and ignoring rule (2). Let n′ be the number of format-correct sentences; then we obtain Precision p = n′/n, Recall r = n′/m, and the F1-measure. We report both Macro-F1 and Micro-F1 in the results tables.
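The format metric for a single sample can be sketched as follows. This is a minimal sketch, assuming that sentences are given as token lists and that rule (2) is checked by comparing the sequences of punctuation tokens; the function name `format_scores` and the ASCII punctuation set are choices made for this example. Aggregation into Macro-F1 and Micro-F1 over a test set is not shown.

```python
PUNCT = set(",.;:!?")


def format_scores(format_sents, generated_sents, delta=0, check_punct=True):
    """Precision, recall and F1 of format-correct sentences for one sample."""
    m, n = len(format_sents), len(generated_sents)
    correct = 0
    # Align format and generation from the beginning, sentence by sentence.
    for c, y in zip(format_sents, generated_sents):
        length_ok = abs(len(c) - len(y)) <= delta                 # rule (1)
        punct_ok = (not check_punct) or (
            [t for t in c if t in PUNCT] == [t for t in y if t in PUNCT]
        )                                                          # rule (2)
        if length_ok and punct_ok:
            correct += 1
    p = correct / n if n else 0.0
    r = correct / m if m else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


C = [["c0", "c0", "c0", "c2", ","],
     ["c0", "c0", "c0", "c0", "c0", "c2", "."]]
Y = [["love", "is", "not", "love", ","],
     ["bends", "with", "the", "remover", "to", "remove", "."]]
print(format_scores(C, Y))                                  # strict setting
print(format_scores(C, Y, delta=1, check_punct=False))      # relaxed setting
```

The strict call corresponds to the SongCi setting (δ = 0, punctuation checked) and the relaxed call to the Sonnet setting (δ = 1, punctuation ignored).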
Rhyme: For SongCi, there is usually only one group of rhyming words in one sample. In the example shown in Figure 1, the pronunciations of the red rhyming words are "zhu", "yü", "du", and "gu" respectively, and the rhyming phoneme is "u". For the generated samples, we first use the tool pinyin to get the pronunciations (PinYin) of the words in the rhyming positions, and then conduct the evaluation. For the Shakespeare's Sonnets corpus, the rhyming rule is clearly "ABAB CDCD EFEF GG", and there are 7 groups of rhyming tokens. For the generated samples, we employ the CMU Pronouncing Dictionary (Speech@CMU, 1998) to obtain the phonemes of the words in the rhyming positions. For example, the phonemes for the words "asleep" and "steep" are ['AH0', 'S', 'L', 'IY1', 'P'] and ['S', 'T', 'IY1', 'P'] respectively. We then conduct the evaluation by counting the overlapping units from both the original words and the extracted phonemes group by group. We report the Macro-F1 and Micro-F1 numbers in the results tables as well.
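To make the phoneme comparison concrete, the sketch below checks whether two words rhyme by comparing their phonemes from the last vowel onward. This is a common perfect-rhyme heuristic swapped in for illustration; it is not necessarily the overlap count used in the evaluation above. The phonemes for "asleep" and "steep" are those quoted in the text, while the entry for "step" is an additional example of our own, and the function names are hypothetical.

```python
def rhyme_part(phonemes):
    """Return the phonemes from the last vowel onward.

    In CMU-style transcriptions, vowels carry a stress digit (0, 1 or 2).
    """
    for i in range(len(phonemes) - 1, -1, -1):
        if phonemes[i][-1].isdigit():
            return phonemes[i:]
    return phonemes


def rhymes(a, b):
    """Two words rhyme here if their final vowel and trailing sounds match."""
    return rhyme_part(a) == rhyme_part(b)


asleep = ["AH0", "S", "L", "IY1", "P"]
steep = ["S", "T", "IY1", "P"]
step = ["S", "T", "EH1", "P"]   # extra example, not from the paper

print(rhymes(asleep, steep))    # True: both end in IY1 P
print(rhymes(steep, step))      # False: IY1 P versus EH1 P
```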
Integrity: Since the format in our task is strict and rigid, the number of words to be predicted is also pre-defined. Our model must organize the language using the limited positions, so sentence integrity may become a serious issue. For example, the integrity of "love is not love . </s>" is much better than that of "love is not the . </s>". To evaluate sentence integrity, we design a straightforward method that calculates, for each sentence $y^i$ of the generated sequence $Y$, the prediction probability of the punctuation character before </s> given the prefix tokens:

$$\log P\big(y^i_{punc} \mid y^i_0, y^i_1, \ldots, y^i_{<punc}\big) \tag{11}$$

A smaller integrity metric value indicates higher sentence quality. To obtain these probabilities, we pre-train two GPT2 (Radford et al., 2019) models on the large-scale Chinese corpus and English corpus respectively, and then use them to evaluate sentence integrity.

Human Evaluations: For SongCi, we sampled 50 samples for 25 CiPais. For Sonnet, all 27 samples in the test set are selected for human evaluation. We recruit three helpers to score Relevance, Fluency, and Style. The rating criteria are as follows:
Relevance: +2: all the sentences are relevant to the same topic; +1: some of the sentences are relevant; 0: not relevant at all.
Fluency: +2: fluent; +1: readable but with some grammar mistakes; 0: unreadable.
Style: +2: matches the SongCi or Sonnet genre; +1: partially matches; 0: mismatch.

Comparison Methods
S2S: Sequence-to-sequence framework with attention mechanism. We regard the format and rhyme symbols C as the input sequence, and the target as the output sequence.
GPT2: We fine-tune the GPT2 models (the pre-trained versions are used for sentence integrity evaluation) on SongCi and Sonnet respectively.
SongNet: Our proposed framework with both the pre-training and fine-tuning stages.
We also conduct ablation analysis to verify the performance of the defined symbols as well as the variants of model structures.
• SongNet (only pre-training): Without the fine-tuning stage.
• SongNet (only fine-tuning): Without the pre-training stage.
• SongNet-GRU: Employ GRU to replace Transformer as the core structure.
• SongNet w/o C: Remove the format and rhyme symbols C.
• SongNet w/o P: Remove the intra-position symbols P.
• SongNet w/o S: Remove the sentence segment symbols S.
• SongNet w/ inverse-P: Arrange the intra-position indices in ascending order instead of descending order.

Table 5 caption: The number (e.g., 3, 5, 7) denotes the number of tokens in one sentence. The rhyming words are labeled in red color and italic font, followed by their Pinyin. (Since the cases are provided to confirm format consistency, we did not translate the Chinese samples. Translation of Chinese poetry is also a challenging task.)
though all thy love with thy hearts ,
thou still are lacking of my dead ;
if thy love love is lost to your love and parts ,
and yet mine own heart can be buried .
so many are ill or in tear,
hath not this time that we will make their eye ,
for that which lies not well hath now appear,
no longer nor the world that holds thee lie !
for if it would be buried in my live ,
or by the earth of mine was gone ,
then my own parts as my body and mine give ,
may not be so far beyond thine alone :
so far as thee and this world view find thee ,
then mine life be far enough from all thee and no me .

Table 6: Cases of the generated results given the formats with partial pre-defined content. Format token " " needs to be translated to real word token.

Results
Please note that we mainly employ the top-k sampling method (Fan et al., 2018; Radford et al., 2019) to conduct the generation, and we let k = 32 here. The parameter tuning of k is described in Section 5.3. Table 1 and Table 2 depict the experimental results of SongNet as well as the baseline methods S2S and GPT2 on corpus SongCi and Sonnet respectively. Our pre-training and fine-tuning framework SongNet obtains the best performance on most of the automatic metrics. In particular, on the metric of Format accuracy, SongNet obtains a value above 98%, which means that our framework can generate text that rigidly matches the pre-defined formats. On the metrics of PPL, Rhyme accuracy, and sentence integrity, SongNet also outperforms by a large margin the baseline methods S2S and GPT2 as well as the model variants with only the pre-training or only the fine-tuning stage.
Another observation is that some of the results on corpus Sonnet are not as good as the results on SongCi. The main reason is that Sonnet only contains 100 samples in the training set, as shown in Table 3. Therefore, the model cannot capture sufficient useful features, especially for rhyming.

Ablation Analysis
We conduct an ablation study on corpus SongCi, and the experimental results are depicted in Table 4. Note that all the models are trained purely on the SongCi corpus without any pre-training stage.
From the results we can conclude that the introduced symbols C, P, and S indeed play crucial roles in improving the overall performance, especially on the metrics of format, rhyme, and sentence integrity. Even though some of the components cannot improve the performance on all the metrics simultaneously, their combination obtains the best performance.

Parameter Tuning
Since we employ top-k sampling as our main decoding strategy, we design several experiments to tune the parameter k. We let k be 1, 5, 10, 20, 50, and 500 respectively. We also provide the beam-search (beam=5) results for comparison and reference. The parameter tuning results are depicted in Figure 3. From the results we can observe that a large k increases the diversity of the results significantly, but the Rhyme accuracy and the sentence integrity drop at the same time. Therefore, in the experiments we let k = 32 to obtain a trade-off between diversity and general quality.

Human Evaluation
For human evaluation, we only judge the results generated by our final model SongNet. From the results we can observe that the results on corpus SongCi are much better than those on corpus Sonnet because of the difference in corpus scale; the small scale of Sonnet also leads to a dramatic drop on all the metrics.

Case Analysis
Table 5 depicts several generated cases for SongCi and Sonnet respectively. For SongCi, the formats (CiPai) are all cold-start samples which are not in the training set or are even newly defined. Our model can still generate high-quality results in terms of format, rhyme, and integrity. However, for corpus Sonnet, even though the model can generate 14-line text, the quality is not as good as for SongCi due to the insufficient training set (only 100 samples). We will address this interesting and challenging few-shot issue in the future.
In addition, we mentioned that our model is capable of refining and polishing given a format C which contains some fixed text information. Examples of the results generated under this setting are shown in Table 6, which shows that our model SongNet can generate satisfying results, especially on SongCi.

Conclusion
We propose to tackle a challenging task called rigid formats controlled text generation. A pre-training and fine-tuning framework SongNet is designed to address the problem. Sets of symbols are tailor-designed to improve the modeling performance for format, rhyme, and sentence integrity. Extensive experiments conducted on two collected corpora demonstrate that our framework generates significantly better results in terms of both automatic metrics and human evaluations given arbitrary cold-start formats.