Controlling Sequence-to-Sequence Models - A Demonstration on Neural-based Acrostic Generator

An acrostic is a form of writing in which the first token of each line (or another recurring feature of the text) forms a meaningful sequence. In this paper we present a generalized acrostic generation system that can hide a message in a flexible pattern specified by the user. Unlike previous works that focus on rule-based solutions, we adopt a neural-based sequence-to-sequence model to achieve this goal. Besides the acrostic, users can also specify the rhyme and length of the output sequences. To our knowledge, this is the first neural-based natural language generation system that demonstrates the capability of performing micro-level control over output sentences.


Introduction
An acrostic is a form of writing that hides messages in text, often used for sarcasm or to deliver private information. In previous work, English acrostics have been generated by searching for paraphrases in WordNet's synsets (Stein et al., 2014): synonyms that contain the needed characters replace the corresponding words in the context to produce the acrostic. Nowadays Seq2Seq models have become a popular choice for text generation, including generating text from tables (Liu et al., 2018), summaries (Nallapati et al., 2016), short-text conversations (Shang et al., 2015), machine translation (Bahdanau et al., 2015; Sutskever et al., 2014) and so on. In contrast to a rule-based or template-based generator, such Seq2Seq solutions are often considered more general and creative, as they do not rely heavily on prerequisite knowledge or patterns to produce meaningful content. Although several works have presented automatic generation of rhymed text (Zhang and Lapata, 2014; Ghazvininejad et al., 2016), these works do not focus on controlling the rhyme of the generated content. One drawback of a neural-based Seq2Seq model is that its outputs are hard to control, since generation follows a non-deterministic probabilistic model (or language model). This makes it non-trivial to impose hard constraints such as an acrostic (i.e. micro-controlling the position of a specific token) or a rhyme. In this work, we present an NLG system that allows users to micro-control the generation of a Seq2Seq model without any post-processing. Besides specifying the tokens and their corresponding locations for the acrostic, our model allows users to choose the rhyme and length of the generated lines. We show that with a simple adjustment, a Seq2Seq model such as the Transformer (Vaswani et al., 2017) can be trained to control the generation of the text. Our demo system focuses on Chinese and English lyrics, which can be regarded as a writing style in between articles and poetry.
We consider a general version of acrostic writing, in which users can arbitrarily choose the positions at which acrostic tokens are placed. The 2-minute demonstration video can be found at https://youtu.be/9tX6ELCNMCE.

Model Description
Normally a neural-based Seq2Seq model is learned using input/output sequences as training pairs (Nallapati et al., 2016; Cho et al., 2014a). Given a sufficient amount of such training pairs, the model is expected to learn how to produce the output sequences based on the inputs. Here we first report a finding: a Seq2Seq model is capable of discovering the hidden associations between input control signals and output sequences. Based on this finding we have created a demo system to show that users can indeed guide the outputs of a Seq2Seq model in a fine-grained manner. In our demo, users can control three aspects of the generated sequences: rhyme, sentence length and the positions of designated tokens. In other words, our Seq2Seq model is not only capable of generating a next line that satisfies the length and rhyme constraints provided by the user, but can also produce the exact word at a position specified by the user. The rhyme of a sentence is the last syllable of the last word in that sentence. The length of a sentence is the number of tokens in that sentence. To explain how our model is trained, we use three consecutive lines (denoted as S1, S2, S3) of lyrics from the song "Rhythm of the Rain" as an example. Normally a Seq2Seq model is trained on the following input/output pairs:
S1: Listen to the rhythm of the falling rain → S2: Telling me just what a fool I've been
S2: Telling me just what a fool I've been → S3: I wish that it would go and let me cry in vain
Through experiments on training Seq2Seq models, we have discovered an interesting fact: when control signals are appended to the end of the input sequences, the Seq2Seq model, after seeing a sufficient amount of such data, can automatically discover the association between the input signals and the outputs. Once the associations are identified, we can use the control signals to guide the output of the model.
For instance, here we append additional control information to the end of the training input sequences as below:
S1: Listen to the rhythm of the falling rain || 1 Telling || IH N || 8 → S2: Telling me just what a fool I've been
S2: Telling me just what a fool I've been || 2 wish 6 go || EY N || 12 → S3: I wish that it would go and let me cry in vain
The three types of control signals are separated by "||". The first control signal indicates the positions of the designated words. "1 Telling" tells the system that the token Telling should be produced at the first position of the output sequence S2. Similarly, "2 wish 6 go" means that the second and sixth tokens in the output sequence shall be wish and go, respectively. The second control signal is the rhyme of the target sentence. For instance, IH N corresponds to a specific rhyme (/In/) and EY N corresponds to another (/en/). Note that here we use the formal name of the rhyme (e.g. EY N) to improve readability; to train our system, any arbitrary symbol would work. The third part contains a number (e.g. 8) to control the length of the output line.
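The control-signal format described above can be sketched in a few lines of Python. This is only an illustration of the input layout, not the authors' preprocessing code: the function name `build_input` and its parameters are hypothetical, and the exact spacing around "||" is an assumption taken from the printed examples.

```python
from typing import Dict

SEP = "||"  # separator between the control signals, as in the examples above


def build_input(prev_line: str, positions: Dict[int, str], rhyme: str, length: int) -> str:
    """Append position, rhyme and length control signals to the previous line.

    Hypothetical helper: `positions` maps 1-based token positions in the
    target line to the tokens that must appear there.
    """
    # Emit the designated tokens in ascending position order as "pos token" pairs.
    pos_part = " ".join(f"{pos} {tok}" for pos, tok in sorted(positions.items()))
    return f"{prev_line} {SEP} {pos_part} {SEP} {rhyme} {SEP} {length}"


example = build_input(
    "Telling me just what a fool I've been",
    {2: "wish", 6: "go"},
    "EY N",
    12,
)
print(example)
```

Called with the second training example, the sketch reproduces the input side of that pair; the target line S3 is left unchanged.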
By adding such additional information, Seq2Seq models can eventually learn the meaning of the control signals, as they produce outputs that follow those signals with very high accuracy. Note that in our demo, all results are produced by our Seq2Seq model without any post-processing, nor do we provide the model with any prerequisite knowledge about what length, rhyme or position really stands for. We train our system based on the Transformer model (Vaswani et al., 2017), though additional experiments show that RNN-based Seq2Seq models such as those based on GRU (Cho et al., 2014b) or LSTM would also work. The model consists of an encoder and a decoder. Our encoder consists of two identical layers when training on Chinese lyrics and four identical layers when training on English lyrics. Each layer includes two sub-layers: the first is a multi-head attention layer and the second is a fully connected feed-forward layer. Residual connections (He et al., 2016) are implemented between the sub-layers. The decoder likewise consists of two identical layers when training on Chinese lyrics and four identical layers when training on English lyrics. Each decoder layer includes three sub-layers: a masked multi-head attention layer, a multi-head attention layer that attends over the output of the encoder, and a fully connected feed-forward layer. The model structure is shown in Figure 1. Note that in the original paper (Vaswani et al., 2017), the Transformer consists of six identical layers for both the encoder and the decoder. To save resources, we started training with fewer layers than the original paper and found that the model still performs well, so we use fewer layers than the originally proposed Transformer model.

User Interface
Figure 2 illustrates the interface and data flow of our acrostic lyric generating system. First, there are several conditions (or control signals) that can be specified by the users:
• Rhyme: Users can choose from 33 different rhymes for Chinese lyrics and from 30 different rhymes for English lyrics.
• Theme of topic: The theme given by the user is used to generate the zeroth sentence. In the Chinese acrostic demonstration, our system picks the sentence from the training set that is most similar to the user input, measured by the number of matching n-grams. In the English acrostic demo, the user's theme input is used directly as the zeroth sentence.
• Length of each line: The user can specify the length of every line, separated by ";". For example, "5;6;7" means that the user wants to generate an acrostic of 3 lines with lengths 5, 6 and 7, respectively.
• The sequence of tokens to be hidden in the output sequences.
• Hidden Pattern: The exact position of each token to be hidden. Apart from common options, such as hiding tokens in the first or last position of each sentence or along the diagonal, our system offers a more general and flexible way to define the pattern through the Draw It Myself option. As shown in the bottom right corner of Figure 2, a table based on the line lengths specified by the user is created, in which the user selects the positions for the acrostic tokens.
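As a rough illustration of how the length specification and the common hidden patterns listed above could be turned into per-line position constraints, consider the sketch below. The function names `parse_lengths` and `place_tokens`, the pattern labels, and the restriction to one hidden token per line are assumptions made for this example; the free-form Draw It Myself table is not covered.

```python
from typing import Dict, List


def parse_lengths(spec: str) -> List[int]:
    """Parse a length specification such as "5;6;7" into [5, 6, 7]."""
    return [int(part) for part in spec.split(";") if part.strip()]


def place_tokens(hidden: List[str], lengths: List[int], pattern: str) -> List[Dict[int, str]]:
    """Return one {1-based position: token} map per output line.

    Simplifying assumption: exactly one hidden token per line.
    """
    if len(hidden) != len(lengths):
        raise ValueError("this sketch hides exactly one token per line")
    placements = []
    for i, (token, length) in enumerate(zip(hidden, lengths)):
        if pattern == "first":
            pos = 1
        elif pattern == "last":
            pos = length
        elif pattern == "diagonal":
            pos = i + 1  # line 1 -> position 1, line 2 -> position 2, ...
            if pos > length:
                raise ValueError(f"line {i + 1} is too short for a diagonal")
        else:
            raise ValueError(f"unknown pattern: {pattern}")
        placements.append({pos: token})
    return placements


lengths = parse_lengths("5;6;7")
print(place_tokens(["I", "like", "you"], lengths, "diagonal"))
```

Each returned map corresponds to the first control signal of one line, so it can be serialized directly into the "position token" part of the model input.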
The generation is done on the server side. After receiving the control signals provided by the user, the server first uses the given theme to search the lyric corpus for a related line (denoted as the zeroth sequence), based on both sentence-level and character-level matching. Then the condition given for the first sentence is appended to this zeroth sequence to serve as the initial input to the Seq2Seq model, which generates the first output line. Next, the condition given for the second sentence is appended to the generated first line as input to generate the second line. The same process is repeated until all lines are generated.
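The line-by-line procedure above amounts to a simple loop in which each generated line becomes the source for the next one. The sketch below assumes a callable `model` that maps an input string to an output line; `stub_model` is a hypothetical stand-in that only honors the position and length signals (it is not the trained Seq2Seq model), and `format_input` and `generate_lyrics` are illustrative names.

```python
from typing import Callable, Dict, List, Tuple

Condition = Tuple[Dict[int, str], str, int]  # (positions, rhyme, length)


def format_input(prev_line: str, condition: Condition) -> str:
    """Append the control signals for the next line to the previous line."""
    positions, rhyme, length = condition
    pos_part = " ".join(f"{p} {t}" for p, t in sorted(positions.items()))
    return f"{prev_line} || {pos_part} || {rhyme} || {length}"


def generate_lyrics(zeroth_line: str, conditions: List[Condition],
                    model: Callable[[str], str]) -> List[str]:
    """Generate one line per condition, feeding each output back as the next source."""
    lines = []
    prev = zeroth_line
    for condition in conditions:
        prev = model(format_input(prev, condition))
        lines.append(prev)
    return lines


def stub_model(source: str) -> str:
    """Stand-in for the trained model: fills every unconstrained slot with 'la'."""
    _, pos_part, _rhyme, length = [part.strip() for part in source.split("||")]
    tokens = ["la"] * int(length)
    fields = pos_part.split()
    for pos, tok in zip(fields[0::2], fields[1::2]):
        tokens[int(pos) - 1] = tok
    return " ".join(tokens)


lyrics = generate_lyrics(
    "Listen to the rhythm of the falling rain",
    [({1: "I"}, "EY N", 3), ({1: "like"}, "EY N", 4)],
    stub_model,
)
print(lyrics)
```

Because each step depends only on the previous line and the current condition, the loop needs no post-processing of the model output, which matches the description above.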

Data set
We have two versions of the system: one trained on Chinese lyrics and one on English lyrics.
The Chinese lyrics are crawled from the Mojim lyrics site and NetEase Cloud. To avoid rare characters, the vocabulary is restricted to the 50,000 most frequent characters. The English lyrics are crawled from Lyrics Freak, and the vocabulary is restricted to the 50,000 most frequent words. For each line of lyrics, we first calculate its length and then retrieve the rhyme of its last token. To generate the training pairs, we append to the input sequence some randomly chosen tokens of the target sequence together with their positions as the first control signal, followed by the rhyme and then the length. Below are two example training pairs:
S1: Listen to the rhythm of the falling rain || 2 me 3 just || IY N || 8 → S2: Telling me just what a fool I've been
S2: Telling me just what a fool I've been || 2 wish 6 go 7 and || EY N || 12 → S3: I wish that it would go and let me cry in vain
In total, we use about 651,339 and 1,000,000 training pairs to train our Chinese and English acrostic systems, respectively.
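The construction of training pairs described above can be sketched as follows. This is a minimal illustration under stated assumptions: the name `make_training_pairs`, the `max_hints` cap on how many tokens are sampled, and the `rhyme_of` lookup (here a toy dictionary rather than a real pronunciation resource) are all hypothetical, and every line is assumed to be non-empty.

```python
import random
from typing import Callable, List, Tuple


def make_training_pairs(lines: List[str], rhyme_of: Callable[[str], str],
                        rng: random.Random, max_hints: int = 3) -> List[Tuple[str, str]]:
    """Turn consecutive lyric lines into (input with control signals, target) pairs."""
    pairs = []
    for source, target in zip(lines, lines[1:]):
        tokens = target.split()
        # Randomly pick between 1 and max_hints target tokens to reveal (assumed cap).
        k = rng.randint(1, min(max_hints, len(tokens)))
        chosen = sorted(rng.sample(range(len(tokens)), k))
        hints = " ".join(f"{i + 1} {tokens[i]}" for i in chosen)  # 1-based positions
        rhyme = rhyme_of(tokens[-1])  # rhyme of the last token of the target
        pairs.append((f"{source} || {hints} || {rhyme} || {len(tokens)}", target))
    return pairs


song = [
    "Listen to the rhythm of the falling rain",
    "Telling me just what a fool I've been",
    "I wish that it would go and let me cry in vain",
]
toy_rhymes = {"been": "IY N", "vain": "EY N"}  # toy lookup for this example only
pairs = make_training_pairs(song, toy_rhymes.__getitem__, random.Random(0))
for src, tgt in pairs:
    print(src, "→", tgt)
```

Sampling the revealed tokens at random means the same target line can appear with different position constraints, which is presumably what lets the model generalize to arbitrary user-drawn patterns.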

Evaluation
Our system has three controllable conditions for generating acrostics: the positions of designated tokens, the rhyme of each line and the length of each line. The evaluation set consists of 30,000 lines that are not included in the training data. We first evaluate how accurately the control conditions are satisfied. As shown in Table 1, the model can almost perfectly satisfy the requests from users. We also evaluate the quality of the learned language model for Chinese/English lyrics. The bi-gram perplexity of the original training corpus is 54.56/53.2. The bi-gram perplexity of the generated lyrics is lower (42.33/42.34), which indicates that the language model does learn a better way to represent the lyrics data. In this experiment we find that training on English lyrics is harder than training on Chinese lyrics: English has strict grammatical rules, while Chinese lyrics have more freedom in forming a sentence. We also observe that the model tends to generate sentences that reuse words appearing in the previous sentences. This behavior might be learned from the repetition of lines in lyrics.
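One way to measure how well the three control conditions are satisfied is sketched below. The granularity is an assumption: each condition is scored per line (a line counts for "position" only if every designated token is in place), which may differ from how Table 1 was computed. The name `control_accuracy` and the `rhyme_of` lookup are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[Dict[int, str], str, int, str]  # (positions, rhyme, length, generated line)


def control_accuracy(examples: List[Example],
                     rhyme_of: Callable[[str], Optional[str]]) -> Dict[str, float]:
    """Fraction of generated lines that satisfy each control condition."""
    if not examples:
        raise ValueError("need at least one example")
    hits = {"position": 0, "rhyme": 0, "length": 0}
    for positions, rhyme, length, line in examples:
        tokens = line.split()
        # Every designated token must sit at its requested 1-based position.
        hits["position"] += all(
            1 <= p <= len(tokens) and tokens[p - 1] == tok
            for p, tok in positions.items()
        )
        # The rhyme of the last generated token must match the requested rhyme.
        hits["rhyme"] += bool(tokens) and rhyme_of(tokens[-1]) == rhyme
        # The number of tokens must equal the requested length.
        hits["length"] += len(tokens) == length
    return {name: count / len(examples) for name, count in hits.items()}


toy_rhymes = {"vain": "EY N", "been": "IY N"}  # toy lookup for this example only
scores = control_accuracy(
    [
        ({2: "wish", 6: "go"}, "EY N", 12, "I wish that it would go and let me cry in vain"),
        ({1: "Telling"}, "EY N", 7, "Telling me just what a fool I've been"),
    ],
    toy_rhymes.get,
)
print(scores)
```

In the toy input, the first line satisfies all three conditions while the second deliberately violates the requested rhyme and length, so only the position score is perfect.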

Demonstration of Results
We present system outputs that illustrate different aspects of control.
The first example, in Figure 3, shows that we can control the length of each line to produce triangle-shaped lyrics. Second, we demonstrate the results of generating acrostics. Some people use acrostics to hide a message that bears no resemblance to the content of the full text. We show both English and Chinese examples generated by our system. Figure 4 shows a sentence hidden in the first word of each line. The sentence concealed in the lyrics is "I don't like you", which is very different from the meaning of the full lyrics. Figure 5 shows a Chinese acrostic generated by our system. We hide the message 甚麼都可以藏 (Anything can be hidden) along the diagonal of a piece of lyrics that talks about relationships and dreams.
Third, we can also play with the visual shape formed by the designated words. Figure 6 shows an example of hiding a sentence in the shape of a diamond in the generated lyrics. The message concealed is "be the change you wish to see in the world". Figure 7 shows that we can hide a message in the shape of a heart: the designated characters form a heart, and the sentence hidden in the lyrics is 疏影橫斜水清淺暗香浮動月黃昏 (The shadow reflects on the water and the fragrance drifts under the moon with the color of dusk), with rhyme i.

Conclusion
We show that by appending additional information to the training input sequences, it is possible to train a Seq2Seq model whose outputs can be controlled at a fine-grained level. This finding enables us to design and demonstrate a general acrostic generating system with various controllable features, including the length of each line, the rhyme of each line, and the target tokens to be produced together with their positions. Our results show that the proposed model is not only capable of generating meaningful content, but also follows the constraints with very high accuracy. We believe that this finding can further lead to other useful applications in natural language generation.