Shakespearizing Modern Language Using Copy-Enriched Sequence to Sequence Models

Variations in writing style are commonly used to adapt content to a specific context, audience, or purpose. However, applying stylistic variations is still by and large a manual process, and there have been few efforts towards automating it. In this paper we explore automated methods to transform text from modern English to Shakespearean English using an end-to-end trainable neural model with pointers to enable copy actions. To tackle the limited amount of parallel data, we pre-train word embeddings by leveraging external dictionaries mapping Shakespearean words to modern English words, as well as additional text. Our methods achieve a BLEU score of 31+, an improvement of ≈ 6 points over the strongest baseline. We publicly release our code to foster further research in this area.


Introduction
Text is often morphed using a variety of lexical and grammatical transformations, adjusting the degree of formality, usage of catchy phrases, and other such stylistic changes to make it more appealing. Moreover, different text styles appeal to different user segments (Saha Roy et al., 2015; Kitis, 1997; Schwartz et al., 2013). Thus there is a need to quickly adapt text to different styles. However, manually transforming text to a desired style can be a tedious process.
There have been increased efforts towards machine-assisted text content creation and editing through automated methods for summarization (Rush et al., 2015), brand naming (Hiranandani et al., 2017), text expansion (Srinivasan et al., 2017), etc. However, there is a dearth of automated solutions for adapting text quickly to different styles. We consider the problem of transforming text written in modern English to Shakespearean-style English. For the sake of brevity and clarity of exposition, we henceforth refer to the Shakespearean sentences/side as Original and the modern English paraphrases as Modern.

* denotes equal contribution authors
Unlike traditional domain or style transfer, our task is made more challenging by the fact that the two styles employ diachronically disparate registers of English: one style uses the contemporary language while the other uses Early Modern English from the Elizabethan Era (1558-1603). Although Early Modern English is not classified as a different language (unlike Old English and Middle English), it does have novel words (acknown and belike), novel grammatical constructions (two second-person forms: thou (informal) and you (formal) (Brown et al., 1960)), semantically drifted senses (e.g., fetches is a synonym of excuses), and non-standard orthography (Rayson et al., 2007). Additionally, there is a domain difference, since the Shakespearean sentences come from drama whereas the parallel modern English sentences are meant to be simplified explanations for high-school students.
Prior works in this field leverage a language model for the target style, achieving transformation either using phrase tables (Xu et al., 2012) or by inserting relevant adjectives and adverbs (Saha Roy et al., 2015). Such works have limited accuracy and scope in the type of transformations that can be achieved. Moreover, statistical and rule-based MT systems do not provide a direct mechanism to a) share word representation information between the source and target sides, or b) incorporate constraints between words into word representations in an end-to-end fashion. Neural sequence-to-sequence models, on the other hand, provide such flexibility (Faruqui et al., 2014).
Our main contributions are as follows:
• We use a sentence-level sequence-to-sequence neural model with a pointer network component to enable direct copying of words from the input. We demonstrate that this method performs much better than prior phrase-translation-based approaches for transforming Modern English text to Shakespearean English.
• We pre-train word embeddings by leveraging an external dictionary of word pairs. The pretrained embeddings, especially when kept fixed, improve performance.
The rest of the paper is organized as follows. We first provide a brief analysis of our dataset (§2). We then elaborate on the details of our methods (§3, §4, §5, §6). We then discuss the experimental setup and baselines (§7). Thereafter, we discuss the results and observations (§8). We conclude with discussions on related work (§9) and future directions (§10).

Dataset
Our dataset is a collection of line-by-line modern paraphrases for 16 of Shakespeare's 36 plays (Antony & Cleopatra, As You Like It, Comedy of Errors, Hamlet, Henry V, etc.) from the educational site Sparknotes. This dataset was compiled by Xu et al. (2014; 2012) and is freely available on GitHub. 14 plays covering 18,395 sentences form the training data split. We kept 1,218 sentences from the play Twelfth Night as the validation data set. The last play, Romeo and Juliet, comprising 1,462 sentences, forms the test set.

Examples
Table 1 shows some parallel pairs from the test split of our data, along with the corresponding target outputs from some of our models. Copy and SimpleS2S refer to our best-performing attentional S2S models with and without a Copy component respectively. Stat refers to the best statistical machine translation baseline, using the off-the-shelf GIZA++ aligner and MOSES. We can see through many of the examples how direct copying from the source side helps the Copy model generate better outputs than the SimpleS2S. The approaches are described in greater detail in (§3) and (§7).

Analysis
Table 2 shows some statistics from the training split of the dataset. In general, the Original side has longer sentences and a larger vocabulary. The slightly higher entropy of the Original side's frequency distribution indicates that its frequencies are more spread out over words. Intuitively, the large number of shared word types indicates that sharing representations between the Original and Modern sides could provide some benefit.

Method Overview
The overall architecture of the system is shown in Figure 1. We use a bidirectional LSTM to encode the input modern English sentence. Our decoder-side model is a mixture model of an RNN module and a pointer network module. The two individual modules share the attention weights over encoder states, although it is not necessary to do so. The decoder RNN predicts the probability distribution of the next word over the vocabulary, while the pointer model predicts a probability distribution over the words in the input. The two probabilities undergo a weighted addition, the weights themselves being computed based on the previous decoder hidden state and the encoder outputs.
Let (x, y) be some input-output sentence pair in the dataset. Both the input x and the output y are sequences of tokens: x = x_1 x_2 ... x_{T_enc}, where T_enc is the length of the input sequence x; similarly, y = y_1 y_2 ... y_{T_dec}. Each x_i and y_j is a token from the vocabulary.

Token embeddings
Each token in the vocabulary is represented by an M-dimensional embedding vector. Let the vocabulary V be the union of the modern English and Shakespearean vocabularies, i.e., V = V_shakespeare ∪ V_modern. E_enc and E_dec represent the embedding matrices used by the encoder and decoder respectively (E_enc, E_dec ∈ R^{|V|×M}). We consider the union of the vocabularies for both input and output embeddings because many tokens are common to the two vocabularies, and in the best-performing setting we share embeddings between the encoder and decoder models. Let E_enc(t) represent the encoder-side embedding of some token t. For an input sequence x, E_enc(x) is given as (E_enc(x_1), E_enc(x_2), ...).
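The shared-vocabulary setup above can be sketched as follows. This is a toy illustration, not the paper's actual code; names such as `build_shared_vocab` are ours.

```python
import numpy as np

def build_shared_vocab(modern_tokens, shakespeare_tokens):
    """Union vocabulary V = V_modern ∪ V_shakespeare, with stable indices."""
    vocab = sorted(set(modern_tokens) | set(shakespeare_tokens))
    return {tok: i for i, tok in enumerate(vocab)}

def embed(sequence, vocab, E):
    """Look up rows of the (|V| x M) embedding matrix E for a token sequence."""
    return np.stack([E[vocab[t]] for t in sequence])

vocab = build_shared_vocab(["the", "cat", "sleeps"], ["the", "cat", "sleepeth"])
M = 4  # toy embedding dimension
rng = np.random.default_rng(0)
E = rng.standard_normal((len(vocab), M))  # a single matrix serves as E_enc = E_dec when sharing
x_emb = embed(["the", "cat"], vocab, E)   # encoder-side embeddings E_enc(x)
```

Because the two sides share many word types, a single embedding matrix can serve both encoder and decoder when sharing is enabled.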

Pretraining of embeddings
Learning token embeddings from scratch in an end-to-end fashion along with the model greatly increases the number of parameters. To mitigate this, we consider pretraining the token embeddings. We pretrain our embeddings on all training sentences. We also experiment with adding additional data from PTB (Marcus et al., 1993) for better learning of the embeddings. Additionally, we leverage a dictionary mapping tokens from Shakespearean English to modern English.
We consider four distinct strategies to train the embeddings. In the cases where we use external text data, we first train the embeddings using both the external data and the training data, and then for the same number of iterations on the training data alone, to ensure adaptation. Note that we do not directly use off-the-shelf pretrained embeddings such as GloVe (Pennington et al., 2014) and Word2Vec (Mikolov et al., 2013), since we need to learn embeddings for novel word forms (and also different word senses for extant word forms) on the Original side.

Plain
This is the simplest pre-training method. Here, we do not use any additional data; word embeddings are trained on the union of the Modern and Original sentences.

PlainExt
In this method, we add all sentences from the external text source (PTB) to the sentences in the training split of our data.

Retro
We leverage a dictionary L of approximate Original → Modern word pairs (Xu et al., 2012; Xu, 2014), crawled from shakespeare-words.com, a source distinct from Sparknotes. We explicitly add the two second-person pronouns and their corresponding forms (thy, thou, thyself, etc.), which are very frequent but not present in L. The final dictionary we use has 1,524 pairs. Faruqui et al. (2014) proposed a retrofitting method to update a set of word embeddings to incorporate pairwise similarity constraints. Given a set of embeddings p_i ∈ P, a vocabulary V, and a set C of pairwise constraints (i, j) between words, retrofitting learns a new set of embeddings q_i ∈ Q to minimize the following objective:

    Σ_{i ∈ V} [ δ_i ||q_i − p_i||² + Σ_{(i,j) ∈ C} ω_ij ||q_i − q_j||² ]

We use their off-the-shelf implementation to encode the dictionary constraints into our pretrained embeddings, setting C = L and using the suggested default hyperparameters for δ, ω and the number of iterations.
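As an illustration, the retrofitting objective above admits simple closed-form coordinate updates. This sketch uses uniform δ and ω for all words; Faruqui et al.'s released tool handles the general weighted case.

```python
import numpy as np

def retrofit(P, constraints, delta=1.0, omega=1.0, iters=10):
    """Minimise sum_i [ delta*||q_i - p_i||^2 + sum_{(i,j) in C} omega*||q_i - q_j||^2 ]
    by iterating the closed-form update: each q_i becomes a weighted average
    of its original vector p_i and its constrained neighbours."""
    Q = P.copy()
    neighbours = {i: [] for i in range(len(P))}
    for i, j in constraints:           # constraints are symmetric
        neighbours[i].append(j)
        neighbours[j].append(i)
    for _ in range(iters):
        for i, nbrs in neighbours.items():
            if not nbrs:
                continue               # unconstrained words keep q_i = p_i
            Q[i] = (delta * P[i] + omega * sum(Q[j] for j in nbrs)) \
                   / (delta + omega * len(nbrs))
    return Q
```

Words linked by a dictionary constraint are pulled towards each other while staying close to their pretrained vectors; unconstrained words are untouched.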

RetroExt
This method is similar to Retro, except that we use sentences from the external data (PTB) in addition to the training sentences. We use None to denote the setting where we do not pretrain the embeddings.

Fixed embeddings
Fine-tuning pre-trained embeddings for a given task may lead to overfitting, especially in scenarios with a small amount of supervised data for the task (Madhyastha et al., 2015). This is because embeddings for only a fraction of vocabulary items get updated, leaving the embeddings unchanged for many vocabulary items. To avoid this, we consider keeping the pretrained embeddings fixed. While reporting results in Section (§8), we separately report results for fixed (Fixed) and trainable (Var) embeddings, and observe that keeping embeddings fixed leads to better performance.

Method Description
In this section we give details of the various modules of the proposed neural model.

Encoder

We use a bidirectional LSTM to encode the input sequence. At each step t, the forward and backward hidden states (each in R^H) are combined by addition to give the encoder state:

    h^enc_t = h^fwd_t + h^bwd_t

We use addition to combine the forward and backward encoder states, rather than the standardly used concatenation, since it does not add extra parameters, which is important in a low-data scenario such as ours.
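A shape-level sketch of the additive combination (illustrative only; in the actual model the states come from LSTMs):

```python
import numpy as np

def combine_bidir_states(h_fwd, h_bwd):
    """Combine forward and backward encoder states by addition: the encoder
    state stays H-dimensional, whereas concatenation would give 2H and add
    parameters to every downstream layer that consumes it."""
    return h_fwd + h_bwd

T_enc, H = 5, 8
rng = np.random.default_rng(0)
h_fwd = rng.standard_normal((T_enc, H))   # forward LSTM states (toy values)
h_bwd = rng.standard_normal((T_enc, H))   # backward LSTM states (toy values)
h_enc = combine_bidir_states(h_fwd, h_bwd)
```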

Attention
Let h^dec_t represent the hidden state of the decoder LSTM at step t. Let E_dec(y_{t−1}) represent the decoder-side embedding of the previous step's output. We use a special START symbol at t = 1.
We first compute a query vector, which is a linear transformation of h^dec_{t−1}. A sentinel vector s ∈ R^H is concatenated with the encoder states to create F_att ∈ R^{(T_enc+1)×H}, where T_enc is the number of tokens in the encoder input sequence x. A normalized attention weight vector α^norm is computed over the T_enc + 1 rows of F_att. The value g, which corresponds to the attention weight over the sentinel vector, represents the weight given to the decoder RNN module while computing the output probabilities.
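A minimal sketch of the sentinel attention step described above, with toy dimensions (function names are ours):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def sentinel_attention(h_enc, sentinel, query):
    """Attend over the T_enc encoder states plus one sentinel row.
    Returns the normalised weights over input positions, and g, the sentinel
    weight (the share later given to the decoder RNN's distribution)."""
    F_att = np.vstack([h_enc, sentinel])   # (T_enc + 1, H)
    alpha_norm = softmax(F_att @ query)    # sums to 1 over T_enc + 1 entries
    return alpha_norm[:-1], alpha_norm[-1]

rng = np.random.default_rng(1)
h_enc = rng.standard_normal((4, 6))        # T_enc = 4, H = 6 (toy values)
alpha, g = sentinel_attention(h_enc, rng.standard_normal(6), rng.standard_normal(6))
```

By construction the input weights and g jointly sum to 1, which is what lets the pointer and RNN probabilities later mix into a proper distribution.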

Pointer model
As pointed out earlier, a pair of corresponding Original and Modern sentences has significant vocabulary overlap. Moreover, there are many proper nouns and rare words which might not be predicted by a sequence-to-sequence model. To rectify this, pointer networks have been used to enable copying of tokens directly from the input (Merity et al., 2016). The pointer module provides location-based attention, and the output probability distribution due to the pointer network module can be expressed as follows:

    P^PTR_t(w) = Σ_{i : x_i = w} α^norm_i

i.e., the pointer probability of a word w is the total attention mass on the input positions where w occurs.

Decoder RNN
A context vector is obtained as the summation of the encoder states weighted by the corresponding attention weights. Output probabilities over the vocabulary as per the decoder LSTM module are then computed as follows:

    c_t = Σ_i α^norm_i h^enc_i
    h^dec_t = LSTM(h^dec_{t−1}, [E_dec(y_{t−1}); c_t])
    P^LSTM_t = softmax(W h^dec_t + b)

During training, we feed the ground truth for y_{t−1}, whereas while making predictions on test data, the predicted output from the previous step is used instead.

Output prediction
The output probability of a token w at step t is a weighted combination of the probabilities from the decoder LSTM model and the pointer model:

    P_t(w) = g · P^LSTM_t(w) + P^PTR_t(w)

P^PTR_t(w) takes a non-zero value only if w occurs in the input sequence; otherwise it is 0. Forcing g = 1 (all attention mass on the sentinel) would correspond to not having a Copy component, reducing the model to a plain attentional S2S model, which we refer to as the SimpleS2S model.
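The mixture step can be sketched numerically as follows (toy vocabulary and weights; names are ours):

```python
import numpy as np

def mixture_distribution(p_rnn, alpha, g, input_tokens, vocab):
    """P_t(w) = g * P_LSTM_t(w) + sum of attention mass on input positions
    where w occurs. Since the input weights alpha sum to 1 - g (the sentinel
    takes g), the result is a proper distribution over the vocabulary."""
    p = g * p_rnn
    for pos, tok in enumerate(input_tokens):
        p[vocab[tok]] += alpha[pos]      # pointer term: copy mass from input
    return p

vocab = {"thou": 0, "art": 1, "sick": 2}
p_rnn = np.array([0.2, 0.5, 0.3])        # decoder RNN distribution (toy)
alpha = np.array([0.3, 0.1])             # attention on input "thou art"
g = 0.6                                  # sentinel weight; alpha sums to 1 - g
p = mixture_distribution(p_rnn, alpha, g, ["thou", "art"], vocab)
```

Words appearing in the input get extra mass from the pointer term; words absent from the input rely entirely on the RNN term.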

Loss functions
Cross-entropy loss is used to train the model. For a data point (x, y) ∈ D and predicted probability distributions P_t(w) over the words w ∈ V at each time step t ∈ {1, ..., T_dec}, the loss is given by:

    L_CE = − Σ_{t=1}^{T_dec} log P_t(y_t)

Sentinel Loss (SL): Following work by Merity et al. (2016), we consider an additional sentinel loss, which can be regarded as a form of supervised attention:

    L_SL = − Σ_{t=1}^{T_dec} log ( g_t + Σ_{i : x_i = y_t} α^norm_{t,i} )

We report results demonstrating the impact of including the sentinel loss (+SL).

Experiments
In this section we describe the experimental setup and evaluation criteria used.

Preprocessing
We lowercase sentences and then use NLTK's PUNKT tokenizer to tokenize all sentences. The Original side has certain characters, such as æ, which are not extant in today's language. We map each of these characters to the closest equivalent character(s) used today (e.g., æ → ae).
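A sketch of this normalisation step; the mapping table here shows only the æ example from the text and is otherwise illustrative.

```python
# Archaic character(s) -> closest modern equivalent (illustrative table).
ARCHAIC_MAP = {"æ": "ae"}

def normalise(sentence):
    """Lowercase and replace archaic characters with modern equivalents."""
    s = sentence.lower()
    for old, new in ARCHAIC_MAP.items():
        s = s.replace(old, new)
    return s
```

Tokenization with NLTK's PUNKT tokenizer would then run on the normalised text.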

As-it-is
Since both the source and target sides are English, simply replicating the input on the target side is a valid and competitive baseline, with a BLEU of 21+.

Dictionary
Xu et al. (2012) provide a dictionary mapping between a large number of Shakespearean and modern English words. We augment this dictionary with pairs corresponding to the second person thou (thou, thy, thyself), since these common tokens were not present.
Directly using this dictionary to perform word-by-word replacement is another admissible baseline. As was noted by Xu et al. (2012), this baseline actually performs worse than As-it-is. This could be due to its performing aggressive replacement without regard for word context. Moreover, a dictionary cannot easily capture one-to-many mappings or long-range dependencies (e.g., thou-thyself vs. you-yourself).
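The baseline amounts to a context-blind lookup, sketched below with toy dictionary entries (the real dictionary L has 1,524 pairs):

```python
def dictionary_baseline(sentence, dictionary):
    """Word-by-word replacement using a Modern -> Original dictionary;
    out-of-dictionary tokens pass through unchanged. Context-blind, which
    is one reason this underperforms simply copying the input."""
    return " ".join(dictionary.get(tok, tok) for tok in sentence.split())

toy_dict = {"you": "thou", "your": "thy"}   # toy entries, not the real L
out = dictionary_baseline("you lost your way", toy_dict)
```

Note how the lookup cannot decide between, say, thou and thyself, since that choice depends on the surrounding sentence.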

Off-the-shelf SMT
To train statistical machine translation (SMT) baselines, we use the publicly available open-source toolkit MOSES (Koehn et al., 2007), along with the GIZA++ word aligner (Och, 2003), as was done in (Xu et al., 2012). For training the target-side LM component, we use the lmplz toolkit within MOSES to train a 4-gram LM. We also use MERT (Och, 2003), available as part of MOSES, to tune on the validation set.
For fairness of comparison, it is necessary to use the pairwise dictionary and PTB while training the SMT models as well; the most obvious way to do this is to use the dictionary and PTB as additional training data for the alignment component and the target-side LM respectively. We experiment with several SMT models, ablating for the use of both PTB and the dictionary. In (§8), we only report the performance of the best of these approaches, with more detailed results covered in the Appendix.

Evaluation
Our primary evaluation metric is BLEU (Papineni et al., 2002). We compute BLEU using the freely available and very widely used perl script from the MOSES decoder (http://tinyurl.com/yben45gm).
We also report PINC (Chen and Dolan, 2011), which originates from the paraphrase evaluation literature and evaluates how much the target-side paraphrases differ from the source side. Given a source sentence s and a target-side paraphrase c generated by the system, PINC(s, c) is defined as:

    PINC(s, c) = (1/N) Σ_{n=1}^{N} ( 1 − |Ngram(c, n) ∩ Ngram(s, n)| / |Ngram(c, n)| )

where Ngram(x, n) denotes the set of n-grams of length n in sentence x, and N is the maximum n-gram length considered. We set N = 4. The higher the PINC, the greater the novelty of the paraphrases generated by the system. Note, however, that PINC does not measure the fluency of the generated paraphrases.
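The PINC definition above translates directly into code (a straightforward sketch of the standard metric):

```python
def ngrams(tokens, n):
    """Set of n-grams of length n in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, N=4):
    """PINC(s, c): mean over n = 1..N of the fraction of candidate n-grams
    NOT found in the source. Higher = more novel wording; fluency is not
    measured. n-gram orders longer than the candidate are skipped."""
    s, c = source.split(), candidate.split()
    scores = []
    for n in range(1, N + 1):
        c_ngrams = ngrams(c, n)
        if not c_ngrams:
            continue
        overlap = len(c_ngrams & ngrams(s, n))
        scores.append(1 - overlap / len(c_ngrams))
    return sum(scores) / len(scores)
```

An exact copy of the source scores 0, and a candidate sharing no n-grams with the source scores 1.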

Training and Parameters
We use a minibatch size of 32 and the ADAM optimizer (Kingma and Ba, 2014) with learning rate 0.001, momentum parameters 0.9 and 0.999, and ε = 10^−8. All our implementations are written in Python using the TensorFlow 1.1.0 framework.
For every model, we experimented with three configurations of embedding and LSTM size: S (128-128), ME (192-192), and L (256-256). Across models, we find that the ME configuration performs best in terms of validation BLEU. We also find that larger configurations (384-384 and 512-512) fail to converge or perform very poorly. Here, we report results only for the ME configuration for all models. For each model, we picked the saved model (over 15 epochs) with the highest validation BLEU.

Decoding
At test time we use greedy decoding to find the most likely target sentence. We also experiment with a post-processing strategy which replaces UNKs in the target output with the highest-aligned (maximum-attention) source word. We find that this gives a small jump in BLEU of about 0.1-0.2 for all neural models. Our best model, for instance, gets a jump of 0.14 to reach a BLEU of 31.26 from 31.12.
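The UNK-replacement post-processing can be sketched as follows (a minimal illustration; names are ours):

```python
def replace_unks(output_tokens, attention, input_tokens, unk="<unk>"):
    """Replace each UNK in the decoded output with the source word that
    received maximum attention at that decoder step.
    attention[t][i] = attention weight on input position i at output step t."""
    result = []
    for t, tok in enumerate(output_tokens):
        if tok == unk:
            best = max(range(len(input_tokens)), key=lambda i: attention[t][i])
            tok = input_tokens[best]
        result.append(tok)
    return result

attention = [[0.1, 0.8, 0.1]]   # step 0 attends mostly to input position 1
out = replace_unks(["<unk>"], attention, ["holy", "francis", "!"])
```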

Results
The results in Table 3 confirm most of our hypotheses about the right architecture for this task.
• Copy component: We can observe from Table 3 that the various Copy models each outperform their SimpleS2S counterparts by at least 7-8 BLEU points.
• Retrofitting dictionary constraints: The Retro configurations generally outperform their corresponding Plain configurations. For instance, our best configuration Copy.Yes.RetroExtFixed gets a better BLEU than Copy.Yes.PlainExtFixed by a margin of at least 11 points.
• Sharing Embeddings: Sharing source- and target-side embeddings benefits all the Retro configurations, although it slightly deteriorates performance (about 1 BLEU point) for some of the Plain configurations.
• Fixing Embeddings: Fixed configurations almost always perform better than the corresponding Var ones. For instance, Copy.Yes.RetroExtFixed gets a BLEU of 31.12 compared to 20.95 for Copy.Yes.RetroExtVar. Due to the fixed embeddings, the former has roughly half as many parameters as the latter (5.25M vs 9.40M).
• Effect of External Data: Pretraining with external data (Ext) works well in combination with retrofitting (Retro). For instance, Copy.Yes.RetroExtFixed gets a BLEU improvement of 2+ points over Copy.Yes.RetroFixed.

• Effect of Pretraining: For the SimpleS2S models, pre-training adversely affects BLEU. However, for the Copy models, pre-training leads to improvements in BLEU. The simplest pretrained Copy model, Copy.No.PlainVar, has a BLEU score 1.8 points higher than Copy.No.NoneVar.
• PINC scores: All the neural models have higher PINC scores than the statistical and dictionary approaches, indicating that the target sentences they produce differ more from the source sentences.
• Sentinel Loss: Adding the sentinel loss does not have any significant positive effect, and ends up reducing BLEU by a point or two, as seen with the Copy+SL configurations.

Related Work
There has been some prior work on style adaptation. Xu et al. (2012) use phrase-table-based statistical machine translation to transform text to a target style; our method, in contrast, is an end-to-end trainable neural network. Saha Roy et al. (2015) leverage different language models, based on geolocation and occupation, to align a text to a specific style. However, their work is limited to the addition of adjectives and adverbs. Our method can handle more generic transformations, including the addition and deletion of words.
Pointer networks (Vinyals et al., 2015) allow the use of input-side words directly as output in a neural S2S model, and have been used for tasks like extractive summarization (See et al., 2017; Zeng et al., 2016) and question answering (Wang and Jiang, 2016). However, pointer networks cannot generate words not present in the input. A mixture model of a recurrent neural network and a pointer network has been shown to achieve good performance on language modeling (Merity et al., 2016). S2S neural models, first proposed by Sutskever et al. (2014) and enhanced with an attention mechanism by Bahdanau et al. (2014), have yielded state-of-the-art results for machine translation (MT) (Rush et al., 2015), summarization, etc. In the context of MT, various settings such as multi-source MT (Zoph and Knight, 2016) and MT with external information (Sennrich et al., 2016) have been explored. Distinct from all of these, our work attempts to solve a Modern → Shakespearean English style transformation task. Although closely related to both paraphrasing and MT, our task has some differentiating characteristics, such as considerable source-target overlap in vocabulary and grammar (unlike MT) and different source and target language varieties (unlike paraphrasing). Gangal et al. (2017) have proposed a neural sequence-to-sequence solution for generating portmanteaus from two English words. Though their task also involves large overlap between target and input, they do not employ any special copying mechanism. Unlike text simplification and summarization, our task does not involve shortening content length.

Conclusion
In this paper we have proposed a mixture model of a pointer network and an LSTM to transform Modern English text to Shakespearean-style English. We demonstrate the effectiveness of our proposed approaches over the baselines. Our experiments reveal the utility of incorporating an input-copying mechanism, and of using dictionary constraints, for problems with shared (but non-identical) source-target sides and sparse parallel data.
We have demonstrated the transformation to Shakespearean-style English only. Methods remain to be explored for achieving other stylistic variations, corresponding to the formality and politeness of text, usage of fancier words and expressions, etc. We release our code publicly to foster further research on stylistic transformations of text.

Figure 1: Pictorial depiction of our overall architecture at decoder step 3. Attention weights are computed using the previous decoder hidden state h_2, the encoder representations, and the sentinel vector. Attention weights are shared by the decoder RNN and pointer models. The final probability distribution over the vocabulary comes from both the decoder RNN and the pointer network. A similar formulation is used at all decoder steps.

Figure 2: Attention matrices from a Copy (top) and a SimpleS2S (bottom) model respectively, on the input sentence "Holy Saint Francis, this is a drastic change!". <s> and </s> are start and stop tokens. Darker cells are higher-valued.

Table 2
Qualitative Analysis
Figure 2 shows the attention matrices from our best Copy model (Copy.Yes.RetroExtFixed) and our best SimpleS2S model (SimpleS2S.Yes.RetroFixed) respectively, for the same input test sentence. Without an explicit Copy component, the SimpleS2S model cannot predict the words saint and francis, and drifts off after predicting the incorrect word flute.

Table 3: Test BLEU results. Sh denotes encoder-decoder embedding sharing (No = ×, Yes = ✓). Init denotes the manner of initializing the embedding vectors. The -Fixed or -Var suffix indicates whether embeddings are fixed or trainable. Copy and SimpleS2S denote the presence/absence of the Copy component. +SL denotes the sentinel loss.
Table 1 presents model outputs for some test examples. All neural outputs are lowercase due to our preprocessing; although this slightly affects BLEU, it helps prevent token occurrences from getting split due to capitalization. In general, the Copy model outputs re-