Poetry to Prose Conversion in Sanskrit as a Linearisation Task: A Case for Low-Resource Languages

The word ordering in a Sanskrit verse is often not aligned with its corresponding prose order. Conversion of the verse to its corresponding prose helps in better comprehension of the construction. Owing to resource constraints, we formulate this task as a word ordering (linearisation) task. In doing so, we completely ignore the word arrangement at the verse side. kāvya guru, the approach we propose, essentially consists of a pipeline of two pretraining steps followed by a seq2seq model. The first pretraining step learns task-specific token embeddings from pretrained embeddings. In the next step, we generate multiple hypotheses for possible word arrangements of the input. We then use them as inputs to a neural seq2seq model for the final prediction. We empirically show that the hypotheses generated by our pretraining step result in predictions that consistently outperform predictions based on the original order in the verse. Overall, kāvya guru outperforms current state-of-the-art models in linearisation for the poetry to prose conversion task in Sanskrit.


Introduction
Prosody plays a key role in the word arrangement in Sanskrit poetry. The word arrangement in a verse should result in a sequence of syllables which adheres to one of the prescribed meters in Sanskrit prosody (Scharf et al., 2015). As a result, the configurational information of the words in a verse is not aligned with its verbal cognition (Bhatta, 1990; Dennis, 2005). Obtaining the proper word ordering, called the prose ordering, from a verse is often considered a task which requires linguistic expertise (Shukla et al., 2016; Kulkarni et al., 2015). In this work, we use neural sequence generation models for automatic conversion of poetry to prose. Lack of sufficient poetry-prose parallel data is an impediment in framing the problem as a seq2seq task (Gu et al., 2018). Hence, we formulate our task as a word linearisation task (He et al., 2009). In linearisation, we arrange a bag of words into a grammatical and fluent sentence (Liu et al., 2015). This eliminates the need for parallel data, as the poetry order is no longer relevant at the input. A neural-LM based model from Schmaltz et al. (2016) and a seq2seq model from Wiseman and Rush (2016) are the current state of the art (SOTA) models in the linearisation task.
* Work done while the author was at IIT Kharagpur.
We first show that a seq2seq model with gated CNNs (Gehring et al., 2017), using a sequence level loss (Edunov et al., 2018), can outperform both the SOTA models for the Sanskrit poetry linearisation task. But using a seq2seq model brings non-determinism to the model, as the final prediction of the system depends on the order in which the words are input to the encoder (Vinyals et al., 2016). We resolve this by using a pretraining approach to obtain an initial ordering of the words, to be fed to the final model. This approach consistently performs better than using the original poetry order as input. Further, we find that generating multiple hypotheses using this component, to be fed to the final seq2seq component, improves the results by about 8 BLEU points. Additionally, we use a pretraining approach to learn task specific word embeddings by combining multiple word embeddings (Kiela et al., 2018). We call our final configuration kāvya guru. 'kāvya guru' is a compound word in Sanskrit, which roughly translates to 'an expert in prosody'.

Poetry to Prose as Linearisation
Given a verse sequence $x_1, x_2, \ldots, x_n$, our task is to rearrange the words in the verse to obtain its prose order. As shown in Figure 2, kāvya guru takes the Bag of Words (BoW) $S$ as the input to the system. We use two pretraining steps prior to the seq2seq component in our approach. The first step, 'DME', combines multiple pretrained word embeddings, say $\{w_{11}, w_{12}, w_{13}\}$ for a token $x_1 \in S$, into a single meta-embedding, $w^{DME}_1$. The second component, 'SAWO', is a linearisation model in itself, which we use to generate multiple hypotheses, i.e., different permutations of the tokens, to be used as input to the final 'seq2seq' component.

Pretraining Step 1 - Dynamic Meta Embeddings (DME): Given a token $x_i \in S$, we obtain $r$ different pre-trained word embeddings, represented as $\{w_{i1}, w_{i2}, \ldots, w_{ir}\}$. Following Kiela et al. (2018), we learn a single task specific embedding, $w^{DME}_i$, using a weighted sum of all the $r$ embeddings. The scalar weights for combining the embeddings are learnt using self-attention, with a training objective to minimise the negative log likelihood of the sentences, given in the prose order.
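The weighted combination described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the $r$ embeddings of a token have already been projected to a shared dimension, and `attn_vector` and `attn_bias` are hypothetical names for the learned attention parameters, which in practice would be trained with the negative log likelihood objective.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_meta_embedding(embeddings, attn_vector, attn_bias=0.0):
    """Combine r same-sized embeddings of one token into a single vector.

    embeddings:  list of r vectors (lists of floats), assumed to be already
                 projected to a shared dimension d
    attn_vector: hypothetical learned attention parameters, length d
    Returns the weighted sum and the scalar weights used.
    """
    # One scalar score per embedding, turned into weights that sum to 1.
    scores = [
        sum(a * w for a, w in zip(attn_vector, emb)) + attn_bias
        for emb in embeddings
    ]
    weights = softmax(scores)
    dim = len(embeddings[0])
    combined = [
        sum(wt * emb[k] for wt, emb in zip(weights, embeddings))
        for k in range(dim)
    ]
    return combined, weights


# Toy usage: three 2-dimensional embeddings for one token.
w_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec, weights = dynamic_meta_embedding(w_i, attn_vector=[0.0, 0.0])
```

With all-zero attention parameters the weights are uniform, so the result reduces to a plain average of the input embeddings; training moves the weights away from this starting point.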

Pretraining Step 2 - Self-Attention Based Word-Ordering (SAWO): SAWO allows us to generate multiple permutations of words as hypotheses, which can be used as input to a seq2seq model. Here, we use a previously proposed word ordering model itself as a pretraining step. From step 1, we obtain the DME embeddings $\{w^{DME}_1, w^{DME}_2, \ldots, w^{DME}_n\}$, one for each token in $S$. For each token in $S$, we also learn additional embeddings, $\{sa_1, sa_2, \ldots, sa_n\}$, using the self-attention mechanism. These additional vectors are obtained using the weighted sum of all the DME embeddings in the input BoW $S$, where the weights are learned using the self-attention mechanism (Vaswani et al., 2017). As shown in Figure 2, the DME vector $w^{DME}_i$ and the vector $sa_i$ are then concatenated to form a representation for the token $x_i$. The concatenated vectors so obtained for all the tokens in $S$ form the input to the decoder.
We use an LSTM based decoder, initialised with the average of the DME embeddings of all the tokens ($\{w^{DME}_1, w^{DME}_2, \ldots, w^{DME}_n\}$) at the input. A special token is used as the input in the first time-step, and based on the predictions from the decoder, the concatenated vectors are input in the subsequent time-steps. The decoder is constrained to predict only from the words in the BoW which have not yet been predicted at a given instance. We use a beam search based decoding strategy (Schmaltz et al., 2016) to obtain the top-k hypotheses for the system.
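The constrained decoding step can be illustrated with a small beam search over a bag of words. The sketch below is an assumption-laden stand-in: `score_next` is a hypothetical scoring function that replaces the LSTM decoder, and the bigram table and the three tokens are toy values chosen only to make the example runnable.

```python
from collections import Counter


def constrained_beam_search(bag, score_next, beam_size=3, top_k=3):
    """Return up to top_k orderings of `bag`, best first.

    bag:        list of tokens (duplicates allowed)
    score_next: hypothetical scorer, (prefix tuple, candidate token) -> log
                score; it stands in for the trained decoder
    """
    # Each beam entry: (cumulative score, prefix so far, tokens still unused).
    beam = [(0.0, (), Counter(bag))]
    for _ in range(len(bag)):
        candidates = []
        for score, prefix, remaining in beam:
            for token in remaining:  # only tokens not yet predicted
                left = remaining.copy()
                left[token] -= 1
                if left[token] == 0:
                    del left[token]
                candidates.append(
                    (score + score_next(prefix, token), prefix + (token,), left)
                )
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return [list(prefix) for _, prefix, _ in beam[:top_k]]


# Toy bigram log scores (illustrative values, not from the paper).
BIGRAM_LOGP = {
    ("<s>", "ramah"): -0.1,
    ("ramah", "vanam"): -0.2,
    ("vanam", "gacchati"): -0.1,
}


def toy_scorer(prefix, token):
    previous = prefix[-1] if prefix else "<s>"
    return BIGRAM_LOGP.get((previous, token), -5.0)


bag = ["gacchati", "vanam", "ramah"]
hypotheses = constrained_beam_search(bag, toy_scorer, beam_size=3, top_k=3)
```

Because every expansion draws only from the unused tokens, each returned hypothesis is a permutation of the input bag, which is the property the seq2seq component relies on when these hypotheses are used as its inputs.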
For both the pretraining steps, the training objective is to minimise the negative log likelihood of the ground truth (prose order sentences), and both the components are trained jointly. The multiple hypotheses so generated are used as independent inputs to the seq2seq model, with the prose order as their corresponding ground truth for training. This helps us to obtain a k-fold increase in the amount of available training data. In Figure 2, we show only one hypothesis from SAWO.
The seq2seq model: We use a seq2seq model comprising gated CNNs (Gehring et al., 2017) for the task. Our training objective is a weighted combination of the expected risk minimisation (RISK) and the token level negative log likelihood with label smoothing (TokLS) (Edunov et al., 2018). Here, we use a uniform prior distribution over the vocabulary for label smoothing. RISK minimises the expected value of a given cost function, BLEU in our case, over the space of candidate sequences (Equation 1). Here $U$ is the candidate set, with $|U| = 16$, and the sequences in $U$ are obtained using beam search. The size for the beam search was determined empirically. $\hat{y}$ is the reference target sequence, i.e., the prose. $x$ is the input sequence to the model, which is obtained from SAWO. As in Wiseman and Rush (2016), we constrain the prediction of tokens to those available at the input during testing.
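For reference, expected risk minimisation as formulated by Edunov et al. (2018) can be written with the symbols defined above. This is a sketch under two assumptions: that the cost is taken as 1 - BLEU, and that candidate probabilities are renormalised over $U$. The mixing weight $\alpha$ is a hypothetical name, since the text does not give the weighting.

```latex
\mathcal{L}_{\mathrm{RISK}}
  = \sum_{u \in U} \bigl(1 - \mathrm{BLEU}(\hat{y}, u)\bigr)\,
    \frac{p(u \mid x)}{\sum_{u' \in U} p(u' \mid x)}
\qquad
\mathcal{L}
  = \alpha\, \mathcal{L}_{\mathrm{TokLS}} + (1 - \alpha)\, \mathcal{L}_{\mathrm{RISK}}
```

Under this form, candidates in $U$ that score a higher BLEU against the reference $\hat{y}$ contribute less cost, so training shifts probability mass towards them.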
Majority Vote Policy: For an input verse, SAWO generates multiple hypotheses, and seq2seq then predicts a sequence corresponding to each of these, of the same size as the input. To get a single final output, we use a 'Majority Vote' policy. For each position, starting from the left, we find the token which was predicted most often at that position among all the seq2seq outputs, and choose it as the token in the final output.
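The position-wise vote can be sketched in a few lines. The tie-breaking rule here (the first token seen among equally frequent ones) is an assumption, since the text does not specify one, and `majority_vote` is an illustrative name.

```python
from collections import Counter


def majority_vote(predictions):
    """Pick, for every position, the token predicted most often.

    predictions: list of equal-length token lists, one per SAWO hypothesis.
    Ties go to the token encountered first (an assumption of this sketch).
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    output = []
    for position in range(length):
        counts = Counter(p[position] for p in predictions)
        output.append(counts.most_common(1)[0][0])
    return output


# Toy usage with three predicted orderings of the same bag of words.
outputs = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "a", "c"],
]
final = majority_vote(outputs)
```

Note that a purely position-wise vote, as sketched here, does not by itself guarantee that the final output uses every input token exactly once.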

Experiments
Dataset: We obtain 17,017 parallel poetry-prose sentence pairs from the epic "Rāmāyaṇa". Given that about 90% of the vocabulary appears fewer than 5 times in the corpus, we use BPE to learn a new vocabulary (Sennrich et al., 2016). We add about 95,000 prose-order sentences from Wikipedia to our training data, as the poetry order input is irrelevant for linearisation.
Data Preparation: With a vocabulary of 12,000, we learn embeddings for the BPE entries using Word2vec (Mikolov et al., 2013), FastText (Bojanowski et al., 2017), and character embeddings from Hellwig and Nehrdich (2018). The embeddings were trained on 0.8 million sentences (6.5 million tokens) collected from multiple corpora including DCS (Hellwig, 2011), Wikipedia and Vedabase. Finally, we combine the word embeddings using DME (Kiela et al., 2018).
From the set of 17,017 parallel poetry-prose sentence pairs, we use 13,000 pairs for training, 1,000 for validation and the remaining 3,017 for testing. The sentences in the test data are not used in any part of training or for learning the embeddings.
Evaluation Metrics: Results for linearisation tasks are generally reported using the BLEU score (Papineni et al., 2002; Hasler et al., 2017; Belz et al., 2011). Additionally, we report Kendall's Tau ($\tau$) and perfect match scores for the models. Perfect match is the fraction of sentences where the prediction matches exactly with the ground truth. Kendall's Tau ($\tau$) is calculated based on the number of inversions needed to transform a predicted sequence to the ordering in the reference sequence. $\tau$ is used as a metric in sentence ordering tasks (Lapata, 2006), and is defined as $\tau = \frac{1}{m}\sum_{i=1}^{m}\left(1 - \frac{2 \times \text{inversions count}}{\binom{n}{2}}\right)$ (Logeswaran et al., 2018; Lapata, 2003). In all these three metrics, a higher score always corresponds to a better performance of the system.
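The $\tau$ computation can be illustrated with a short sketch. It assumes that $m$ is the number of sentences and $n$ the length of each sentence, that every prediction is a permutation of its reference, and that repeated tokens are matched left to right; none of these details are stated in the text, and the function names are illustrative.

```python
from math import comb


def kendall_tau(predicted, reference):
    """tau for one sentence: 1 - 2 * inversions / C(n, 2).

    Assumes `predicted` is a permutation of `reference`; repeated tokens are
    matched to reference positions in left-to-right order.
    """
    n = len(reference)
    if n < 2:
        return 1.0  # a single token cannot be out of order
    positions = {}
    for index, token in enumerate(reference):
        positions.setdefault(token, []).append(index)
    # Rank of each predicted token = its position in the reference.
    ranks = [positions[token].pop(0) for token in predicted]
    inversions = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if ranks[i] > ranks[j]
    )
    return 1 - 2 * inversions / comb(n, 2)


def corpus_tau(pairs):
    """Average the per-sentence tau over (predicted, reference) pairs."""
    return sum(kendall_tau(p, r) for p, r in pairs) / len(pairs)


reference = ["a", "b", "c"]
same_order = kendall_tau(["a", "b", "c"], reference)
reversed_order = kendall_tau(["c", "b", "a"], reference)
one_swap = kendall_tau(["b", "a", "c"], reference)
```

A prediction identical to the reference has no inversions and scores 1, while a fully reversed prediction has all $\binom{n}{2}$ pairs inverted and scores -1.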

LSTM Based Linearisation Model (LinLSTM):
LinLSTM is an LSTM based neural language model (LM) proposed by Schmaltz et al. (2016). Sequences in sentence/prose order are fed to the system for learning the LM. Beam search, constrained to predict only from the bag of words given as input, is used for decoding. The authors obtained SOTA results in their experiments on the Penn Treebank, even outperforming different syntax based linearisation models (Zhang and Clark, 2015;Zhang, 2013). The best result for the model was obtained using a beam size of 512, and we use the same setting for our experiments.

Seq2Seq with Beam Search Optimisation (BSO):
This seq2seq model uses a max-margin approach with a search based loss, designed to penalise the errors made during beam search (Wiseman and Rush, 2016). Here, scores for different possible sequences are predicted and the sequences are then ranked using beam search. The loss penalises the model when the gold sequence falls off the beam during training. For our experiments, we use a beam size of 15 for testing and 14 for training, the setting with the best reported scores in Wiseman and Rush (2016).

Results
Table 1a provides the results for all the three systems under different settings. kāvya guru reports the best results with a BLEU score of 55.26, outperforming the baselines. We apply both the pretraining components and the 'Majority Vote' policy (§2) to both the seq2seq models, i.e. 'BSO' and the proposed model 'kāvya guru'. From Table 1a, it is evident that infusing prose-only training data from Wikipedia and applying both the pretraining steps leads to significant and consistent improvements for both the seq2seq models. LinLSTM shows a decrease in its performance when the dataset is augmented with sentences from Wikipedia. We obtain the best results for kāvya guru when self-attention was added to the seq2seq component of the model (Edunov et al., 2018; Paulus et al., 2018) (final row in Table 1a). Table 1c shows that the text-encoding/transliteration scheme in which a sequence is represented affects the results. kāvya guru performs the best when it uses syllable level encoding of input, as compared to character level transliteration schemes such as IAST (https://en.wikipedia.org/wiki/International_Alphabet_of_Sanskrit_Transliteration) or SLP1 (https://en.wikipedia.org/wiki/SLP1). For all the reported results, we use the approximate randomisation approach for significance tests; all the reported values have a p-value < 0.02.
Effect of increase in training set size due to SAWO: Using SAWO, we can generate multiple word order hypotheses as the input to the seq2seq model. Results from Table 1b show that generating multiple hypotheses leads to significant improvements in the system performance. It might be puzzling that kāvya guru contains two components, i.e. SAWO and seq2seq, both of which perform essentially the same task of word ordering. This might create an impression of redundancy in kāvya guru. But a configuration that uses only the DME and SAWO (without the seq2seq) results in a BLEU score of 33.8, as against 48.26 for kāvya guru (Table 1b, k = 1). Now, this brings the validity of the SAWO component into question. To check this, instead of generating hypotheses using SAWO, we used 100 random permutations of a given sentence as input to the seq2seq component (the number was decided empirically, varying it from 1 to 100 random permutations with a step size of 10). The first 3 rows of BSO and kāvya guru in Table 1a show the results for non-SAWO configurations. These configurations do not outperform the SAWO based configurations, despite using 10 times as many candidates as the SAWO based configurations. System performance tends to saturate beyond 10 hypotheses for SAWO and beyond 100 permutations for the non-SAWO configurations.
Effect of using word order in the verse at inference: During inference, the test-set sentences are passed as input in the verse order to each of the kāvya guru configurations in Table 1a. The kāvya guru+DME configuration achieves the best result in this setting. But here also, the system performance drops to τ = 68.92 and BLEU = 45.63, from 70.8 and 48.33, respectively. To discount the effect of the majority vote policy used with SAWO, we consider predictions based on individual SAWO hypotheses. However, even the lowest τ score (70.61), obtained while using the 10th ranked hypothesis from SAWO, significantly outperforms the predictions based on the verse order.

Conclusion
In this work, we attempt to address the poetry to prose conversion problem by formalising it as an LM based word linearisation task. We find that kāvya guru outperforms the state of the art models in word linearisation for the task. Though tremendous progress has been made in digitising texts in Sanskrit, they still remain inaccessible largely due to lack of specific tools that can address linguistic peculiarities exhibited by the language (Krishna et al., 2017). From a pedagogical perspective, it will be beneficial for learners of the language to look into the prose of the verses for an easier comprehension of the concepts discussed in the verse.