Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks

The paper introduces end-to-end neural network models that tokenize Sanskrit by jointly splitting compounds and resolving phonetic merges (Sandhi). Tokenization of Sanskrit depends on local phonetic and distant semantic features, which are incorporated using convolutional and recurrent elements. Contrary to most previous systems, our models do not require feature engineering or external linguistic resources, but operate solely on parallel versions of raw and segmented text. The models discussed in this paper clearly improve over previous approaches to Sanskrit word segmentation. As they are language agnostic, we demonstrate that they also outperform the state of the art for the related task of German compound splitting.


Introduction
Sanskrit is an Indo-Aryan language that served as the lingua franca for the religious, scientific and literary communities of ancient India. Text production in Sanskrit started in the 2nd millennium BCE and has continued until today; it was oral until the first centuries BCE (Falk, 1993), and the texts were transmitted by memorization in this period, making them less (!) prone to transmission errors than in written form. A 19th century cataloguing project recorded more than 40,000 Sanskrit texts known at that time (Aufrecht, 1891-1903), which covers only a small part of the extant Sanskrit literature. Apart from the oldest Vedic texts, Sanskrit shows little diachronic variation on the morphological level, because it was regularized by the grammarian Pāṇini in the 3rd c. BCE. NLP of Sanskrit is challenging due to compounding (see Ex. 1) and the phonetic processes called Sandhi ('connection'; see Exx. 2-5). Compounding is widely used in other languages, and NLP has developed methods for analyzing compounds (Macherey et al., 2011). In Sanskrit, however, syntactic co- and subordination tend to be diachronically replaced by compounding (Lowe 2015; see also Sec. 3), so that many sentences in later literature consist only of a few long compounds that are loosely connected by a semantically light verb or an (optional) copula, as shown in this example:

(1) āśrayabhūtakhādikathanena
    foundation-become-air-etc.-mentioning
    "(Something is described) by mentioning air etc. that have become [its] foundations."

The term Sandhi denotes a set of phonetic processes by which the contact phonemes of neighboring word tokens are changed and merged, and which create unseparated strings spanning multiple tokens (Whitney, 1879). Sandhi occurs between adjacent vowels (vocalic Sandhi; Ex. 2), between consonants and vowels (Ex. 4) and between adjacent consonants (Ex. 5):

(2) rājā + uvāca 'the king said': ā + u = o → rājovāca

In addition, Sandhi occurs between independent inflected words (Ex. 2-5) as well as between members of compounds. Because different combinations of unsandhied phonemes can result in the same surface phoneme, Sandhi resolution is non-deterministic and depends on the semantic context of the sentence (see Ex. 3 for a morphologically and lexically valid, but semantically dispreferred reading of the string rājovāca).
Scriptorial and editorial conventions further complicate the analysis of compounds and Sandhi. While most Indian manuscripts do not insert spaces between strings, modern editors use spaces at their own discretion. Moreover, the (correct) application of Sandhi is not followed by all authors and editors to the same extent, so that the unsandhied tokens tat hi asti ('as this is ...') can occur as taddhyasti, taddhy asti, tad dhy asti, tad dhyasti or even tat hi asti (unchanged).
Our models aim to transform a given sentence into a sequence of unsandhied tokens. We refer to this task as Sanskrit word splitting (SWS), and subsume Sandhi and compounding phenomena under the common term splits. We address SWS by using a combination of convolutional and recurrent elements. The recurrent elements integrate sentence level information that leads to informed decisions about the semantic meaningfulness of possible compound and Sandhi splits (see Ex. 2 and 3), while the convolutional elements are meant to replace n-gram extraction, which is frequently used in word segmentation architectures. As our models operate on the character level, SWS can be formulated in a sequence labeling framework.
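To illustrate this sequence labeling view, consider the following minimal Python sketch. The label inventory ("copy" plus explicit Sandhi rules such as "ā+u=o") and the function names are our own simplifications, not the paper's annotation format:

```python
# Sketch: SWS as character-level sequence labeling. Each surface character
# receives either "copy" (emit unchanged) or a Sandhi/split rule that
# rewrites it. Rule notation "lhs=rhs" is illustrative only.
surface = "rājovāca"  # merged form of rājā + uvāca (Ex. 2)

# One label per surface character; "ā+u=o" marks the vocalic Sandhi
# resolved at the surface character 'o'.
labels = ["copy", "copy", "copy", "ā+u=o", "copy", "copy", "copy", "copy"]

def apply_labels(surface, labels):
    """Reconstruct the unsandhied token sequence from per-character labels."""
    out = []
    for ch, lab in zip(surface, labels):
        if lab == "copy":
            out.append(ch)
        else:
            lhs = lab.split("=")[0]            # e.g. "ā+u"
            out.append(lhs.replace("+", " "))  # the merge point becomes a space
    return "".join(out)

print(apply_labels(surface, labels))  # reconstructs "rājā uvāca"
```

Casting the task this way keeps input and output aligned character by character, which is what allows the convolutional and recurrent elements below to operate position-wise.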
Consequently, this paper has three main contributions: 1. We introduce novel character-based models for SWS that beat state-of-the-art models by large margins.
2. We compare against sequence-to-sequence models and demonstrate that our models work on par with them, but need significantly less time for training and inference.
3. We publish a new dataset for Sanskrit word splitting that consists of more than 560,000 sentences with manually validated splits. The dataset and the code are released at https://github.com/OliverHellwig/sanskrit/papers/2018emnlp.
In the rest of this paper, we use the following terminology. A token is an unsandhied word that is not itself a compound. A string is a sequence of characters that is delimited by a space or a daṇḍa.
Each string contains at least one token, at least one compound (which itself consists of at least two tokens) or a Sandhied mixture of both. A sentence is a piece of Sanskrit text that is terminated by the punctuation mark called daṇḍa "stick" (|) and consists of at least one string. Any sentence can consist of multiple independent clauses, which are not demarcated by punctuation in Sanskrit, or of a part of a larger clause only. The paper proceeds as follows: Section 2 gives an overview of related work in NLP. Section 3 introduces our SWS dataset. Section 4 describes the sequence labeling models developed for this paper and three baseline systems, whose evaluation is presented in Sec. 5. Section 6 summarizes the paper.
Related Work

A number of recent papers approach SWS with deep learning models. Hellwig (2015b) splits isolated strings by applying a one-layer bidirectional LSTM to two parallel character-based representations of a string. The restriction to isolated strings is problematic, because SWS relies on the grammatical and semantic context of the full sentence in many cases. Restricting a model to isolated strings ignores these linguistic clues. Reddy et al. (2018) formulate SWS as a translation task on the sentence level. They transform surface and unsandhied sentences using the sentencepiece model and "translate" the surface into the unsandhied sentence using a seq2seq model with attention. A further approach uses an encoder-decoder architecture with a global attention mechanism, applied to isolated strings from a small dataset (Bhardwaj et al., 2018). So far, no direct comparison of deep learning models for SWS has been done, because the authors used different, partly unpublished datasets and reported performance on different linguistic levels (sentence, string) and with different evaluation methods. We therefore try to make a fair and comprehensive comparison with the state of the art in Sec. 5.
SWS is closely related to word segmentation for other Asian languages such as Thai (Haruechaiyasak et al., 2008), Chinese or Japanese (Kanji), with most research being done for Chinese and Japanese. Contrary to Sanskrit, Chinese and Japanese do not exhibit Sandhi phenomena, and their logographic scripts condense information, making it possible to use "word-level" CRFs on the output, for example. Chen et al. (2015) interpret Chinese word segmentation (CWS) as a sequence labeling task and evaluate a range of (stacked) bidirectional recurrent architectures that are combined with a final sentence level likelihood layer (Collobert et al., 2011) maximizing the transition score of the BMES-encoded target sequence. Their best model uses a single-layer bidirectional LSTM with bigrams of pre-trained character embeddings as inputs. Cai and Zhao (2016) deal with CWS by first forming word hypotheses from characters using a gated unit and then processing the word hypotheses with an LSTM-based language model. They minimize the combined word and sentence level scores using a structured margin loss and achieve better performance than Chen et al. (2015) on standard CWS datasets. Kitagawa and Komachi (2017) adapt the model proposed by Chen et al. (2015) for Japanese word splitting, but use characters, character n-grams and lexicon-based word boundary features as inputs. The authors report state of the art performance, but observe a clear drop in the F score of their model when texts contain a high proportion of Hiragana characters and thus come closer to syllabic or alphabetic scripts.

Data
Several datasets for SWS have been published in recent years. While the dataset of Bhardwaj et al. (2018) may be too small and unvaried for training deep learning models, Krishna et al. (2017) re-analyze 560,000 sentences from the Digital Corpus of Sanskrit (DCS) using the Sanskrit Heritage Reader (Goyal and Huet, 2016). Re-analysis is necessary, because the DCS stores the morpho-lexical analysis of strings, but does not record split points and the Sandhi rules applied. Due to different linguistic choices (Pāṇinian vs. corpus-oriented) and to different ideas about the (non-)compositional meanings of compounds, their final dataset contains only 115,000 sentences (see the discussion in Krishna et al. 2017 and the analysis in Sec. 3.1). As the size of the dataset is crucial for most deep learning methods, we decided to release a new dataset along with this paper. Each sentence contained in the DCS is re-analyzed using the SanskritTagger software (Hellwig, 2009). Our dataset contains the surface forms of sentences in the DCS and the split points and Sandhi rules that the tagger proposes for their morpho-lexical gold analyses stored in the DCS. We did not differentiate between compound and inter-word splits, as this distinction would introduce morphological categories into the dataset. Table 1 shows an example of the annotation format:

    Surface:    r ā j o   v ā c a
    Unsandhied: r ā j ā-u v ā c a

Table 1: Data extracted from the string rājovāca, which is split into the two tokens rājā ("king") and uvāca ("(he, she) said"; see Ex. 2).

Table 2 shows the statistics of our dataset, split by text genres (first column). The dataset contains 2,978,509 strings and 4,171,682 tokens in 561,596 sentences. Most sentences come from the Epic and scientific (medicine, alchemy, astronomy) domains. While Epic texts are mostly written in easy, plain Sanskrit, the scientific works use many uncommon terms (likely to reoccur in the lexicographic domain) and long compounds.
Sentence length is higher in the prose subcorpora (Buddhist, Vedic prose, ritualistic texts).
The fourth column shows that split phenomena are frequent in Sanskrit, occurring for more than 8% of all characters. Columns 5 and 6 report the proportions of complicated splits in relation to all splits. While 15% of all splits are resolved into a vocalic Sandhi, compound breaks are the dominant split type, which is also responsible for the majority of errors and ambiguities (see Sections 3.1 and 5.2). The last column also reflects the diachronic development from earlier texts with limited compounding (Vedic, ritualistic and Dharma texts) towards classical Sanskrit, which shows a strong preference for compounding. We use a fixed split of 90% of the sentences for training, a development set of 5% for parameter optimization and 5% for testing.

Quality of the training data
The dataset released by Krishna et al. (2017) and the one released with this paper both build on the DCS as gold standard. As this corpus was curated by a single user and the project never released proper annotation guidelines, one may suspect that it contains a certain level of inconsistencies and errors that influence the quality of the models and impose an upper limit on model accuracy.
In order to estimate the size of these effects, the authors of this paper independently corrected the analyses of 50 sentences randomly drawn from the training set (250 words, 2,354 characters including spaces). The corrections made by the authors differed at 23 character positions, corresponding to 20 strings in 15 sentences. 16 of these differences concerned compound splits, where the authors disagreed about the (non-)compositional meaning of compounds. A good example of such a disagreement is the string rājayoga, which was split as rāja-yoga "king-Yoga" = "Yoga of a king" by one author (compositional reading), but left unchanged as the name of a school of Yoga by the other one (non-compositional reading). After adjudicating these disagreements, there remain 5 of 250 strings with annotation errors in the training data, which corresponds to an error level of 2% of all strings and 0.2% of all characters for this sample.
We further explored the effect of compositionality by independently splitting 56 sentences of the Buddhist treatise Triṃśikāvijñaptibhāṣya, which is not part of the DCS. As the text uses highly technical terminology, the degree of disagreement can be expected to be higher than for plain narrative texts. We adjudicated our Sandhi annotations, but kept conflicts in compound splitting unresolved. 94.5% of all strings (394 of 417) and 69.7% of all sentences obtained the same compound analysis from both authors. Again, the majority of differences (11 of 23) showed up when a compound can have a non-compositional meaning that is closely connected with its compositional reading. Evaluation will show that these cases are responsible for a large part of the model errors.

Input Features
The character-based models are trained with embeddings of the individual surface characters, which are initialized with uniform random values from [−1, +1] and updated during training. Following Kitagawa and Komachi (2017), the input can be enriched with multinomial split probabilities that are built from the training data. When the training data contain a split rule for surface character t_i at position i, we extract left (g^L_{i,n}) and right (g^R_{i,n}) character n-grams with lengths n ∈ [2, 7] that end/start at position i, so that g^L_{i,n} = {t_{i−n+1}, t_{i−n+2}, ..., t_i} and g^R_{i,n} = {t_i, t_{i+1}, ..., t_{i+n−1}}. Counts #(.) for individual n-grams are accumulated over the whole training set. At training and test time, a vector v_p ∈ R^{2·(7−2+1)=12} is assigned to each character position. Its element corresponding to the left n-gram of length 2, for example, is calculated as the relative frequency with which this n-gram co-occurs with a split in the training data, i.e. #(g^L_{i,2}, split) / #(g^L_{i,2}). We evaluate the influence of split probabilities in the ablation study (Sec. 5.2).
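The split-probability features can be sketched as follows. The data format (character lists paired with boolean split indicators) and the function names are our own assumptions; only the counting scheme follows the description above:

```python
from collections import Counter

def split_probability_features(sentences, n_min=2, n_max=7):
    """Sketch of the multinomial split-probability features.

    `sentences` is a list of (chars, is_split) pairs, where is_split[i]
    is True when the training data contain a split rule at position i.
    For every left/right n-gram ending/starting at i we count how often
    it co-occurs with a split; the feature value at test time is the
    relative frequency  #(n-gram, split) / #(n-gram).
    """
    total, with_split = Counter(), Counter()
    for chars, is_split in sentences:
        for i in range(len(chars)):
            for n in range(n_min, n_max + 1):
                if i - n + 1 >= 0:                    # left n-gram exists
                    g = ("L", n, tuple(chars[i - n + 1:i + 1]))
                    total[g] += 1
                    with_split[g] += is_split[i]
                if i + n <= len(chars):               # right n-gram exists
                    g = ("R", n, tuple(chars[i:i + n]))
                    total[g] += 1
                    with_split[g] += is_split[i]

    def features(chars, i):
        """2*(n_max-n_min+1)-dimensional vector v_p for position i."""
        v = []
        for n in range(n_min, n_max + 1):
            g = ("L", n, tuple(chars[i - n + 1:i + 1])) if i - n + 1 >= 0 else None
            v.append(with_split[g] / total[g] if g and total[g] else 0.0)
        for n in range(n_min, n_max + 1):
            g = ("R", n, tuple(chars[i:i + n])) if i + n <= len(chars) else None
            v.append(with_split[g] / total[g] if g and total[g] else 0.0)
        return v

    return features
```

With the defaults n ∈ [2, 7], the returned vector has 2·(7−2+1) = 12 elements, matching the dimensionality stated above.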

External Models for Comparison
We compare our models against the following baselines:

Bidirectional RNN: We re-implement the model described in Hellwig (2015b), but apply it to full sentences instead of isolated strings. Character embeddings are fed into a bidirectional recurrent layer with LSTM units. The output of the recurrent layer is additionally regularized using dropout (Srivastava et al., 2014), and classification is performed using softmax with cross-entropy loss. We decode the output of the softmax in a greedy fashion without considering interactions between adjacent output classes.
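The greedy decoding step of this baseline can be illustrated with a minimal numpy sketch; array shapes and label names are our own assumptions:

```python
import numpy as np

def greedy_decode(logits, label_inventory):
    """Greedy decoding of per-character classifier outputs: the softmax
    is taken independently at every position and the argmax label is
    emitted, with no modeling of interactions between adjacent output
    classes. `logits` has shape (sequence_length, num_labels)."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return [label_inventory[j] for j in probs.argmax(axis=1)]
```

Because each position is decoded independently, this decoder cannot enforce consistency between neighboring labels, which is exactly the limitation noted above.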
seq2seq: We retrain the model described in Reddy et al. (2018) on our data after preprocessing it with the unsupervised text tokenizer sentencepiece (Schuster and Nakajima, 2012); code for the model is available at https://github.com/cvikasreddy/skt, and for the tokenizer at https://github.com/google/sentencepiece.

Transformer: As an alternative to recurrence-based seq2seq, we apply the model described in Vaswani et al. (2017) to the input pre-processed with sentencepiece. This model relies entirely on an attention mechanism to draw global dependencies between input and output. To the best of our knowledge, this is the first time that this model has been used for SWS. We use the publicly available implementation tensor2tensor (https://github.com/tensorflow/tensor2tensor).

Models Combining RNN and CNN
Convolutional Element: Combinations of recurrent and convolutional (LeCun et al., 1998) elements are effective for tasks where complex local features are extracted by the convolutional element and then considered in larger contexts by the recurrent element (and vice versa; see Bjerva et al. 2016 or Ma and Hovy 2016). We use convolutional features c_i as proposed by Kim (2014). Let w denote the width of the input matrix X of the convolution (= number of time steps), h its height, n the width of the convolutional filter f^n ∈ R^{n×h}, σ(.) a non-linearity (Rectified Linear Units (Nair and Hinton, 2010) in this paper) and b a bias. A convolutional feature at character position i and for filter j is then the filter response c^n_{ij} = σ(f^n_j · X_{i−⌊n/2⌋:i+⌊n/2⌋} + b), where X_{i−⌊n/2⌋:i+⌊n/2⌋} is the patch of n consecutive time steps centered at i. The feature map c^n_i for m different filters is formed by concatenating the convolutional features (c^n_i = [c^n_{i1}, c^n_{i2}, ..., c^n_{im}]), and the output c_i of the convolutional element is formed by concatenating the feature maps (c_i = c^1_i ⊕ c^3_i ⊕ ...). We use odd filter widths only to avoid problems with patch alignment. We tested convolution with small quadratic filters as used in image convolution, as well as other methods for combining the learned filters such as averaging, addition or max-pooling of the stacked filters, but did not observe improved performance on the dev set.
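The convolutional element can be sketched in numpy as follows; filter values, shapes and function names are illustrative assumptions, not the paper's code:

```python
import numpy as np

def conv_features(X, filters, widths, bias=0.0):
    """Sketch of the convolutional element. X is the (time_steps,
    embedding_dim) input; filters[n] holds m filters of odd width n,
    shape (m, n, embedding_dim). For every position i the ReLU-activated
    responses of all filters are computed, and the feature maps for the
    different widths are concatenated. Zero padding keeps one output
    vector per character (odd widths avoid patch misalignment)."""
    w, h = X.shape
    outputs = []
    for n in widths:
        pad = n // 2
        Xp = np.pad(X, ((pad, pad), (0, 0)))          # zero-pad the borders
        maps = np.empty((w, filters[n].shape[0]))
        for i in range(w):
            patch = Xp[i:i + n]                       # (n, h) window around i
            resp = np.tensordot(filters[n], patch, axes=([1, 2], [0, 1]))
            maps[i] = np.maximum(0.0, resp + bias)    # ReLU non-linearity
        outputs.append(maps)
    return np.concatenate(outputs, axis=1)            # c_i = c^1_i ⊕ c^3_i ⊕ ...
```

With the settings reported later (100 feature maps for each of the widths 3, 5 and 7), every character position would receive a 300-dimensional convolutional feature vector.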
Model 1: Convolution → Recurrency (crNN): As an alternative to n-gram extraction (Chen et al., 2015; Kitagawa and Komachi, 2017), a convolutional element is applied to the character embeddings (see Fig. 1a). Its outputs are fed into a bidirectional recurrent layer (Schuster and Paliwal, 1997). As in the baseline RNN (Sec. 4.2), dropout is inserted after the recurrent layer, and classification is performed using softmax with cross-entropy loss and greedy decoding.

Model 2: Recurrency → Convolution (rcNN)
The order of convolutional and recurrent elements is switched (see Fig. 1b), so that the convolutional operation replaces additive n-gram formation before the classification layer. The remaining architecture is identical to that of crNN.

Model 3: rcNN with Shortcuts (rcNN_short)

This model extends rcNN by adding shortcut connections (Bishop, 2000) that concatenate the character embeddings and the RNN outputs with the concatenated feature maps c (see Fig. 1c). When e_i denotes the embedding of character i and r_i the output of the recurrent layer at position i, the input to the classification layer is defined as e_i ⊕ r_i ⊕ c_i. Shortcuts are evaluated because we hypothesized that access to unconvolved information about the input sequence and the output of the recurrent layer would facilitate the exact prediction of split locations. For a better control of information flow, we also experimented with residual (He et al., 2016) and highway (Srivastava et al., 2015) layers instead of shortcut layers, but could not observe improvements on the dev set, most probably because our models are not deep enough for these layer types to show effects.
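The shortcut connection of rcNN_short amounts to a simple position-wise concatenation. A minimal numpy sketch (the dimensions below are illustrative; the function name is ours):

```python
import numpy as np

def classifier_input_with_shortcuts(E, R, C):
    """Sketch of the rcNN_short input to the classification layer: for
    each character position i, the embedding e_i, the recurrent output
    r_i and the concatenated convolutional feature maps c_i are joined
    as e_i ⊕ r_i ⊕ c_i. All arrays have shape (time_steps, dim)."""
    return np.concatenate([E, R, C], axis=1)
```

For example, with 128-dimensional embeddings, a bidirectional layer of 2 × 200 hidden units and 3 × 100 convolutional feature maps, the classifier would see an 828-dimensional vector per character.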

Evaluation Settings
We use the following settings, found on the dev set, for the character-based models: embedding size 128; 200 hidden recurrent units; 100 convolutional feature maps with filter widths of 3, 5 and 7. We use regularized recurrent layers (Zaremba et al., 2014), and gradients with a magnitude higher than 5.0 are clipped. The models used for model selection (Sec. 5.2) are trained for 5 iterations, the other character-based models for 10. We train the Transformer in its default configuration as described in Vaswani et al. (2017) with a vocabulary size of 5k and report performance on the test set based on evaluations on the dev set. The model of Reddy et al. (2018) is trained for 80 epochs on our training data with the same parameters as described in the original paper. All calculations are run on a Maxwell Titan X GPU. We compare the models using sentence accuracy (#sentences without errors / #all sentences) and string-based P(recision), R(ecall) and F score, where P and R are equivalent to the measures used in the CWS bakeoffs (Sproat and Emerson, 2003).
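These measures can be sketched as follows. The sketch compares pure segmentations of the same character sequence (Sandhi rewriting is ignored here) and scores a predicted string as correct when its character span matches a gold span exactly, in the spirit of the CWS bakeoff measures; data format and names are our assumptions:

```python
def segment_spans(strings):
    """Map a list of segmented strings to a set of (start, end) spans."""
    out, pos = set(), 0
    for s in strings:
        out.add((pos, pos + len(s)))
        pos += len(s)
    return out

def evaluate(gold_sentences, pred_sentences):
    """Sentence accuracy plus string-based precision, recall and F.
    Each sentence is given as a list of segmented strings."""
    sent_acc = sum(g == p for g, p in zip(gold_sentences, pred_sentences)) \
        / len(gold_sentences)
    tp = fp = fn = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = segment_spans(gold), segment_spans(pred)
        tp += len(g & p)          # exactly matching strings
        fp += len(p - g)          # predicted strings not in gold
        fn += len(g - p)          # gold strings that were missed
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f = 2 * prec * rec / (prec + rec)
    return sent_acc, prec, rec, f
```

A single wrongly placed boundary thus costs both precision and recall, because it invalidates the strings on both sides of the boundary.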

Model Selection
The upper half of Tab. 3 compares the evaluation metrics for the three character-based models introduced in this paper, trained with and without split probabilities (Sec. 4.1). We test differences in string accuracy using the McNemar test. In general, all models that use recurrency before convolution (rcNN*) have string accuracy rates that are significantly higher at the 0.001 level than those of models that use convolution before recurrency (crNN). Table 3 shows that the differences in the performance of crNN and rcNN* are almost as large as those between the RNN baseline and the best model from this paper (lower half of Tab. 3), although crNN and rcNN* differ only in the order of recurrent and convolutional elements. We found this result surprising, because applying convolution to the character embeddings appeared to be a well-parametrized alternative to n-gram extraction, which is often the first step in architectures for Chinese and Japanese word segmentation.
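For reference, the McNemar test used here only needs the counts of test strings on which exactly one of the two compared models is correct. A self-contained sketch with a normal-approximation p-value (function name and interface are ours):

```python
from math import erf, sqrt

def mcnemar(b, c):
    """McNemar test for comparing two models on the same test items:
    b = items only model A gets right, c = items only model B gets
    right. Returns the chi-square statistic with continuity correction
    and an approximate two-sided p-value (1 degree of freedom, computed
    via the normal distribution)."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    z = sqrt(chi2)
    # survival function of chi^2(1): P(Z^2 > chi2) = 2 * (1 - Phi(z))
    p = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
    return chi2, p
```

Items that both models get right (or both wrong) do not enter the statistic, which is what makes the test suitable for paired comparisons on a shared test set.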
To further investigate this phenomenon, we evaluated 60 randomly chosen strings from the test set in which either crNN^split or (XOR) rcNN^split_short made an error. 45 of the errors relate to compound splitting, partly combined with vocalic Sandhi, either by missing a split (rcNN^split_short: 11, crNN^split: 15) or by oversegmenting compounds (rcNN^split_short: 13, crNN^split: 6). Most notably, rcNN^split_short tends to insert more splits than crNN^split. This behavior can be observed for missing splits and especially for oversegmentations. A more detailed inspection shows that 11 of 13 oversegmentations actually induce a compositional reading of a compound. saralāṅga "name of a pine resin", for example, is oversegmented into sarala-aṅga "pine-limb", which is the etymological derivation of this compound. In contrast, crNN creates oversegmentations such as śṛṅgavanti-aḥ, where śṛṅgavanti "having horns" (nom. pl. neuter) is a valid form, while aḥ is not an independent word form in Sanskrit. Interestingly, rcNN^split_short mis-segments the same string into śṛṅga-vantyaḥ in another sentence of the test set. Though differing from the gold analysis, this segmentation gives the correct derivational analysis of the adjective (noun śṛṅga "horn" + inflected form of the adjectivizing possessive affix -vat). The results of rcNN^split_short thus reflect the inherent inconsistencies of the dataset on the level of compound splitting (see Sec. 3.1), and its erroneous splits are frequently semantically meaningful while glossing over minute semantic distinctions. Errors of crNN, in contrast, tend to be real mis-segmentations, indicating that its ability to reflect the semantic level is underdeveloped.
Split probabilities (Sec. 4.1) have a small, but positive effect on the string accuracy of the rcNN* models. When the same model with and without split probabilities is compared using the McNemar test, split probabilities significantly increase string accuracy at the 0.1 level for rcNN and at the 0.001 level for rcNN_short, while they do not result in significantly better performance for crNN.

Comparison with Baseline Models
The lower half of Tab. 3 compares the best model introduced in this paper (rcNN^split_short) with baselines proposed for SWS in previous research. rcNN^split_short outperforms the character-based RNN described in Hellwig (2015b) by a wide margin. While Tab. 3 shows differences of almost 8% in sentence and 3% in string accuracy, Tab. 4 presents the improvements for the single surface character ā, which can correspond to a compound split (ā-) or to various vocalic Sandhis (a-a etc.).

Table 4: P, R and F for rules that produce the surface phoneme ā. Data in the left half are from the original publication (Hellwig, 2015b). As all metrics are consistently better for this paper, we refrain from highlighting the best results in the right half of the table.

For this complicated character, rcNN^split_short achieves consistent improvements of up to 15% on all metrics. We found it especially relevant that rcNN^split_short made large progress for rare rule types such as ā-ā or ā-a, indicating its increased ability for semantic generalization.
The seq2seq model (Reddy et al., 2018) performs on a similar level of accuracy as the one proposed in Hellwig (2015b). Similar to crNN (Sec. 5.2), it tends to miss splits and to insert faulty ones (e.g., dānādānaratiḥ; should: dāna-ādānaratiḥ "pleasure in giving and taking"; is: dānātanaratiḥ "from giving ... UNK"). The encoder-decoder approach discussed in Sec. 2 is evaluated using location and split prediction accuracy. The authors report 95.0 location and 79.5 split accuracy, but do not specify how they calculated these values. For this reason, and because they evaluate on isolated strings only, we cannot compare directly against their work, but report P, R and F for location and split prediction of rcNN^split_short instead. The Transformer performs almost on par with rcNN^split_short, and the differences in string accuracy are not statistically significant, although rcNN^split_short takes less time for training (2 h vs. 55 h) and inference (less than 1 min vs. 30 min when analyzing the test set). To better understand whether the systems make orthogonal errors and could therefore be used in a mixture of experts, we performed a domain-specific evaluation with 73 sentences from the Buddhist treatise Triṃśikāvijñaptibhāṣya and 104 sentences from the philosophical text Nyāyamañjarī. We preserved the non-standard orthography of both texts in order to simulate the application of the models to real-world data. This includes the presence of typos, unsolved textual problems and erratic (non-)application of Sandhi.
Both models show a significant drop in overall performance when applied to these data (see Tab. 5). This is not surprising, because the input conventions of these files do not match the conventions of the training data. Most errors again arise from disagreement about the (non-)compositional reading of technical compounds such as sarvajña-tva "all-knowing-ness" (see Sec. 3.1). It has to be noted that both models agree well in their correct decisions and in the type of errors they produce on these data. This indicates that the discrepancy in the orthographical conventions is indeed responsible for a large part of the drop in performance. Given the fact that both texts exhibit a lot of special vocabulary that is not present, or used in a very different way, in the training set, both models perform surprisingly well. A typical error common to both models is, for example, svalpam instead of su-alpam "very small". Both models have difficulties separating Sandhi in passages that do not adhere to the common practice for typesetting Indian texts in Latin transliteration: ayaṃpariṇāmaḥ, for example, was not separated into the usual form ayaṃ pariṇāmaḥ. There are certain cases of disagreement between both models that are noteworthy. While the Transformer changed the misspelled word abhupagamyate to the correct form abhyupagamyate in one case (overlooked by rcNN^split_short), rcNN^split_short correctly identified the verbal form upacaryante iti, where the Transformer inserted the semantically dispreferred, but grammatically possible present participle upacaryantaḥ iti. Overall, neither model shows a generally better or worse performance in these cases of disagreement.

Application to German Compounds
In order to test whether the character-based models generalize well to other languages with limited training resources, we applied rcNN_short with split probabilities and the same settings as for SWS to the task of splitting German compounds. The current state of the art is a CRF operating on character n-grams (Ma et al., 2016). Table 6 shows that our model achieves an improvement of about 1% in recall and accuracy when trained with the training set of Ma et al. (2016) only. We sampled 20 examples for each of the three error classes "wrong split", "wrong faulty split" and "wrong non-split" (Ma et al., 2016, 78). While our model failed to detect splits for all 20 examples of the type "wrong non-split", the type "wrong split" contained 10 cases where the split(s) proposed by the model make good sense to us, but are not recorded in the test set (e.g. "Vier-master" 'four-master' proposed, while the test set has the unsplit "Viermaster"; an inconsistency already remarked upon in previous work). We observed a similar level of inconsistencies for the "wrong faulty split" type (8 instances), where, for example, our model analyzed "Bundes-tags-vize-präsident" 'vice president of the Federal Parliament', while the test set had "Bundes-tags-vizepräsident".

Conclusion
While the models discussed in this paper produce clear performance gains compared with previous research on SWS, we expect that future research will improve over our results; it will, however, be difficult to approach error-free performance. This reservation is due to the errors in the training data and especially to the question of (non-)compositional readings of compounds, which seems to produce comparable levels of confusion for human annotators and ML models. While following this track of research, we would like to expand its scope to the joint learning of splits and of lexical and morphological annotations. Here, we expect that especially lexical and morphological analysis will benefit from a joint model. We hypothesize that CTC (Graves, 2012) trained as a co-task or segmental NNs (Lu et al., 2016) with a modified objective (including split probabilities) may be suitable for this task.