Keep It Surprisingly Simple: A Simple First Order Graph Based Parsing Model for Joint Morphosyntactic Parsing in Sanskrit

Morphologically rich languages seem to ben-eﬁt from joint processing of morphology and syntax, as compared to pipeline architectures. We propose a graph-based model for joint morphological parsing and dependency parsing in Sanskrit. Here, we extend the Energy based model framework (Krishna et al., 2020), proposed for several structured prediction tasks in Sanskrit, in 2 simple yet signiﬁ-cant ways. First, the framework’s default input graph generation method is modiﬁed to generate a multigraph, which enables the use of an exact search inference. Second, we prune the input search space using a linguistically motivated approach, rooted in the traditional grammatical analysis of Sanskrit. Our experiments show that the morphological parsing from our joint model outperforms standalone morphological parsers. We report state of the art re-sults in morphological parsing, and in dependency parsing, both in standalone (with gold morphological tags) and joint morphosyntactic parsing setting.


Introduction
Morphology and syntax are often inextricably intertwined for morphologically rich languages (MRLs). For such languages, it might be unrealistic to design dependency parsers that expect correct morphological tags to be provided as input (More et al., 2019;Bohnet et al., 2013). Jointly modelling morphological parsing (MP) with dependency parsing (DP) has shown to be effective for several MRLs (More et al., 2019). In this work, we present multigraph-EBM (MG-EBM), a joint model for morphosyntactic parsing, i.e. joint MP and DP, in Sanskrit.
Morphosyntactic parsing has been successfully applied to several MRLs. Bohnet et al. (2013) proposed a transition based joint parser, extending the * Work done while at IIT Kharagpur joint POS tagger and dependency parser of Bohnet and Nivre (2012). Similarly Seeker and Ç etinoglu (2015) proposed a joint graph based parser for Turkish. Here, two different models, one predicting a morphological path and the other a dependency tree, are made to reach an agreement using dual decomposition. More et al. (2019) proposed a transition based joint parser for Hebrew, where it aims to maximise a global score over both morphological and dependency transitions.
Sanskrit is an MRL which shows high degree of syncretism and homonymy in its morphological paradigm. About 90.96 % of the tokens in a dataset of 115,000 Sanskrit sentences (Krishna et al., 2017) show syncretism with an average of 3.62 morphological tags per token. Morphological features, especially case, are indicative of the syntactic roles that a word (nominal) can assume in a sentence. The interplay of morphological markers and syntactic roles has been formalised in the traditional grammatical analysis of Sanskrit, Ashtādhyāyī (Pān . ini, 500 BCE; Ramkrishnamacharyulu, 2009). Here, the joint modelling of syntactic and morphological information can help disambiguate each other (Tsarfaty, 2006). In MG-EBM, we use this information to prune the input search space. Krishna et al. (2020) proposed an energy basedmodel (EBM) framework for multiple structured prediction tasks in Sanskrit. For all the models under EBM, the input search space is a graph which considers every unique morphological analysis of the input words to be a separate node (Figure 1a). Modeling morphosyntactic parsing over this input graph requires an approximation algorithm for inference as it needs to predict a structure containing only a subset of the nodes. We propose to modify the input space to be a multigraph where the number of nodes correspond to the number of tokens in a sentence. This enables us to use an exact search inference (Edmonds, 1967) (2005). MG-EBM achieves state of the art results (SOTA), improving the previous best results by 3 F-score points for MP and 2 UAS points for standalone DP (expects gold morph tags as input). We set new SOTA results for joint MP and DP. Further, we demonstrate that MP results obtained from our joint model outperforms standalone MP models. Our proposed pruning approach in itself report an improvement of 6 F-score (2 UAS) points with the original EBM configuration for MP (DP), and a further 2 F-Score (2 UAS) points improvement with the multigraph formulation for MP (DP). Krishna et al. (2020) proposed an Energy based model (EBM) framework (LeCun et al., 2006) for multiple structured prediction tasks in Sanskrit. The framework is a generalisation of the joint word-segmentation (WS) and morphological parsing (MP) model by Krishna et al. (2018). The models under this framework are trained using multilayer perceptrons and are essentially first-order arc-factored graph-based parsing models. The dependency parsing (DP) model, similar to McDonald et al. (2005), makes use of a sequence level max-margin loss (Taskar et al., 2003) and Chu-Liu-Edmonds algorithm for the inference. However, the feature function for the task is learnt automatically, and differs from that of McDonald et al. (2005). A lexicon-driven shallow parser (Goyal and Huet, 2016;Huet, 2005) is used to enumerate all the morphological analyses, including cases of syncretism and homonymy, for the tokens in the input sequence. An input graph is constructed from this analysis, as shown in Figure 1a for the sequence "śriyah . patih .ś rīmati". 1 Here every node is a unique combination of three entities, namely, surface-form, stem and morphological tag. All the node pairs, which are not suggested as alternative solutions and hence can co-occur in a predicted solution, form an edge.

Energy Based Model
The framework uses an automated feature learning approach (Lao and Cohen, 2010;Meng et al., 2015) to generate a feature function consisting of 850 features. Using this feature set the framework achieves state of the art (SOTA) results in several tasks. This is significant, given several morphologically rich languages still rely on models that use hand-crafted features for SOTA results (More et al., 2019;Seeker and Ç etinoglu, 2015). Given the models are arc-factored, the edges are featurised. A feature would consider only one entity each from either of the nodes in the edge. The feature then calculates the distributional information between these entities conditioned on some specific morphological constraint. The type of the entities and the constraints, which constitute the features, are automatically learned as typed paths over a large morphologically tagged corpus. While the training and the feature function remain the same for all the tasks under the framework, the inference is task specific. It searches for a spanning tree with minimum energy (Edmonds, 1967) for DP. For tasks that require prediction of a subset of nodes from the input graph, such as WS and MP, and standalone MP, approximation algorithms were used for inference. Here the inference procedure searches for only a small percentage of all the possible candidates (< 1%) (Krishna et al., 2018).

MG-EBM: The Proposed Model
Multigraph-EBM (MG-EBM) extends the EBM framework in two simple yet significant ways.
Multigraph formulation: Instead of the 'one node per unique morphological analysis', as shown in Figure 1a, we propose to use a 'one node per inflected surface-form' multigraph representation, as shown in Figure 1b. For instance, the surface formśriyah . , due to syncretism, has 2 possible morphological analyses M1 and M2. In Figure 1a, these analyses are represented as separate nodes and are connected to the only analysis of patih . via the edges e and g. In Figure 1b, the cases of syncretism forśriyah . are merged as a single node, though the edges e and g to patih . are retained. This leads to a multigraph formulation. 2 The new representation retains all the edges, and their feature vectors, present in the original representation. The design of our feature function guarantees that every edge will have a unique feature vector. With this formulation, we simplify the search problem for the joint MP and DP task to that of searching for the spanning tree with minimum energy. This enables the use of the exact search Edmonds-Chu-Liu MST algorithm (Edmonds, 1967), rather than an approximation algorithm, for inference. It is straightforward to extend the algorithm to multigraph, as we just need to retain only the minimum energy edge and prune out all the other edges between a pair of nodes in the input graph (McDonald and Satta, 2007).
Linguistically Motivated Pruning: Linguistic constraints based on the traditional grammatical analysis and verbal cognition in Sanskrit (Kulkarni and Ramakrishnamacharyulu, 2013;Ramkrishnamacharyulu, 2009) have been previously employed in various deterministic dependency parsers for Sanskrit (Kulkarni et al., 2019;Kulkarni, 2013;Kulkarni et al., 2010). We use these constraints to prune the edges in our input graph. During pruning, we first exhaustively enumerate all the unlabelled directed spanning-trees in the input graph using Mayeda and Seshu (1965 assigned at least one label as per the rules of the grammar, it will be considered a valid candidate. Finally, all the edges which are not part of even one valid candidate-tree will be pruned from the input graph. This pruned unlabelled directed graph serves as the input for the inference procedure. This linguistically informed pruning can at best be seen as a rule-based deterministic delexicalised dependency parsing, which considers information only from the morphological tags. Morphology can signal dependency in various ways. The morphological marker of a word may not only index the properties of the word itself, but it may also index the agreement between its head or dependants (Nichols, 1986). The agreement between the subject and verb in terms of the number and person respectively is one such case. Similarly, the case, number, and gender agreement between the words in an adjectival modifier (viśes . an . a) relation is another example of this in Sanskrit. Further, morphological markers are indicative not only of the presence of syntactic dependency between the words in a sentence, but also of the type of the syntactic dependency shared between them (Nichols, 1986;Seeker and Kuhn, 2013). In Sanskrit, the case information of a nominal narrows down the possible relations it can have with a verb as the head. This is shown in Figure 2. We form constraints based on these morphosyntactic information and use it for pruning the edges. 4 The dependency relations shown in Figure 2   The EBM variants and the impact of these variations would be elaborated in Section 5. For the joint morphosyntactic setting, we propose DCST++ as a neural baseline. DCST++ is our augmentation over DCST which integrates encoder outputs from a neural morphological tagger (Gupta et al., 2020) by a gating mechanism (Sato et al., 2017). 5 Metric: All the results we report are macro averaged at a sentence level. For DP, we use UAS and LAS and for MP, we use F-Score. For joint MP and DP, all the EBM models other than MG-EBM may predict a tree that has a different vertex set than that of the ground truth. Since UAS cannot be used here, we use (unlabelled and labelled) F-Score for those systems. For MG-EBM UAS/LAS and Unlabelled/Labelled F-Score would be the same. Dataset 6 : We use a test set of 1,300 sentences, 5 Refer to the supplementary material §2 for more experiments with this model 6 The dataset can be downloaded from http://bit.
where 1,000 come from the Sanskrit Tree Bank Corpus (Kulkarni, 2013, STBC) and 300 from Sisupāla-vadha, a work from classical Sanskrit poetry (Ryali, 2016). 1,500 and 1,000 sentences from STBC, other than the ones in test data, were used as the training and validation data respectively for DCST, DCST++, and BiAFF. However all the EBM models and YAP were trained on 12,320 sentences obtained by augmenting the training data in STBC (Krishna et al., 2020, §4.1). 7

Results
MG-EBM achieves the state of the art (SOTA) results in MP and in DP, both in standalone (with gold morphological tags as input) and joint morphosyntactic parsing setting. Table 1a shows that MG-EBM reports a 2 point improvement in UAS as compared to T-EBM*, the previous SOTA model for DP in Sanskrit. Both MG-EBM and T-EBM* differ only in terms of how the pruning of the input graph is performed. The pruning decisions in T-EBM* are made by considering a maximum of 3 nodes at a time, instead of the context from the entire tree. However, MG-EBM considers a tree in its entirety for applying the constraints. This leads to more than 300 fold reduction in the number of possible candidates for MG-EBM as compared to T-EBM*, with just about 40 % increase in wall time (on test data). 8 Since, both the models use the same label predictor (Krishna et al., 2020), they perform similar, with a small improvement of 0.77 points for MG-EBM . Table 1b shows that MG-EBM outperforms C-EBM, the previous SOTA model for morphologily/KISSData 7 BiAFF, DCST and DCST++ performed worse, when used with the sentences from the augmented training data. 8 The increase is due to the use of the spanning tree enumeration algorithm by Mayeda and Seshu (1965). cal parsing. Similarly MG-EBM achieves SOTA results for DP in the joint setting, followed by DCST++. In the joint setting, gold morphological tags are not provided as input. All the EBM models, other than MG-EBM, use the one node per analysis (Figure 1a) input formulation and approximation algorithms for inference. For morphological parsing, the inference in C-EBM searches for a maximal clique, considering pairwise interaction between all the nodes in the clique, while P-EBM searches for a Steiner Tree. Both JP-EBM* and JP-EBM-Prune extend P-EBM for joint morphosyntactic parsing, by introducing linguistically informed pruning. JP-EBM* uses the same pruning approach as T-EBM*, while JP-EBM-Prune uses our proposed pruning approach. The models report a 5 point and 11 point F-Score increase respectively for morphological parsing as compared to P-EBM. In fact, JP-EBM-Prune outperforms C-EBM. MG-EBM and JP-EBM-Prune use the same pruning approach proposed in this work. They differ in terms of the input space formulation and as a consequence, MG-EBM uses an exact search inference. This difference has led to nearly 2 point increase in both UAS and F-Score, and a 3 Point increase in LAS between both. YAP (More et al., 2019), the SOTA joint morphosyntactic parser proposed originally for Hebrew can perform joint prediction. However it is observed that YAP's performance would typically degrade in the joint setting as compared to its performance in the standalone setting (with gold-morphological tag; Table 1a). All the joint models for morphological parsing and DP outperform YAP even when YAP uses gold morphological tags. Finally, all the joint models for morphological parsing and DP outperform the pipeline EBM model C-EBM + T-EBM*, which validates that joint morphosyntactic parsing benefits an MRL like Sanskrit than a pipeline model.

Conclusion
In this work, we proposed MG-EBM, a model for joint morphological parsing and DP in Sanskrit. It extends the EBM framework from Krishna et al. (2020) by 1) incorporating a linguistically motivated pruning approach resulting in a substantial reduction in the input search space, and 2) modifying the input graph formation to a multigraph resulting in the use of Edmonds-Chu-Liu algorithm (Edmonds, 1967), an exact search algorithm, as inference. While the multigraph formulation is language agnostic the linguistically motivated pruning is rooted on the grammatical tradition of Sanskrit. Experiments validate that the joint morphosyntacticparsing hypothesis, i.e., morphological information can benefit syntactic disambiguation and vice versa (Tsarfaty, 2006), holds true for Sanskrit. We find that the MG-EBM reports state of the art results (SOTA) for morphological parsing, outperforming standalone morphological parsing models, similar to what is observed for Hebrew (More et al., 2019). Further, all the joint morphological parsing and DP variants of EBM, we experimented here, result in a superior performance than the pipeline morphological parsing and DP EBM model. We also establish SOTA results in Sanskrit for DP, both in standalone and joint setting.