A Simple and Effective Dependency Parser for Telugu

We present a simple and effective dependency parser for Telugu, a morphologically rich, free word order language. We propose to replace the rich linguistic feature templates used in past approaches with a minimal feature function using contextual vector representations. We train a BERT model on the Telugu Wikipedia data and use vector representations from this model to train the parser. Each sentence token is associated with a vector representing the token in the context of that sentence, and the feature vectors are constructed by concatenating two token representations from the stack and one from the buffer. We put the feature representations through a feedforward network and train with a greedy transition based approach. The resulting parser has a very simple architecture with minimal feature engineering and achieves state-of-the-art results for Telugu.


Introduction
Dependency parsing is extremely useful for many downstream tasks. However, robust dependency parsers are not available for several Indian languages. One reason is the unavailability of annotated treebanks. Another reason is that most of the existing dependency parsers for Indian languages use hand-crafted features based on linguistic information like part-of-speech and morphology (Kosaraju et al., 2010; Bharati et al., 2008; Jain et al., 2012), which are very expensive to annotate. Telugu is a low resource language and there has not been much recent work on parsing it. Most of the previous work on Telugu dependency parsing has focused on rule based systems or data-driven transition based systems. This paper focuses on improving feature representations for a low resource, agglutinative language like Telugu using recent developments in NLP such as contextual vector representations.
Contextual word representations (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019) are derived from a language model, and each word can be uniquely represented based on its context. One such model is BERT (Devlin et al., 2019). BERT vectors are deep bidirectional representations pre-trained by jointly conditioning on both the left and right context of a word and have been shown to perform better on a variety of NLP tasks.
In this paper, we use BERT representations for parsing Telugu. We replace the rich hand-crafted linguistic features with a minimal feature function using a small number of contextual word representations. We show that for a morphologically rich, agglutinative language like Telugu, just three word features with good quality vector representations can effectively capture the information required for parsing. We put the feature representations through a feed forward network and train using a greedy transition based parser (Nivre, 2004, 2008). Past work on Telugu dependency parsing has only focused on predicting inter-chunk dependency relations (Kosaraju et al., 2010; Kesidi et al., 2011; Kanneganti et al., 2016, 2017; Tandon and Sharma, 2017). In this paper, we also report parser accuracies on the intra-chunk annotated Telugu treebank for the first time.

Related Work
Extensive work has been done on dependency parsing in the last decade, especially due to the CoNLL shared tasks on dependency parsing. Creation of Universal Dependencies (Nivre et al., 2016) led to an increased focus on common approaches to parsing several different languages. There were new transition based approaches making use of more robust input representations (Chen and Manning, 2014; Kiperwasser and Goldberg, 2016) and improved network architectures (Ma et al., 2018).
Graph based approaches to dependency parsing have also become more common over the last few years (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017, inter alia).
However, there has not been much recent work on parsing Indian languages, and much less on Telugu. Most of the previous work on Telugu dependency parsing has focused on rule based systems (Kesidi et al., 2011) or data-driven transition based systems (Kanneganti et al., 2016) using Malt parser (Nivre et al., 2006). The Malt parser uses a classifier to predict the transition operations, taking a feature template as input. The feature templates used in Telugu parsers commonly consist of several hand-crafted features like words, their part-of-speech tags, gender, number and other morphological features (Kosaraju et al., 2010; Kanneganti et al., 2016). There has been some work on representing these linguistic features using dense vector representations in a neural network based parser (Tandon and Sharma, 2017).
Recent developments in the field of NLP led to the arrival of contextual word vectors (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019) and their extensive use in downstream NLP tasks, from POS tagging (Peters et al., 2018) to more complex tasks like Question Answering and Natural Language Inference (Devlin et al., 2019). Contextual vectors have also been applied to dependency parsing systems. The top-ranked system in the CoNLL-2018 shared task on Dependency Parsing (Che et al., 2018) used ELMo representations along with conventional word vectors in a graph based parser. Kulmizev et al. (2019) and Kondratyuk and Straka (2019) use contextual vector representations for multilingual dependency parsing.
In this paper, we train a BERT-baseline model (Devlin et al., 2019) on Telugu Wikipedia data and use these vector representations to improve Telugu dependency parsing.

Telugu Dependency Treebank
We use the Telugu treebank made available for the ICON 2010 tools contest. We extend this treebank by another 900 sentences from the HCU Telugu treebank. The size of the combined treebank is around 2400 sentences. The treebank is annotated using the Computational Paninian grammar (Bharati et al., 1995; Begum et al., 2008) proposed for Indian languages. The treebank is annotated at the inter-chunk level (Bharati et al., 2009) in SSF (Bharati et al., 2007) format. Only chunk heads in a sentence are annotated with dependency labels. We automatically annotate the intra-chunk dependencies (Bhat, 2017) using a Shift-Reduce parser based on Context Free Grammar rules within a chunk, written for Telugu 1. Annotating the intra-chunk dependencies provides a complete parse tree for each sentence.

Our Approach
We propose to replace the rich hand-crafted feature templates used in Malt parser systems with a minimally defined feature set which uses automatically learned word representations from BERT. We do not make use of any additional pipeline features like POS or morphological information, assuming this information is captured within the vectors. We train a BERT baseline model (Devlin et al., 2019) on the Telugu Wikipedia data, which comprises 71289 articles. We use the ILMT tokenizer included in the Telugu shallow parser 3 to segment the data into sentences. The sentence segmented data consists of approximately 2.6M sentences. We convert all of the data from UTF to WX 4 notation for faster processing. We use byte-pair encoding (Sennrich et al., 2016) to tokenize the data and generate a vocabulary file. We pass this vocabulary file to BERT 5 for pre-training. After pre-training, we extract contextual token representations for all the sentences in the treebank from the pre-trained BERT model. When a single word is split into multiple tokens, we treat these tokens as a continuous bag of words and add the representations of all the tokens in the word to obtain the word representation. We find that this approach works better than considering only the first word-piece vector (Kulmizev et al., 2019; Kondratyuk and Straka, 2019). We use these word representations as input features to the parser. Our feature function is a concatenation of a small number of BERT vectors, and we integrate it into a transition based parser. The specific details are given in Section 4.2.
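The summation of sub-word vectors into a word vector can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name pool_subwords and the word_ids alignment list (mapping each sub-word token to the index of the word it came from) are hypothetical, and plain Python lists stand in for the 768-dimensional BERT vectors.

```python
from typing import List, Sequence


def pool_subwords(token_vectors: Sequence[Sequence[float]],
                  word_ids: Sequence[int]) -> List[List[float]]:
    """Sum the vectors of all sub-word tokens that belong to the same word.

    token_vectors[i] is the contextual vector of sub-word token i.
    word_ids[i] is the 0-based index of the word that token i is part of
    (a hypothetical alignment produced while applying BPE).
    """
    if len(token_vectors) != len(word_ids):
        raise ValueError("need exactly one word id per token vector")
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    n_words = max(word_ids) + 1
    # Start every word vector at zero, then add each sub-word vector to it.
    word_vectors = [[0.0] * dim for _ in range(n_words)]
    for vector, word_id in zip(token_vectors, word_ids):
        for k, value in enumerate(vector):
            word_vectors[word_id][k] += value
    return word_vectors


# Three sub-word tokens; the first two are pieces of word 0.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(pool_subwords(tokens, [0, 0, 1]))
```

Summing keeps the word vector at the same dimension as a single token vector, so the parser input size does not depend on how many pieces a word is split into.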

Transition based Dependency Parsing
Transition based parsers process a sentence sequentially and treat parsing as a sequence of actions that produce a parse tree. They predict a sequence of transition operations starting from an initial configuration to a terminal configuration, and construct a dependency parse tree in the process. A configuration consists of a stack, an input buffer of words, and a set of relations representing a dependency tree. These parsers use a classifier to predict the next transition operation based on a set of features derived from the current configuration. Two widely used transition systems are Arc-standard (Nivre, 2004) and Arc-eager (Nivre, 2008). We make use of the Arc-standard transition system in our parser and briefly describe it here.

Arc-standard Transition System
In the arc-standard system, a configuration consists of a stack, a buffer, and a set of dependency arcs. The initial configuration for a sentence s = w_1, ..., w_n consists of stack = [ROOT], buffer = [w_1, ..., w_n] and dependencies = []. In the terminal configuration, buffer = [] and stack = [ROOT], and the parse tree is given by dependencies. The root node of the parse tree is attached as the child of ROOT. The arc-standard system defines three types of transitions that operate on the top two elements of the stack and the first element of the buffer:
• LEFT-ARC: Adds a head-dependent relation between the word at the top of the stack and the word below it and removes the lower word from the stack.
• RIGHT-ARC: Adds a head-dependent relation between the second word on the stack and the top word and removes the top word from the stack.
• SHIFT: Moves the word from the front of the buffer onto the stack.
In the labeled version of parsing, there are a total of 2ℓ + 1 transitions, where ℓ is the number of different dependency labels. There is a left-arc and a right-arc transition corresponding to each label. For example, the left-arc transition with label vmod adds a head-dependent relation between the top two words of the stack (s_0, s_1) with label vmod: dependencies = [(s_0, s_1, vmod), ...].
5 https://github.com/google-research/bert
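The three transitions can be sketched in a few lines of Python. This is an illustrative sketch rather than the parser's actual implementation: the class and function names are hypothetical, words are represented by their positions 1..n with 0 standing for ROOT, and the rule that a word is attached to ROOT only once the buffer is empty is an assumption made here to guarantee a single-rooted tree.

```python
from typing import List, Optional, Tuple

ROOT = 0  # index of the artificial ROOT node; words are numbered 1..n

Arc = Tuple[int, int, Optional[str]]  # (head, dependent, label)


class Configuration:
    """Arc-standard parser state: stack, buffer and the arcs built so far."""

    def __init__(self, n_words: int) -> None:
        self.stack: List[int] = [ROOT]
        self.buffer: List[int] = list(range(1, n_words + 1))
        self.arcs: List[Arc] = []

    def is_terminal(self) -> bool:
        return not self.buffer and self.stack == [ROOT]

    def valid_transitions(self) -> List[str]:
        valid = []
        if self.buffer:
            valid.append("SHIFT")
        if len(self.stack) >= 2:
            below_top = self.stack[-2]
            # ROOT can never become a dependent, so LEFT-ARC needs a word below the top.
            if below_top != ROOT:
                valid.append("LEFT-ARC")
            # Assumption: a word is attached to ROOT only after the buffer is empty.
            if below_top != ROOT or not self.buffer:
                valid.append("RIGHT-ARC")
        return valid

    def apply(self, transition: str, label: Optional[str] = None) -> None:
        if transition not in self.valid_transitions():
            raise ValueError("invalid transition: " + transition)
        if transition == "SHIFT":
            self.stack.append(self.buffer.pop(0))
        elif transition == "LEFT-ARC":
            # Top of stack is the head; the word below it is removed as dependent.
            dependent = self.stack.pop(-2)
            self.arcs.append((self.stack[-1], dependent, label))
        else:  # RIGHT-ARC
            # Second word is the head; the top word is removed as dependent.
            dependent = self.stack.pop()
            self.arcs.append((self.stack[-1], dependent, label))


def run_transitions(n_words: int,
                    steps: List[Tuple[str, Optional[str]]]) -> Configuration:
    """Apply a sequence of (transition, label) pairs from the initial state."""
    config = Configuration(n_words)
    for transition, label in steps:
        config.apply(transition, label)
    return config


# A verb-final three-word sentence: word 3 heads words 1 (k1) and 2 (k2).
steps = [("SHIFT", None), ("SHIFT", None), ("SHIFT", None),
         ("LEFT-ARC", "k2"), ("LEFT-ARC", "k1"), ("RIGHT-ARC", "root")]
final = run_transitions(3, steps)
print(final.arcs, final.is_terminal())
```

Each arc is stored as (head, dependent, label), matching the (s_0, s_1, vmod) triple used in the text. The label string "root" for the ROOT attachment is a placeholder.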

Feature Function
We use a minimally defined feature set consisting solely of word representations obtained from BERT. We do not incorporate any part-of-speech or morphological information separately. The intuition is that such information is already captured within the BERT representations. Our feature set consists of word representations of the top two elements of the stack (s_0, s_1) and the first element of the buffer (b_0). We compute a feature vector f by concatenating (•) the vector representations of all the words in the feature set, f = v_{s_0} • v_{s_1} • v_{b_0}, where v_i is the vector representation of the word i.
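A minimal sketch of this feature function is given below. Several details are assumptions made for illustration and are not stated in the text: the function name extract_features, the concatenation order (s_0, s_1, b_0), the use of a zero vector when a position is empty, and the presence of a vector for ROOT at index 0.

```python
from typing import Dict, List, Optional, Sequence


def extract_features(stack: Sequence[int],
                     buffer: Sequence[int],
                     vectors: Dict[int, List[float]],
                     dim: int) -> List[float]:
    """Concatenate the vectors of s_0, s_1 and b_0 into one feature vector.

    stack and buffer hold word indices; vectors maps a word index to its
    word representation. Empty positions are filled with zeros (assumption).
    """
    padding = [0.0] * dim

    def lookup(index: Optional[int]) -> List[float]:
        return vectors[index] if index is not None else padding

    s0 = stack[-1] if len(stack) >= 1 else None   # top of the stack
    s1 = stack[-2] if len(stack) >= 2 else None   # word below the top
    b0 = buffer[0] if len(buffer) >= 1 else None  # front of the buffer
    return lookup(s0) + lookup(s1) + lookup(b0)


# Toy 2-dimensional vectors; index 0 plays the role of ROOT.
toy_vectors = {0: [0.5, 0.5], 1: [1.0, 1.0], 2: [2.0, 2.0]}
print(extract_features([0, 1], [2], toy_vectors, dim=2))
```

With 768-dimensional BERT vectors, the resulting feature vector has 3 × 768 = 2304 dimensions regardless of sentence length.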

Classifier
We use a fully connected feed forward network with one hidden layer with ReLU activation to score all the possible parser transitions. The next transition is predicted based on the features extracted from the current configuration. We compute the scores of all transitions, transition_scores(f) = W_2 · relu(W_1 · f + b_1) + b_2, where f is the feature vector obtained from the current configuration. A softmax layer is applied over the transition scores to get the probability distribution. We pick a valid transition with the highest probability. We use a dropout layer with probability 0.2 for regularization.
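The scoring and greedy selection steps can be illustrated with a small pure-Python forward pass. This is a sketch only: the function names are hypothetical, the weights are toy values rather than trained parameters, and dropout is omitted because it applies only during training.

```python
import math
from typing import Iterable, List, Sequence

Matrix = Sequence[Sequence[float]]


def affine(W: Matrix, x: Sequence[float], b: Sequence[float]) -> List[float]:
    """Compute W · x + b for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias
            for row, bias in zip(W, b)]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, value) for value in x]


def softmax(scores: Sequence[float]) -> List[float]:
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transition_probs(f: Sequence[float], W1: Matrix, b1: Sequence[float],
                     W2: Matrix, b2: Sequence[float]) -> List[float]:
    """softmax(W_2 · relu(W_1 · f + b_1) + b_2)"""
    return softmax(affine(W2, relu(affine(W1, f, b1)), b2))


def pick_transition(probs: Sequence[float], transitions: Sequence[str],
                    valid: Iterable[str]) -> str:
    """Return the valid transition with the highest probability."""
    allowed = set(valid)
    candidates = [(p, t) for p, t in zip(probs, transitions) if t in allowed]
    if not candidates:
        raise ValueError("no valid transition available")
    return max(candidates)[1]


# Toy example: 2 input features, 2 hidden units, 3 unlabeled transitions.
transitions = ["SHIFT", "LEFT-ARC", "RIGHT-ARC"]
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 0.0]
probs = transition_probs([1.0, -1.0], W1, b1, W2, b2)
print(pick_transition(probs, transitions, valid=["SHIFT", "RIGHT-ARC"]))
```

In the toy example LEFT-ARC receives the highest score, but it is excluded because it is not valid in the assumed configuration, so the next best valid transition is chosen.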

Experiments and Results
The Telugu dependency treebank is quite small, consisting of only 2400 sentences. We also observe that the sentence length and quality of annotation in the treebank are not uniform and vary considerably. We therefore evaluate our parser on the treebank using ten-fold cross-validation. We report the cross-validation accuracies on both inter-chunk (Table 2) and intra-chunk (Table 1) annotated treebanks. Parser accuracies on the intra-chunk annotated Telugu treebank are reported for the first time in this paper. The overall parser accuracies improve on the intra-chunk annotated treebank. We compare these results with a baseline using only word2vec word representations, subsequently adding the part-of-speech (POS) and suffix representations described in Tandon and Sharma (2017). We also try to reproduce the experiments of Tandon and Sharma (2017) on both inter-chunk and intra-chunk annotated treebanks. Tandon and Sharma (2017) report their best results for Telugu on the inter-chunk annotated treebank using word, POS and suffix representations. Their results are reported on a test set, and since their exact dataset is not available, we report average 10-fold cross-validation accuracies. The reproduced results are listed in Table 3. As can be seen from the table, the average cross-validation accuracies are lower. The discrepancy between rows 3 and 4 is because of a larger feature set and a different optimizer. Tandon and Sharma (2017) use 13 features from the parse configuration instead of our three; the additional features introduce unnecessary noise when the average sentence length is as small as five. We also find that the Adam optimizer performs better than the Adagrad optimizer used in their setup.
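The evaluation protocol can be sketched as below. This is an illustration under assumptions, not the evaluation script used for the reported numbers: the round-robin assignment of sentences to folds, the function names, and the three reported quantities (unlabeled attachment, label, and labeled attachment score) are choices made here.

```python
from typing import Iterator, List, Sequence, Tuple

Token = Tuple[int, str]  # (head index, dependency label) of one word


def k_fold_splits(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds.

    Items are assigned to folds round-robin (an assumption; the text does
    not say how the folds were built).
    """
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        train = [i for m, fold in enumerate(folds) if m != held_out for i in fold]
        yield train, folds[held_out]


def attachment_scores(gold: Sequence[Token],
                      predicted: Sequence[Token]) -> Tuple[float, float, float]:
    """Return (UAS, LS, LAS) as fractions of tokens.

    UAS: correct head; LS: correct label; LAS: correct head and label.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and aligned")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n
    ls = sum(g[1] == p[1] for g, p in zip(gold, predicted)) / n
    las = sum(g == p for g, p in zip(gold, predicted)) / n
    return uas, ls, las


# Ten folds over a treebank of 2400 sentences: 240 test sentences per fold.
for train_ids, test_ids in k_fold_splits(2400, k=10):
    assert len(test_ids) == 240 and len(train_ids) == 2160

gold = [(2, "k1"), (0, "root"), (2, "k2")]
pred = [(2, "k2"), (0, "root"), (1, "k2")]
print(attachment_scores(gold, pred))
```

Averaging each score over the ten held-out folds gives the cross-validation accuracy.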
Implementation details: The parser comprises a simple feed forward neural network with one hidden layer of 1024 hidden units, a ReLU activation function, and a dropout layer with dropout probability of 0.2. We use Xavier uniform initialization (Glorot and Bengio, 2010) to initialize the network parameters and the Adam optimizer (Kingma and Ba, 2015) with the default momentum and learning rate provided by PyTorch. We use the BERT baseline model for pre-training, and each BERT token representation is of dimension 768.
Arc-standard vs Arc-eager: We experiment with both Arc-standard (Nivre, 2004) and Arc-Eager (Nivre, 2008) transition systems and find that Arc-standard works better in our case (Table 4). We use Arc-standard transition system in all further experiments.
Feature Function: We experiment with different feature sets and find that using just three features, the top two elements of the stack and the top-most element of the buffer, results in the highest accuracies. Extending the feature set to include more elements from the stack or buffer causes the accuracies to fall. Parser accuracies using different feature sets are reported in Table 5. Peters et al. (2018) and Che et al. (2018) suggest that concatenating conventional word vectors with contextual word vectors could result in a boost in accuracies. We try this by concatenating word2vec vectors with BERT vectors and observe some improvement in label scores. The results are reported in Table 6.
BERT layers: We also experiment with vector representations from different layers of BERT. The results are reported in Table 7. We find that the 4th layer from the top of our BERT baseline model results in the highest accuracy for the parser. This finding is consistent with the work of Tenney et al. (2019), which suggests that dependencies are better captured between layers 6 and 9. We use the vector representations from the 4th layer from the top in all our experiments.
... is encoded in the suffixes. Intuitively, segmenting the words from right to left (inverse-BPE) could lead to linguistically better word segments. We test this assumption (Table 8). We use 60k merge operations in both cases. Inverse-BPE leads to slightly better unlabeled attachment scores but causes a slight drop in label scores.

Error Analysis
In this section we look at some of the most common errors made by this parser and try to understand why those errors might be occurring. We evaluate the parser on a test set of 240 sentences. The most frequently occurring errors are mismatches between k1 (agent/subject) and k2 (object/patient): k1 is labeled as k2 and vice versa. k1 and k2 are the most frequently occurring labels after ROOT. 78% of the sentences in the test set contain a k1 dependency and 50% contain a k2 dependency. k1 is labeled as k2 15% of the time and k2 is labeled as k1 18% of the time. These errors are usually seen when the words occur without case-markers. In these cases, k1 and k2 can be distinguished by looking at the verb agreement. Fixing these two errors would greatly improve the parser.
Other frequently occurring errors are confusions between k2 and k4 (recipient), since they sometimes take the same case-markers, and between the labels nmod and nmod adj, and vmod and adv, sent adv. The label vmod is ambiguous in general and can be easily confused with adverbs.
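Confusion rates of the kind quoted above (for example, how often k1 is labeled as k2) can be computed from aligned gold and predicted labels. The sketch below is illustrative: the function name and the toy label sequences are hypothetical and do not reproduce the reported figures.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def confusion_rates(gold_labels: Sequence[str],
                    predicted_labels: Sequence[str]) -> Dict[Tuple[str, str], float]:
    """Map (gold, predicted) label pairs to the fraction of gold occurrences.

    Only mismatching pairs are returned, e.g. ("k1", "k2") is the share of
    gold k1 dependencies that the parser labeled as k2.
    """
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label sequences must be aligned")
    pair_counts = Counter(zip(gold_labels, predicted_labels))
    gold_counts = Counter(gold_labels)
    return {(g, p): count / gold_counts[g]
            for (g, p), count in pair_counts.items() if g != p}


# Toy data: one of four gold k1 labels and one of two gold k2 labels are swapped.
gold = ["k1", "k1", "k1", "k1", "k2", "k2"]
pred = ["k1", "k2", "k1", "k1", "k1", "k2"]
print(confusion_rates(gold, pred))
```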

Conclusion and Future Work
We present a simple yet effective dependency parser for Telugu using contextual word representations. We demonstrate that even with vectors trained on a small corpus of 2.6M sentences, we can reduce the need for explicit linguistic features in deep learning based models. We show based on the results of the parser that BERT vectors effectively capture much of the linguistic information required for parsing. We also show that with good vector representations, a small feature set is more effective for a morphologically rich, agglutinative language like Telugu.
Future work could include finding a way to incorporate other linguistic features like case-markers, gender, number, person, tense, aspect and verb agreement information into the parser.