Tree-Stack LSTM in Transition Based Dependency Parsing

We introduce tree-stack LSTM to model the state of a transition-based parser with recurrent neural networks. Tree-stack LSTM does not use any parse-tree-based or hand-crafted features, yet performs better than models with these features. We also develop a new set of embeddings from raw features to enhance performance. The model has four main components: the stack's σ-LSTM, the buffer's β-LSTM, the actions' LSTM, and the tree-RNN. All LSTMs use continuous dense feature vectors (embeddings) as input, and the tree-RNN updates these embeddings based on transitions. We show that our model improves performance on low-resource languages compared with its predecessors. We participated in the CoNLL 2018 UD Shared Task as the "KParse" team and ranked 16th in the LAS metric and 15th in the BLAS and BLEX metrics, among 27 participants parsing 82 test sets from 57 languages.


Introduction
Recent studies in neural dependency parsing (Chen and Manning, 2014) create an opportunity to learn feature conjunctions from primitive features alone. A designer only needs to extract primitive features that may be useful for taking parsing actions. However, extracting primitive features from the state of a parser remains critical. On the other hand, the representational power of recurrent neural networks should allow a model to summarize both every action taken from the beginning to the current state and the tree fragments obtained up to the current state.
We propose a method to concisely summarize previous actions and tree fragments within the current word embeddings. We employ word and context embeddings from Kırnap et al. (2017) as initial representations. Our model modifies these embeddings based on parsing actions, so that they summarize child-parent relationships. Finally, we test our system in the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.
The rest of the paper is organized as follows: Section 2 summarizes related work in neural transition-based dependency parsing, Section 3 describes the models we implement for tagging, lemmatization, and dependency parsing, Section 4 discusses our results, and Section 5 presents our contributions.

Related Work
In this section we describe the related work done in neural transition based dependency parsing and morphological analysis.

Morphological Analysis and Tagging
Finite-state transducers (FSTs) played an important role in earlier morphological analyzers (Koskenniemi, 1983). Unlike modern neural systems, these analyzers are language-dependent, rule-based systems. Morphological tagging, on the other hand, tries to solve the tagging and analysis problems in a single stage: conditional random field (CRF) based models and the neural network architectures of Heigold et al. have been proposed to solve tagging and analysis jointly. Modern systems are heavily based on word- and context-based features, which we explain in the following paragraph.

Embedding Features
Chen and Manning and Kiperwasser and Goldberg use pre-trained word embeddings and randomly initialized part-of-speech (POS) embeddings. Ballesteros et al. use character-based word representations for the stack-LSTM parser. Alberti et al. take an end-to-end approach for both word and POS embeddings: one component of their model is responsible for generating POS embeddings and another for generating word embeddings.

Decision Module
We name the part of our model that maps features to transitions the decision module. The decision module is a neural architecture designed to find the best feature conjunctions.

Model
In this section, we describe MorphNet (Dayanık et al., 2018), used for tagging and lemmatization, and tree-stack LSTM, used for dependency parsing. We train these models separately. MorphNet employs UDPipe (Straka et al., 2016) for tokenization to generate a CoNLL-U formatted file with the head and dependency relation columns missing. Tree-stack LSTM takes this file as input for dependency parsing. We detail these models in the remainder of this section.

Lemmatization and Part of Speech Tagging
We implement MorphNet (Dayanık et al., 2018) for lemmatization and part-of-speech tagging. It is trained on the Universal Dependencies treebanks (Nivre et al., 2018). MorphNet is a sequence-to-sequence recurrent neural network model that produces a morphological analysis for each word in the input sentence. The model uses a unidirectional Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) encoder to create character-based word embeddings and a bidirectional LSTM encoder to obtain context embeddings. The decoder is a two-layer LSTM. The input to MorphNet is an N-word sentence S = [w_1, ..., w_N], where w_i is the i'th word in the sentence. Each word is input as a sequence of characters w_i = [w_i1, ..., w_iL_i], w_ij ∈ A, where A is the set of alphanumeric characters and L_i is the number of characters in word w_i.
The output for each word consists of a stem, a part-of-speech tag, and a set of morphological features, e.g. "earn+Upos=verb+Mood=indicative+Tense=past" for "earned". The stem is produced one character at a time, and the morphological information is produced one feature at a time. A sample output for a word looks like [s_i1, ..., s_iR_i, f_i1, ..., f_iM_i], where s_ij ∈ A is an alphanumeric character in the stem, R_i is the length of the stem, M_i is the number of features, and f_ij ∈ T is a morphological feature from a feature set such as T = {Verb, Adjective, Mood=Imperative, Tense=Past, ...}.
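As a small illustration of this output format (the helper name is ours, not from MorphNet's code), the analysis string can be assembled from a stem and a list of feature tokens:

```python
def format_analysis(stem, features):
    """Join a stem with its morphological features using '+' separators,
    mirroring the 'earn+Upos=verb+Mood=indicative+Tense=past' format."""
    return "+".join([stem] + features)

print(format_analysis("earn", ["Upos=verb", "Mood=indicative", "Tense=past"]))
# prints "earn+Upos=verb+Mood=indicative+Tense=past"
```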
In the word encoder we map each character w_ij to an A-dimensional character embedding vector a_ij ∈ R^A. The word encoder takes each word and processes its character embeddings from left to right, producing hidden states; the final hidden state serves as the word embedding e_i. We model the context encoder using a bidirectional LSTM. Its inputs are the word embeddings e_1, ..., e_N produced by the word encoder. The context encoder processes them in both directions and constructs a unique context embedding for each target word in the sentence. For a word w_i we define its corresponding context embedding c_i ∈ R^{2H} as the concatenation of the forward hidden state →c_i ∈ R^H and the backward hidden state ←c_i ∈ R^H produced after the forward and backward LSTMs process the word embedding e_i. The figure illustrates the creation of the context vector for the target word "earned".
The decoder is implemented as a two-layer LSTM network that outputs the correct tag for a single target word. By conditioning on the input embeddings and its own hidden state, the decoder learns to generate y_i = [y_i1, ..., y_iK_i], where y_i is the correct tag of the target word w_i in sentence S, y_ij ∈ A ∪ T represents both stem characters and morphological feature tokens, and K_i is the total number of output tokens (stem + features) for word w_i. The first layer of the decoder is initialized with the context embedding c_i and the second layer with the word embedding e_i.
We parameterize the distribution over possible morphological features and characters at each time step as softmax(W_s h_ij + W_sb), where h_ij is the decoder hidden state, W_s ∈ R^{|Y|×H}, and W_sb ∈ R^{|Y|}, with Y = A ∪ T the set of characters and morphological features in the output vocabulary.
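A minimal sketch of this output layer (function names and the toy dimensions are ours; the real model uses the vocabulary and hidden sizes described above):

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(W_s, W_sb, h):
    # scores = W_s @ h + W_sb: one score per output symbol in Y = A ∪ T.
    scores = [sum(w * x for w, x in zip(row, h)) + b
              for row, b in zip(W_s, W_sb)]
    return softmax(scores)

# Toy example: |Y| = 3 output symbols, hidden size H = 2.
probs = output_distribution([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
                            [0.0, 0.0, 0.1],
                            [0.2, -0.3])
assert abs(sum(probs) - 1.0) < 1e-9  # a valid probability distribution
```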

Word and Context Embeddings
We benefit from the pre-trained word embeddings of Kırnap et al. (2017) in our parser. Both word and context embeddings are extracted from the language model described in Section 3.1 of Kırnap et al. (2017).

Features
We use a limited number of continuous embeddings in the parser model: POS, word, context, and morphological feature embeddings. Word and context embeddings are pre-trained and not fine-tuned during training. POS and morphological feature embeddings are randomly initialized and learned during training.
Abbrev  Feature
c       context embedding
v       word embedding
p       universal POS tag
f       morphological features

Morphological Feature Embeddings
We introduce morphological feature embeddings, which differ from earlier models, as an additional input to our model. Each feature is represented with a 128-dimensional continuous vector.
We find experimentally that vector sizes lower than 128 reduce the performance of the parser, while sizes higher than 128 do not bring further improvement. We form a word's morphological feature embedding by adding the feature vectors of that word. For example, suppose we are given the word "it" with the following morphological features: Case=Nom, Gender=Neut, Number=Sing, Person=3, and PronType=Prs. We simply sum the corresponding 5 unique feature vectors to obtain the morphological feature embedding. However, our experiments suggest that not all languages benefit from morphological feature embeddings (see Section 4 for details).
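The summation can be sketched as follows (the embedding table here is randomly initialized for illustration; in the parser these vectors are learned during training):

```python
import random

DIM = 128  # dimensionality used in the paper

# Hypothetical embedding table: each feature maps to a 128-dim vector.
random.seed(0)
feature_vectors = {}

def feature_vector(feature):
    # Randomly initialize on first use (a stand-in for learned parameters).
    if feature not in feature_vectors:
        feature_vectors[feature] = [random.gauss(0, 0.01) for _ in range(DIM)]
    return feature_vectors[feature]

def morph_embedding(features):
    """Sum the per-feature vectors to get one embedding for the word."""
    total = [0.0] * DIM
    for f in features:
        total = [t + v for t, v in zip(total, feature_vector(f))]
    return total

emb = morph_embedding(["Case=Nom", "Gender=Neut", "Number=Sing",
                       "Person=3", "PronType=Prs"])
assert len(emb) == DIM
```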

Dependency Label Embeddings
Each distinct dependency label defined in the CoNLL 2018 UD Shared Task is represented with a 128-dimensional continuous vector. These vectors are combined to construct hidden states in the tree-RNN part of our model. We randomly initialize these vectors and learn them during training.

ArcHybrid Transition System
We implement the ArcHybrid transition system, which has three components, namely a stack of tree fragments σ, a buffer of unused words β, and a set A of dependency arcs, c = (σ, β, A). Initially the stack is empty, there are no arcs, and all the words of the sentence are in the buffer. This system has three types of transitions:
• shift(σ, b|β, A) = (σ|b, β, A)
• left_d(σ|s, b|β, A) = (σ, b|β, A ∪ {(b, d, s)})
• right_d(σ|s'|s, β, A) = (σ|s', β, A ∪ {(s', d, s)})
where | denotes concatenation and (b, d, s) is a dependency arc between b (head) and s (modifier) with label d. The system terminates parsing when the buffer is empty and the stack has only one word, assumed to be the root.
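The transitions above can be sketched as a small Python class (a minimal illustration with our own names, not the paper's actual implementation; labels are plain strings):

```python
class ArcHybrid:
    """Minimal sketch of the ArcHybrid transition system described above."""

    def __init__(self, words):
        self.stack = []            # σ: partially processed words
        self.buffer = list(words)  # β: unused words, front at index 0
        self.arcs = []             # A: (head, label, modifier) triples

    def shift(self):
        # Move the front of the buffer onto the stack.
        self.stack.append(self.buffer.pop(0))

    def left(self, label):
        # Attach the stack top as a modifier of the buffer front.
        s = self.stack.pop()
        self.arcs.append((self.buffer[0], label, s))

    def right(self, label):
        # Attach the stack top as a modifier of the second stack item.
        s = self.stack.pop()
        self.arcs.append((self.stack[-1], label, s))

    def is_terminal(self):
        # Buffer empty and a single word (the root) left on the stack.
        return not self.buffer and len(self.stack) == 1

# Parse "the cat sleeps" with a hand-written action sequence.
p = ArcHybrid(["the", "cat", "sleeps"])
p.shift()        # stack: [the]
p.left("det")    # arc: cat -det-> the
p.shift()        # stack: [cat]
p.left("nsubj")  # arc: sleeps -nsubj-> cat
p.shift()        # stack: [sleeps]
assert p.is_terminal()
assert p.arcs == [("cat", "det", "the"), ("sleeps", "nsubj", "cat")]
```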

Tree-stack LSTM
Tree-stack LSTM has four main components: the buffer's β-LSTM, the stack's σ-LSTM, the actions' LSTM, and the tree's RNN (tree-RNN, or t-RNN for short). We aim to represent each component of the transition system, c = (σ, β, A), with a distinct LSTM, similar to the stack-LSTM approach. The initial inputs to these LSTMs are embeddings obtained by concatenating the features explained in Section 3.3. Our model differs from that approach by representing actions and dependency relations separately and by including morphological feature embeddings. Our transition system (see Section 3.6 for details) is also different from theirs.
The buffer's β-LSTM is initialized with a zero hidden state and fed with input features from the last word to the first. Similarly, the stack's σ-LSTM is initialized with a zero hidden state and fed with input features from the first word of the stack to the last. The actions' LSTM also starts with a zero hidden state and is updated after each action. Inputs to the σ-LSTM and β-LSTM are updated via the tree-RNN.
We update either the buffer's or the stack's input embeddings based on parsing actions. For instance, suppose β_i is the top word in the buffer, σ_i is the final word in the stack, and the left_d transition is taken in the current state. The tree-RNN uses the concatenation of the previous embedding σ_i and the dependency relation embedding (explained in Section 3.5) as the hidden state h_{t−1}. The input to the tree-RNN is the previous word embedding β_i. The output h_t becomes the new word embedding for the buffer's top word, β_i-new. Figure 5 depicts this flow. Similarly to the left transition, the right transition updates the stack's second top word; the hidden state of the RNN is calculated by concatenating the stack's top word and the dependency relation embedding. There are 73 distinct actions covering shift and the labeled left and right actions. We randomly initialize a 128-dimensional vector for each labeled action and for shift. These vectors become the input to the action-LSTM shown in Figure 6. The concatenation of the final hidden layers of the stack's LSTM, the buffer's LSTM, and the actions' LSTM becomes the input to an MLP, which outputs the probabilities of each transition for the next step.
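A rough sketch of this embedding update, using a simple Elman-style RNN cell to stand in for the tree-RNN (the cell type, function names, and toy dimensions here are our assumptions, simplified from the description above):

```python
import math
import random

random.seed(1)
H, E = 4, 3  # toy hidden and embedding sizes (the paper uses far larger ones)

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rnn_step(W_h, W_x, b, h_prev, x):
    # Elman-style cell: h_t = tanh(W_h h_{t-1} + W_x x + b)
    s = [a + c + d for a, c, d in zip(matvec(W_h, h_prev), matvec(W_x, x), b)]
    return [math.tanh(v) for v in s]

def rand_mat(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

# After a left_d transition: the hidden state is the concatenation of the
# stack-top embedding and the label embedding; the input is the buffer-top
# embedding; the output replaces the buffer-top embedding.
sigma_i = [0.1] * (H - 1)     # stack-top word embedding (toy size)
label_emb = [0.2]             # dependency label embedding (toy size)
beta_i = [0.3] * E            # buffer-top word embedding
h_prev = sigma_i + label_emb  # concatenation, length H
h_t = rnn_step(rand_mat(H, H), rand_mat(H, E), [0.0] * H, h_prev, beta_i)
assert len(h_t) == H and all(-1.0 < v < 1.0 for v in h_t)
```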

Training
Our training strategy varies based on the size of the training data. We divide the datasets into four groups: those with 100k tokens or more, those with between 50k and 100k tokens, those with between 20k and 50k tokens, and those with fewer than 20k tokens.
For languages with more than 50k tokens of training data, we employ morphological feature embeddings as an additional input dimension (see Figure 2). For languages with fewer than 50k tokens, we do not use this feature dimension. Finally, we observe that for languages with more than 100k tokens, morphological feature embeddings do not improve parsing performance, but we still use this additional feature dimension.
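The resulting rule can be sketched as a single threshold check (the function name is ours; the 50k cutoff follows the description above):

```python
def use_morph_feats(train_tokens):
    """Decide whether to enable morphological feature embeddings,
    following the 50k-token threshold described above."""
    return train_tokens > 50_000

assert not use_morph_feats(20_000)  # low-resource: feature dimension disabled
assert use_morph_feats(80_000)      # mid-size: enabled and reportedly helpful
assert use_morph_feats(150_000)     # large: enabled, though reportedly not helpful
```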
We use 5-fold cross-validation for languages without development data. We do not change the LSTMs' hidden dimensions, but we record the number of epochs needed for convergence. The average of these epoch counts is then used to train a model on the whole training set.

Optimization and Hyper-Parameters
We conduct experiments to find the best set of hyper-parameters. We start with a dimension of 32 and increase it by powers of two, up to 512 for the LSTM hidden dimensions and 1024 for the LM projection matrix (explained below). We report the best hyper-parameters in this paper. Although performance does not decrease beyond the best setting, we choose the smallest best size so as not to sacrifice training speed.
All the LSTMs and the tree-RNN have a hidden dimension of 256. The vectors extracted from the LM have a dimension of 950, but we reduce them to 512 by a matrix-vector multiplication; this matrix is also learned. We use the Adam optimizer with default parameters (Kingma and Ba, 2014). Training is terminated if the performance does not improve for 9 epochs.
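A sketch of the learned dimensionality reduction (a plain matrix-vector product with a randomly initialized matrix standing in for the learned one; in training W would be updated by backpropagation):

```python
import random

random.seed(2)

def project(W, x):
    # Linear projection: reduce a 950-dim LM vector to 512 dimensions.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

W = [[random.gauss(0, 0.01) for _ in range(950)] for _ in range(512)]
lm_vec = [0.1] * 950
reduced = project(W, lm_vec)
assert len(reduced) == 512
```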

Results
In this section we inspect our best and worst results and the conclusions we reached during the CoNLL 2018 UD Shared Task experiments. We submitted our system to the CoNLL 2018 UD Shared Task as the "KParse" team. Our scores are provided on the official CoNLL 2018 UD Shared Task website as well as in Table 4.1. All experiments are done with UD version 2.2 datasets, using (Nivre et al., 2018) and (Nivre et al., 2017) for training and testing respectively. The model improves performance by reducing hand-crafted feature selection. To analyze our tree-stack LSTM, we compare it with Kırnap et al. (2017), which shares similar feature interests and the same transition system with our model. The difference between the two models is that Kırnap et al. rely on hand-crafted feature selection from the parser state, e.g., the number of left children of the buffer's first word, whereas tree-stack LSTM only needs raw features and previous parsing actions.
Our model performs comparatively better on languages with fewer than 50k training tokens, e.g., sv_lines, hu_szeged, and tr_imst. However, as the number of training examples increases, the performance improvement saturates slightly, e.g., on ar_padt and en_ewt. This may be due to convergence problems of our model. This conclusion also agrees with our official ranking in the CoNLL 2018 UD Shared Task: our ranking on low-resource languages is 10th, but our general ranking is 16th.
We next analyze the performance gain from including morphological features for languages with between 50k and 100k training tokens. As we deduce from Table 3 (morphological feature embeddings in some languages having more than 50k and fewer than 100k tokens of training data), these features help in this range; we do not observe a performance enhancement for languages with more than 100k training tokens.

Languages without Training Data
We have three criteria for choosing a trained model for languages without training data. If there is a training corpus for the same language, we use that as the parent. If there is no data for the same language, we pick a parent language from the same family. If there is more than one candidate parent for a language, we select the parent with more training data.
We list our selections in Table 4.
Language   Parent Language
en_pud     en_ewt
ja_modern  ja_gsd
cs_pud     cs_pdt
sv_pud     sv_talbanken
fi_pud     fi_tdt
th_pud     id_gsd
pcm_nsc    en_ewt
br_keb     en_ewt
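The three criteria can be sketched as a selection function (the treebank inventory, family groupings, and token counts below are illustrative, not the actual UD v2.2 sizes):

```python
# Hypothetical inventory: treebank -> (language, family, training tokens).
TREEBANKS = {
    "en_ewt": ("en", "Germanic", 205_000),
    "en_gum": ("en", "Germanic", 54_000),
    "sv_talbanken": ("sv", "Germanic", 67_000),
    "fi_tdt": ("fi", "Uralic", 163_000),
}

def pick_parent(language, family):
    # 1) Prefer a treebank of the same language; 2) else one from the same
    # family; 3) break ties by choosing the one with more training data.
    same_lang = [t for t, (lang, fam, n) in TREEBANKS.items() if lang == language]
    same_family = [t for t, (lang, fam, n) in TREEBANKS.items() if fam == family]
    candidates = same_lang or same_family
    if not candidates:
        return None
    return max(candidates, key=lambda t: TREEBANKS[t][2])

assert pick_parent("en", "Germanic") == "en_ewt"  # same language, most data
assert pick_parent("da", "Germanic") == "en_ewt"  # no Danish: largest Germanic
```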

Discussion
We use the tree-stack LSTM model in a transition-based dependency parsing framework. Our main motivation for this work is to reduce the human-designed features extracted from state components. Our results show that the model is able to learn better than its predecessors. Moreover, we observe that the model performs better on low-resource languages among the systems compared in the CoNLL 2018 UD Shared Task. We also introduce morphological feature embeddings, which prove useful for dependency parsing. All of our work is done in transition-based dependency parsing, which sacrifices performance due to locality and non-projectivity. This study opens the question of adapting the tree-stack LSTM to graph-based dependency parsing. Our code is publicly available at https://github.com/kirnap/ku-dependency-parser2.