A Morphology-Based Representation Model for LSTM-Based Dependency Parsing of Agglutinative Languages

We propose two word representation models for agglutinative languages that better capture the similarities between words which have similar tasks in sentences. Our models highlight the morphological features in words and embed morphological information into their dense representations. We have tested our models on an LSTM-based dependency parser with character-based word embeddings proposed by Ballesteros et al. (2015). We participated in the CoNLL 2018 Shared Task on multilingual parsing from raw text to universal dependencies as the BOUN team. We show that our morphology-based embedding models improve the parsing performance for most of the agglutinative languages.


Introduction
This paper describes our submission to the CoNLL 2018 Shared Task (Zeman et al., 2018b) on parsing of Universal Dependencies (UD) (Nivre et al., 2016). We propose morphologically enhanced character-based word embeddings to improve the parsing performance, especially for agglutinative languages. We apply our approach to the transition-based dependency parser of Ballesteros et al. (2015), which uses stack Long Short-Term Memory structures (LSTMs) to predict the parser state. This parser uses character-level word representations, which have been shown to perform better for languages with rich morphology (Dozat et al., 2017). From our experiments on the UD version 2.2 data sets (Zeman et al., 2018a), we observe that including morphological information in a character-based word embedding model yields a better learning of the relationships between words and increases the parsing performance for most agglutinative languages with rich morphology.
The rest of the paper is organized as follows: Section 2 provides a brief description of the LSTM-based dependency parser used in this study and introduces our embedding models. Section 3 gives the implementation details of our system and describes the training strategies we apply to different languages. Section 4 discusses our results on the shared task as well as the post-evaluation experiments, and Section 5 concludes the paper.

Parsing Model
We use the LSTM-based parser of Ballesteros et al. (2015). It is an improved version of the state-of-the-art transition-based dependency parser proposed by Dyer et al. (2015) and uses stack LSTM structures with push and pop operations to learn representations of the parser state. Instead of lookup-based word representations, bidirectional LSTM modules are used to create character-based encodings of words. With this character-based modelling, the authors obtain improvements on the dependency parsing of many morphologically rich languages.

Character Embeddings of Words
The character-based word embedding model of Ballesteros et al. (2015), which uses bi-LSTMs, is depicted in Figure 1. The authors compute character-based vector representations of words using bi-LSTMs. Their embedding system reads each word character by character from the beginning to the end and computes an embedding vector of the character sequence, denoted as w in Figure 1. The system also reads the word character by character from the end to the beginning, and the resulting embedding is denoted as ←w. These two embedding vectors and the learned representation of the POS tag of the word, t, are concatenated to produce the vector representation of the word. A linear mapping of POS tags to integers is used to create the POS-tag representations.
Figure 1: Vector representation of the word travel with the character-based embedding model of Ballesteros et al. (2015).
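The representation above can be sketched schematically as follows. In the real model the two passes are learned LSTMs; here `encode` is a deliberately simple stand-in (a position-weighted sum of per-character vectors, so the output still depends on character order), and `CHAR_DIM`, `char_vec`, and `pos_vec` are illustrative names, not the authors' implementation.

```python
import hashlib

CHAR_DIM = 4

def char_vec(ch):
    # Deterministic pseudo-random vector for a character; a placeholder
    # for a learned character embedding.
    h = hashlib.md5(ch.encode("utf-8")).digest()
    return [b / 255.0 for b in h[:CHAR_DIM]]

def encode(chars):
    # Stand-in for an LSTM pass: a position-weighted sum, so the result
    # is sensitive to character order, as an LSTM state would be.
    out = [0.0] * CHAR_DIM
    for i, ch in enumerate(chars, start=1):
        out = [o + i * x for o, x in zip(out, char_vec(ch))]
    return out

def pos_vec(tag):
    # Stand-in for the learned POS-tag embedding.
    return char_vec(tag)

def word_representation(word, pos_tag):
    forward = encode(word)             # w:  read left to right
    backward = encode(reversed(word))  # <-w: read right to left
    return forward + backward + pos_vec(pos_tag)

vec = word_representation("travel", "NOUN")
print(len(vec))  # three CHAR_DIM-sized parts concatenated -> 12
```

The final vector is simply the concatenation of the forward pass, the backward pass, and the POS-tag embedding, mirroring the three blocks in Figure 1.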

Morphology-based Character Embeddings
To improve the parsing performance of the LSTM parser with character-based word embeddings mentioned in Section 2.1, we include the morphological information of words in the embedding model. In agglutinative languages like Turkish, a stem usually takes different suffixes, and in this way different meanings are created from a single root word. Words that share the same suffixes tend to have similar roles in a sentence. For instance, gerunds in Turkish are a kind of derivational suffix, and verbs that take the same gerund as a suffix usually have the same role in sentences. Table 1 shows some statistics of verbs with gerunds in the development data set of the Turkish-IMST treebank for demonstration purposes. The first column shows some example suffixes that attach to verbs and turn them into adverbs. The second column shows the number of verbs with the corresponding suffix in the development set. The third column shows the statistics of the dependency labels of these verbs. As can be seen from the table, these suffixes help determine the role of the word they attach to. Therefore, representing each word using its corresponding lemma and suffixes separately and utilizing the morphological information of words can improve the parsing performance in agglutinative languages.
Figure 2: Vector representation of the word travel with the character-based embedding model.

Lemma-Suffix Model
For agglutinative languages where the stem of a word does not change in different word forms, we created a model that uses the lemma and suffix information of words in character-based embeddings. In this model, each word is separated into its lemma and suffixes. The embedding system first reads the lemma of the word character by character from the beginning to the end and computes an embedding vector of the character sequence of the lemma, denoted as r. It then reads the lemma character by character from the end to the beginning, and the resulting embedding is denoted as ←r. A similar process is performed for the suffixes of the word, and the produced vectors are denoted as s and ←s. These four embedding vectors and the vector representation of the POS tag of the word, t, are then concatenated to produce the vector representation of the word. POS-tag representations are created by linearly mapping the POS tags to integers, as in Section 2.1. The vector representation of an example word using this model is depicted in Figure 3.
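The five-part concatenation described above can be sketched as follows. `toy_encode` is a stand-in for the bi-LSTM passes of the real model, and all names here are illustrative rather than taken from the authors' code; the lemma/suffix split for the Turkish example assumes lemma "sanatçı" and suffix "ydı".

```python
def toy_encode(chars):
    # Placeholder "encoder": an order-sensitive weighted sum of code
    # points, standing in for an LSTM pass over the characters.
    return [sum((i + 1) * ord(c) for i, c in enumerate(chars)) / 1000.0]

def lemma_suffix_representation(lemma, suffix, pos_tag):
    r = toy_encode(lemma)                  # lemma, left to right
    r_back = toy_encode(reversed(lemma))   # lemma, right to left
    s = toy_encode(suffix)                 # suffixes, left to right
    s_back = toy_encode(reversed(suffix))  # suffixes, right to left
    t = toy_encode(pos_tag)                # POS-tag embedding stand-in
    return r + r_back + s + s_back + t

# e.g. Turkish "sanatçıydı" split as lemma "sanatçı" + suffix "ydı"
vec = lemma_suffix_representation("sanatçı", "ydı", "NOUN")
print(len(vec))  # r, <-r, s, <-s, t -> 5 one-dimensional parts
```

The only difference from the baseline representation is that the character sequence is encoded in two separately read pieces (lemma and suffixes) instead of one.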

Morphological Features Model
Table 1: Number of occurrences of some example suffixes and the corresponding dependency labels of verbs with these suffixes in the development data of the Turkish-IMST treebank.

The lemma-suffix model is suitable only for agglutinative languages that make use of suffixes to create different word forms. For languages that do not have this type of grammar, we created another model where specific morphological features of each word are embedded into the dense representations of the words. The reason behind this choice is that some morphological features have a direct impact on identifying the dependency labels of words. For instance, if a word has a case feature and its value is accusative, then it is usually an object of the sentence. By extracting and utilizing such morphological features, we can improve the parsing accuracy for languages that suit this model. In this model, the embedding of a word is first created character by character as in Section 2.1. Then, an embedding vector for each of its selected morphological features is created by reading the feature value character by character from the beginning to the end. Finally, these embedding vectors and the vector representation of the POS tag of the word are concatenated to produce the vector representation of the word.
The vector representation of an example word using its morphological features is shown in Figure 4.
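A sketch of this representation, using the German example from Figure 4: the FEATS column of a conll-u token is parsed, a fixed set of selected features is looked up (an absent feature contributes an empty string), and each feature value is encoded and concatenated with the character-based word vector and the POS vector. The selected features for German follow the paper; the encoder itself is a placeholder.

```python
SELECTED_FEATURES = ["Case", "Mood", "Tense", "VerbForm"]  # German selection

def parse_feats(feats_column):
    # conll-u FEATS column: "Name=Value|Name=Value|..." or "_" if empty.
    if feats_column in ("_", ""):
        return {}
    return dict(pair.split("=", 1) for pair in feats_column.split("|"))

def toy_encode(chars):
    # Placeholder for the character-level LSTM encoder.
    return [sum((i + 1) * ord(c) for i, c in enumerate(chars)) / 1000.0]

def features_representation(word, pos_tag, feats_column):
    feats = parse_feats(feats_column)
    vec = toy_encode(word) + toy_encode(reversed(word))  # char-based part
    for name in SELECTED_FEATURES:
        # Absent features (e.g. Case for "war") become the empty string.
        vec += toy_encode(feats.get(name, ""))
    return vec + toy_encode(pos_tag)

feats = "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin"
vec = features_representation("war", "AUX", feats)
print(len(vec))  # 2 word parts + 4 feature parts + 1 POS part -> 7
```

Note that only the selected features enter the representation; Number and Person in the example are parsed but ignored for German.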

Implementation
The systems participating in the CoNLL 2018 Shared Task on UD Parsing are expected to parse raw text without any gold-standard pre-processing such as tokenization, lemmatization, and morphological analysis. However, baseline pre-processed versions of the raw text produced by the UDPipe system (Straka et al., 2016) are available for participants who want to focus only on the dependency parsing task. We used the automatically annotated version of the corpora provided by UDPipe, since our primary aim is to observe the effect of our embedding models on the dependency parsing of agglutinative languages.

Table 2: Lemma and suffix separation with and without a morphological analyzer and disambiguator on the Turkish sentence "Her şeyden önce sanatçıydı." (English meaning: "She was an artist before anything else.")

In the implementation of the lemma-suffix embedding model, we did not use any morphological analyzer or disambiguator tools to find the lemmas and suffixes of the words. Instead, for each word in the treebank we extracted its corresponding lemma from the conll-u version of the treebank data and subtracted the lemma from the word to find the suffix. We compared these two approaches on the Turkish-IMST treebank, using the Turkish morphological parser and disambiguator by Sak et al. (2008). A comparison of the two approaches on an example sentence is shown in Table 2. We observed that finding the suffixes by subtracting the lemmas from the words gives the same parsing performance as using a morphological analyzer tool to find the lemma and suffixes of a word. We therefore opted not to use a morphological analyzer and disambiguator for the languages with the lemma-suffix embedding model, due to the additional cost of these tools.

Figure 4: Character-based word embedding of the German word war ("was" in English), whose morphological features are Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin, using the morphological features embedding model. The selected features for German are Case, Mood, Tense, and VerbForm. Since there is no Case feature among the morphological features of war, the Case feature is represented with an empty string in the word vector of war.
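A minimal sketch of the analyzer-free suffix extraction described above: the lemma comes from the conll-u file, and the suffix is whatever remains of the surface form. A case-insensitive prefix check with a fallback covers surface forms that diverge from the lemma; this is an illustrative heuristic, not the authors' exact code.

```python
def split_lemma_suffix(word, lemma):
    # "Subtract" the conll-u lemma from the surface form.
    if word.lower().startswith(lemma.lower()):
        return word[: len(lemma)], word[len(lemma):]
    # Fallback when the stem changed on the surface: keep the whole
    # word as the lemma part and leave the suffix empty.
    return word, ""

# Turkish example from the sentence in Table 2 (assumed lemma "sanatçı"):
print(split_lemma_suffix("sanatçıydı", "sanatçı"))  # ('sanatçı', 'ydı')
```

Because the approach needs only the LEMMA column already present in the treebank, it adds no external tool dependency to the pipeline.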

Embedding Model Selection for Different Languages
We applied the lemma-suffix model of Section 2.2 to Buryat, Hungarian, Kazakh, Turkish, and Uyghur, because these languages have agglutinative morphology, take suffixes, and the stem of a word usually does not change in different word forms. We also applied this model to Danish to observe its effect on the parsing performance of a language with little inflectional morphology. For the languages that do not follow this scheme, we applied the morphological features embedding model of Section 2.2. Table 3 shows the morphological features selected for these languages in the shared task. We selected four morphological features from the input conll-u files for most of the languages. For French, Indonesian, and Old French, we used fewer than four features because there are fewer than four common morphological features in the conll-u files of these languages.
For Persian, Japanese, Korean, Vietnamese, and Chinese, we used the baseline embedding model due to the lack of representative morphological features in their corresponding conll-u files.

Languages without Training Data
We trained a mixed-language parser model with the morphological features embedding model for the languages with no training data. For training this parser model, we used the mixed-language training data supplied by the organizers of the shared task, which is created by including the first 200 sentences of each treebank. In the shared task, this model was applied to the Buryat-KEB, Czech-PUD, English-PUD, Faroese-OFT, Japanese-Modern, Naija-NSC, Swedish-PUD, and Thai-PUD treebanks.
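The mixed-language corpus described above can be assembled as follows: conll-u sentences are blank-line-separated blocks, and the first 200 of each treebank are concatenated. The helper names are illustrative; the shared task organizers supplied the actual data.

```python
def first_sentences(conllu_text, n=200):
    # Split a conll-u file into sentence blocks (blank-line separated)
    # and keep the first n.
    blocks = [b for b in conllu_text.strip().split("\n\n") if b.strip()]
    return blocks[:n]

def build_mixed_corpus(treebank_texts, n=200):
    # Concatenate the first n sentences of every treebank into one
    # conll-u training file.
    mixed = []
    for text in treebank_texts:
        mixed.extend(first_sentences(text, n))
    return "\n\n".join(mixed) + "\n\n"

tiny = "1\tHi\t_\n\n1\tBye\t_\n\n1\tThird\t_\n"
print(len(first_sentences(tiny, 2)))  # -> 2
```

The resulting file is an ordinary conll-u corpus, so the same training pipeline can be used for the mixed-language model as for any single treebank.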
We also trained parser models for the Upper Sorbian-UFAL and Galician-TreeGal treebanks using the morphological features embedding model, and for the Buryat-BDT treebank using the lemma-suffix embedding model. However, due to software issues, we used the mixed-language parser model for these treebanks in the shared task.

Training Specifications
Our model mostly uses the same hyper-parameter configuration as the original settings of the parser of Ballesteros et al. (2015), with a few exceptions: we used a stochastic gradient descent trainer with a learning rate of 0.13. We adapted the parser to take input and produce output in conll-u format. The source code of our modified version of the LSTM-based parser can be found at https://github.com/CoNLL-UD-2018/BOUN.
A full run over the 82 test sets takes about 3 hours when no pre-trained embeddings are used and 20 hours when the CoNLL-17 pre-trained word embeddings (Ginter et al., 2017) are used on the TIRA virtual machine (Potthast et al., 2014). The largest of the test sets needs 4 GB of memory without pre-trained word vectors. When the CoNLL-17 pre-trained vectors are used, memory usage can reach 32 GB depending on the pre-trained vector sizes.

Results
This section presents the parsing performance of our parser models in the CoNLL-18 Shared Task as well as the post-evaluation scores of our models.

Shared Task
Table 4 shows our official LAS, MLAS, and BLEX results in the CoNLL-18 Shared Task. The models that use the CoNLL-17 pre-trained word embeddings (Ginter et al., 2017) are indicated in the pre-trained vectors column. We also trained parser models using pre-trained word embeddings for Czech-PDT, German-GSD, English-EWT, English-GUM, English-LinES, Spanish-AnCora, Indonesian-GSD, Latvian-LVTB, Swedish-LinES, Swedish-Talbanken, and Turkish-IMST. However, we could not run these models with their corresponding embedding files inside the TIRA virtual machine due to unknown memory and disk issues.

Although the parser we used does not achieve competitive performance compared with the best-performing systems in the shared task, it achieves relatively better performance on the treebanks with no training data than on the treebanks with training data. We exclude the parallel UD treebanks from this judgment, because one can obtain better performance on a parallel UD treebank by training the parser on the training data of a treebank in the same language (e.g., the training data of English-EWT for English-PUD, Czech-PDT for Czech-PUD, etc.). Due to time constraints, we did not focus on the parallel UD treebanks and treated them as unknown languages.

Post-Evaluation
We performed another set of experiments using our models on the test data of UD version 2.2 data sets. The purpose of these experiments is to investigate the effect of our embedding models on parsing performance. Here we used the gold-standard conll-u files instead of the automatically annotated corpora by UDPipe, since our aim in these experiments is to observe the performance difference between our embedding models and the baseline embedding model.
In Table 5, we compare our models with the baseline model of Ballesteros et al. (2015). Due to time constraints, we trained all models without pre-trained word embeddings.
From the comparative results shown in Table 5, we observe that for the languages that have rich inflectional and derivational processes, realized mostly by adding suffixes to words, our morphological features model outperforms the baseline model in terms of parsing scores. This is the case for Bulgarian, Croatian, Czech, Basque, Gothic, Latin, Polish, Russian, Slovak, Slovene, North Sami, and Ukrainian. The morphological features model is not suitable for the grammatical structure of Arabic, which has derivational morphology, and it also fails to outperform the baseline in Romance languages such as French, Spanish, Catalan, Galician, and Portuguese. A possible reason behind this failure might be the analytic structure of the grammar of these languages, where every morpheme is an independent word. English, Hebrew, Hindi, and Urdu are also categorized as mostly analytic languages, which do not use inflections and have a low morpheme-per-word ratio (Moravcsik, 2013). Dutch, Norwegian, and Swedish have a very simplified inflectional grammar. So, these languages are not represented well by our morphology-based embedding models. Besides, our model is not the best choice for languages that have a high ratio of morphophonological modifications to the root word, like Old Church Slavonic.
The lemma-suffix embedding model is applied to Danish, Hungarian, Kazakh, Turkish, and Uyghur. The best performance is reached for Hungarian, with more than a 4% increase in LAS score. Our model outperforms the baseline in Turkish too. These are highly agglutinative languages in which words may consist of several morphemes and the boundaries between morphemes are clear-cut. In this type of language, there is a one-to-one form-meaning correspondence and the shape of a morpheme is invariant (Moravcsik, 2013). An example word-morpheme relationship in Hungarian and Turkish is shown in Table 6. As can be seen from the table, this structure is very well suited to the lemma-suffix embedding model. However, the lemma-suffix model fails to reach better performance than the baseline system on the Kazakh and Uyghur treebanks. A possible reason might be that our embedding model unnecessarily increases the complexity of the system for these languages with very little training data. Although Danish can be considered an analytic language with a simplified inflectional grammar, the lemma-suffix model outperforms the baseline for this language.

Table 7 shows the parsing scores of the parser with the lemma-suffix embedding model on the test data of the Turkish-IMST treebank version 2.2. We compared the parsing performance when the parser does not use pre-trained word embeddings, when it uses the CoNLL-17 UD pre-trained word embeddings, and when it uses word vectors trained on Wikipedia by Facebook (Bojanowski et al., 2017). From the results, we observe that the use of pre-trained word vectors increases the parsing performance to a great extent for Turkish. We also observe that the Facebook word vectors outperform the CoNLL-17 UD word vectors, although the number of words in the Facebook vectors data set is much smaller than in the CoNLL-17 UD word vectors data set.

Conclusion
We introduced two morphology-based adaptations of the character-based word embedding model of Ballesteros et al. (2015) and experimented with these models on the UD version 2.2 data set. The experimental results suggest that our models, which utilize the morphological information of words, increase the parsing performance for agglutinative languages.