Initial Explorations of CCG Supertagging for Universal Dependency Parsing

In this paper we describe the system built by the METU team for universal dependency parsing of multilingual text. We use a neural network-based dependency parser that takes a greedy transition-based approach to dependency parsing. CCG supertags contain rich structural information that proves useful in certain NLP tasks, so we experiment with them as additional features. The neural network parser is trained on dependencies together with simplified CCG tags and the other features provided.


Introduction
Combinatory Categorial Grammar (CCG) (Steedman, 2000) is widely used in natural language processing for its generative expressiveness and its transparent interface between syntax and the underlying semantic interpretation. CCG has been used for creating fast and accurate parsers (Hockenmaier and Steedman, 2002; Clark and Curran, 2007; Auli and Lopez, 2011; Lewis and Steedman, 2014). In addition, because CCG is a lexicalised grammar, the structural information in its categories has been shown to improve the performance of various other systems when used indirectly. Examples are multilingual dependency parsing and machine translation (Ambati et al., 2013a; Çakıcı, 2008; Birch et al., 2007).
In this paper, we describe a system we created for the CoNLL 2017 Shared Task, Multilingual Parsing from Raw Text to Universal Dependencies (Nivre et al., 2016, 2017b). Results are announced at http://universaldependencies.org/conll17/results.html. We use CCG categories induced from the CCGbank (Hockenmaier and Steedman, 2007) to supertag different languages with these structural information-packed tags. We aim to show that CCG categories for English may be used to improve parsing results for other languages, especially similar ones.
In the next section, we give a brief background on the dependency parsing problem and CCG categories that have been shown to improve performance on various tasks either directly or indirectly. Section 3 gives the implementation details for the METU system and in Section 4 results are discussed.

Background
Combinatory Categorial Grammar is a lexicalised grammar formalism with a transparent syntax-semantics interface, which means one can create rich semantic interpretations in parallel with parsing (Steedman, 2000). Several fast and highly accurate CCG parsers have been introduced in the literature. These parsers make use of the CCGbank (Hockenmaier and Steedman, 2002), which was created by inducing a CCG grammar from the Penn Treebank (Marcus et al., 1993). The CCG category of each word is extracted; these categories are referred to as supertags or CCG tags throughout the paper.
Different types of supertags such as the ones encoding predicate argument structure or morphosyntactic information have been shown to increase parsing performance in several studies starting with Bangalore and Joshi (1999). The importance of supertagging in parsing accuracy has been shown in various studies such as Falenska et al. (2015), Ouchi et al. (2014) and Foth et al. (2006) for different types of supertags such as combinations of dependency labels and dependent positions, and by Clark and Curran (2004) for CCG categories as supertags. The use of induced CCG grammar was also evaluated as an extrinsic model in Bisk et al. (2016). They show that although using full CCG derivation trees is superior, CCG lexicon-based grammars also increase performance in a semantics task.
CCG categories have been successfully used in parsing studies as external features and were shown to increase performance. Note that this use of CCG categories does not give full access to the power of the CCG formalism; rather, it provides a way to use the rich structural information in the form of supertags. Çakıcı (2008) first used CCG categories automatically induced from the Turkish treebank as extended (fine) tags (Buchholz and Marsi, 2006) in McDonald's parser (McDonald et al., 2005). CCG categories were then used for Hindi dependency parsing by Ambati et al. (2013a). Birch et al. (2007) and Nadejde et al. (2017) showed that statistical machine translation benefits from using structurally rich CCG categories (supertags/tags) in the source or target language.
Chen and Manning (2014) proposed a neural network classifier for use in a greedy, transition-based dependency parser. They created a three-layer network in which the input layer is fed with word, POS tag and label embeddings; after the feed-forward step, the error is back-propagated to the input layer in order to tune the embeddings. They initialized the POS tag and label embeddings randomly. For pre-trained word vectors, however, they used a combination of the embeddings in Collobert et al. (2011) and their own trained 50-dimensional word2vec vectors. As this parser only learns and uses dense features with word representations, it parses at least twice as fast as its closest competitor while also improving accuracy by 2% for English and Chinese.
Following Chen and Manning (2014), Andor et al. (2016) also created a transition-based dependency parser based on neural networks and word embeddings. They proposed using global instead of local model normalization to overcome the label bias problem of feed-forward neural networks. Their parser achieved higher accuracies than earlier studies in English, Chinese and some other languages, as stated in their results.
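The greedy transition-based parsing that the classifier of Chen and Manning (2014) drives can be illustrated with a minimal arc-standard sketch. Everything below is illustrative: the function names are hypothetical, arcs are unlabeled, and a static oracle over gold heads stands in for the neural classifier so that the example runs on its own.

```python
from typing import Callable, List, Tuple

Arc = Tuple[int, int]  # (head, dependent); tokens are numbered from 1, 0 is ROOT
Chooser = Callable[[List[int], List[int]], str]


def greedy_parse(n_tokens: int, choose: Chooser) -> List[Arc]:
    """Arc-standard parsing; `choose` stands in for the trained classifier."""
    stack, buffer, arcs = [0], list(range(1, n_tokens + 1)), []
    while buffer or len(stack) > 1:
        action = choose(stack, buffer)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT" and len(stack) > 2:
            # The second item on the stack becomes a dependent of the top item.
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT" and len(stack) > 1:
            # The top item becomes a dependent of the item below it.
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        else:
            raise ValueError(f"illegal action {action!r}")
    return arcs


def make_oracle(heads: List[int]) -> Chooser:
    """Static oracle: heads[i - 1] is the gold head of token i (0 = ROOT)."""
    def choose(stack: List[int], buffer: List[int]) -> str:
        if len(stack) > 2 and heads[stack[-2] - 1] == stack[-1]:
            return "LEFT"
        if len(stack) > 1 and heads[stack[-1] - 1] == stack[-2]:
            # Attach the top item only after all of its dependents are attached.
            if all(heads[b - 1] != stack[-1] for b in buffer):
                return "RIGHT"
        return "SHIFT"
    return choose


# "She eats fish": "eats" (2) is the root; "She" (1) and "fish" (3) depend on it.
arcs = greedy_parse(3, make_oracle([2, 0, 2]))
```

In a trained parser, the oracle would be replaced by a classifier that scores the three actions from features of the current stack and buffer, and the highest-scoring legal action would be taken at each step.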
Dozat and Manning (2016) created a neural graph-based dependency parser. They used biaffine classifiers to predict arcs and labels, achieving state-of-the-art or competitive accuracies on graph-based parsing for six languages. They improved LAS and UAS by 1% over the previous most accurate graph-based parser.
The transition-based dependency parser of Kuncoro et al. (2016), which uses recurrent neural network grammars, is the current state of the art. It also outperforms the best-performing graph-based parser, increasing attachment scores by almost 2% for English.

Method
Gold-standard CCG categories exist for only a few languages. In order to explore the effects of CCG supertagging on multilingual universal dependencies, we assign simplified CCG-based supertags to the multilingual data by using dependency relations from the English CCGbank (Hockenmaier and Steedman, 2007). We label the training and development data sets for different languages using a tagger trained on English supertagged data. We then use the supertagged training and development data sets for each language to train the CCG-based supertaggers and dependency parsers that are applied to the test data sets in the UD treebanks. Figure 1 shows the overall system that we use for the multilingual universal dependency parsing task. We train separate dependency parsing models for each language.
Section 3.1 explains how we transfer CCG-based supertags to the UD training and development data sets, Section 3.2 explains how we train a sequence tagger for assigning supertags to test data, and Section 3.3 explains how the dependency parser is trained using these supertags.

Assigning CCG Categories Using Dependency Relations
Dependency relations and the predicate-argument structures encoded in CCG categories are parallel most of the time, even though the parent-child directions differ in many cases. Figure 2 shows a sentence from the PTB (and CCGbank) with dependency relations above the sentence and predicate-argument relations below it. Many of the edges are symmetric between the dependency relations and the predicate-argument structure derived from the lexical categories. Figure 3 shows an enlarged view of part of the sentence in Figure 2 in order to make the labels clearer.
To supertag the data, the CCGbank (Hockenmaier and Steedman, 2007) and the Penn Treebank (Marcus et al., 1993) are first merged by converting the PTB to CoNLL-U format using the Stanford CoreNLP tool. We then align the tokens in both data sets so that CCG categories from CCGbank can be transferred to the MISC field in the CoNLL-U data. A logistic regression classifier is trained to label the tokens in the training and development sets using Scikit-Learn (Pedregosa et al., 2011). The classifier only uses universal part-of-speech tags and dependency relations to create a supertag. In Table 2, two adverbs are shown with different CCG categories in two languages. Note that the directions of the slashes in the original forms differ because of the different word order in these languages.
We use the dependency relation labels of the word, its head, the head of its head, its dependents and the dependents of its head as features in the tagger. We also use the universal part-of-speech tag of each of these words. In English, the only language for which we have gold-standard CCG tags and dependency relations aligned, the re-tagging accuracy is 91.78%. If we additionally use extended part-of-speech tags and CCG argument positions as labels relative to (before or after) the current word, the accuracy increases to 93.55% on English for CCG categories with directional information.
However, these are not universal in all languages, so we drop these features. Table 3 shows the features that are extracted for a sample word.
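The feature set described above (dependency labels and universal POS tags of the word, its head, the head of its head, its dependents and the dependents of its head) can be sketched as a dictionary of indicator features. This is a sketch under assumptions: the feature names, the prefixes and the bag-of-indicators encoding are hypothetical, since the exact feature templates of Table 3 are not reproduced here.

```python
from typing import Dict, List

Token = Dict[str, object]  # keys: "id" (1-based), "upos", "head", "deprel"


def extract_features(sent: List[Token], idx: int) -> Dict[str, float]:
    """Indicator features for the token whose id is `idx`."""
    by_id = {tok["id"]: tok for tok in sent}
    feats: Dict[str, float] = {}

    def add(prefix: str, tok: Token) -> None:
        feats[f"{prefix}_upos={tok['upos']}"] = 1.0
        feats[f"{prefix}_deprel={tok['deprel']}"] = 1.0

    word = by_id[idx]
    add("w", word)                        # the word itself
    head = by_id.get(word["head"])        # None when the head is ROOT (0)
    if head is not None:
        add("h", head)                    # its head
        grand = by_id.get(head["head"])
        if grand is not None:
            add("hh", grand)              # the head of its head
        for sib in sent:                  # the other dependents of its head
            if sib["head"] == head["id"] and sib["id"] != idx:
                add("hd", sib)
    for child in sent:                    # its own dependents
        if child["head"] == idx:
            add("d", child)
    return feats


sentence = [
    {"id": 1, "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"id": 2, "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "upos": "NOUN", "head": 2, "deprel": "obj"},
]
features_for_she = extract_features(sentence, 1)
```

Dictionaries of this shape could be vectorised with scikit-learn's `DictVectorizer` and passed to `LogisticRegression`, which matches the classifier type named in the text; the actual pipeline used by the system may differ.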
We run this tagger, trained on English CCG data, on every language to generate training and development files with CCG-based supertags. These files are used to train the supertaggers and dependency parsers that will work on the test data.
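Writing a supertag into the MISC field of a CoNLL-U token line, as done when the supertagged files are generated, might look like the following sketch. The attribute key `CCG` and the helper names are assumptions for illustration; the source does not state how the tag is stored inside MISC.

```python
from typing import Optional

MISC = 9  # zero-based index of the MISC column in a 10-column CoNLL-U line


def add_supertag(conllu_line: str, category: str, key: str = "CCG") -> str:
    """Return the token line with `key=category` added to its MISC column."""
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")
    entry = f"{key}={category}"
    # "_" marks an empty MISC field; otherwise attributes are joined with "|".
    cols[MISC] = entry if cols[MISC] == "_" else f"{cols[MISC]}|{entry}"
    return "\t".join(cols)


def get_supertag(conllu_line: str, key: str = "CCG") -> Optional[str]:
    """Read the supertag back from the MISC column, or None if absent."""
    misc = conllu_line.rstrip("\n").split("\t")[MISC]
    for item in misc.split("|"):
        name, _, value = item.partition("=")
        if name == key:
            return value
    return None


line = "2\teats\teat\tVERB\tVBZ\t_\t0\troot\t_\t_"
tagged = add_supertag(line, "(S[dcl]\\NP)/NP")
```

Storing the tag in MISC leaves the other nine columns untouched, so the files remain valid input for tools that ignore unknown MISC attributes.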

CCG Sequence Tagging for Training
A supertagger for each language is built using the CCG-based supertags transferred with dependency relations. The tagger tries to recover the supertags that are assigned using the dependency relations. On English, the recovery accuracy is 87.36%. Table 5 shows the accuracy of the supertagger on several languages. Note that these accuracies are measured not against correct CCG categories but against the supertags assigned by the tagger explained in Section 3.1. This tagger is used on test data preprocessed by UDPipe (Straka et al., 2016).

Dependency Parsing
Two dependency parsers, powered by two different techniques, are used in this study. First, CCG-based supertags are integrated into the maximum-spanning tree parser (MSTParser) (McDonald et al., 2005), which is known as the first high-performing graph-based dependency parser. This parser uses discrete local features as input, so the supertags are added directly to this set of features. This implementation shows us the effect of supertagging in a system where the similarities between the supertag groups are not captured semantically. The following parameters are used in all experiments unless otherwise stated: order = 2, loss type = no punctuation, decode type = projective.
Second, we use the neural network-based dependency parser of Chen and Manning (2014) to observe how similarities in our supertags affect the model. This parser was chosen as a baseline because it pioneered the use of embeddings for all features in the parsing process. Furthermore, it was given as the parser in the baseline system in Straka et al. (2016), which we compare our results with. In our parser, CCG tags are represented with 100-dimensional vectors instead of discrete features. The supertags are obtained from the CCG-based supertagger described in the previous section. In the experiments, the following parameters are used: embedding size = 100 and hidden layer size = 200. For word embeddings, pre-trained 100-dimensional word embeddings are used. For POS tags and supertags, vectors are initialized randomly and are fed to the neural network during training.
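How supertag embeddings can enter the input layer of a Chen and Manning (2014) style network alongside word and POS tag embeddings can be sketched as follows. The sizes (100 and 200) come from the text and the cube activation follows Chen and Manning (2014). The number of feature slots, the initialisation range, the random word vectors and all names are assumptions made to keep the example self-contained; they are not the system's actual configuration.

```python
import random
from typing import Dict, List, Sequence

EMB_SIZE = 100     # embedding size reported in the text
HIDDEN_SIZE = 200  # hidden layer size reported in the text


class EmbeddingTable:
    """Lookup table whose vectors are created randomly on first use."""

    def __init__(self, rng: random.Random, dim: int = EMB_SIZE) -> None:
        self.rng = rng
        self.dim = dim
        self.vectors: Dict[str, List[float]] = {}

    def lookup(self, key: str) -> List[float]:
        if key not in self.vectors:
            # Assumed initialisation range; the text only says "randomly".
            self.vectors[key] = [self.rng.uniform(-0.01, 0.01)
                                 for _ in range(self.dim)]
        return self.vectors[key]


def build_input(words: Sequence[str], pos_tags: Sequence[str],
                supertags: Sequence[str], word_emb: EmbeddingTable,
                pos_emb: EmbeddingTable, tag_emb: EmbeddingTable) -> List[float]:
    """Concatenate the embeddings of all feature slots into one input vector."""
    x: List[float] = []
    for table, keys in ((word_emb, words), (pos_emb, pos_tags),
                        (tag_emb, supertags)):
        for key in keys:
            x.extend(table.lookup(key))
    return x


def hidden_activations(x: Sequence[float], weights: Sequence[Sequence[float]],
                       bias: Sequence[float]) -> List[float]:
    """One hidden layer with the cube activation of Chen and Manning (2014)."""
    return [(sum(w * v for w, v in zip(row, x)) + b) ** 3
            for row, b in zip(weights, bias)]
```

Because each supertag is a dense vector rather than a discrete indicator, supertags that behave alike in training can end up with nearby vectors, which is the effect the comparison with MSTParser is meant to expose.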

Results
First, we present our experiments before the release of the test data, then we present the results on the shared task.

Pre-evaluation
In the MSTParser pre-evaluation experiments, we use the Penn Treebank Wall Street Journal split with sections 2-21 for training and section 23 as the test set. The extra training parameters are as follows: training-k = 5, loss-type = nopunc, decode-type = proj, order = 2, unless stated otherwise in the results. Detailed descriptions of these parameters can be found in . These parameters reproduce results similar to McDonald and Pereira (2006), which we use as a baseline to compare our improvements. The version in which the CCG supertags were added also uses the same configuration.
In the pre-evaluation experiment with the Chen and Manning (2014) parser, we also use the Penn Treebank for English, with sections 2-21 for training, section 22 for development and section 23 as the test set. For the word embedding file, 50-dimensional GloVe data is used (Pennington et al., 2014). The extra configuration parameters are maxIter = 20000, trainingThreads = 10 and embeddingSize = 50, where maxIter is the maximum number of iterations in neural network training, trainingThreads is the number of threads to use during training and embeddingSize is the embedding vector size for words, POS tags and supertags.
Reproduction of the results from the original study and our results with our supertags are given in Table 7. These results are obtained by the evaluation method of the original parser.
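The attachment scores reported in these tables can be illustrated with a minimal sketch of unlabeled and labeled attachment score (UAS and LAS). This is not the original parser's evaluation code: it assumes one (head, label) pair per token with identical tokenization in gold and prediction, and it does not model any punctuation filtering the original evaluator may apply.

```python
from typing import List, Tuple

Analysis = List[Tuple[int, str]]  # one (head index, dependency label) per token


def attachment_scores(gold: Analysis, pred: Analysis) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    n = len(gold)
    # UAS counts tokens with the correct head, whatever the label.
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens with both the correct head and the correct label.
    las = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas / n, 100.0 * las / n


gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
```

In this example the third token has the wrong head and the fourth has the right head but the wrong label, so three of four heads and two of four labeled arcs are correct.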
Table 7: Accuracy of the Chen and Manning (2014) parser with CCG tags in English (pre-evaluation)
Also in the pre-evaluation phase, we test our Chen and Manning (2014) parser-based system on Turkish, German and French data. The data provided by the shared task is used for training and development. The word embeddings are 100-dimensional vectors. Apart from this difference, all other configuration parameters are the same as in the English experiments. Table 8 shows the results. As the results show, the tagger predicting the French supertags transferred from English data performs well since the two languages are similar. German tags are also inferable, as German is likewise close to English in grammatical structure. On the other hand, this does not hold for Turkish, as the two languages are grammatically quite different. Word order is one of the major differences between these two languages that might have affected the results.
[Table columns: Language, pre-UAS, pre-LAS, post-UAS, post-LAS]

Shared Task
In Table 9, we give our shared task results together with the baselines. Here, we see a drop in accuracy compared to our pre-evaluation phase. The main differences between the systems are the source of the pre-trained embeddings and the embedding sizes for words, POS tags and labels. In our development experiments, we believe we are able to capture similarities between POS tags and supertags more efficiently, as the embedding size is smaller and the cardinalities of these groups are quite low. This also applies to word embeddings. Other than this, the training iteration count remains the same across experiments. The iteration count may need to grow as we increase the number of features, so it needs to be tuned.
For each of the PUD treebanks, we select a model trained on the same language. For the languages for which we cannot train a parser, either due to lack of training data or word vectors, and also for the surprise languages, we select a similar language in the same language family.
Table 11 shows the LAS score of the system on different categories of treebanks as reported in the shared task paper. All treebanks are shown with the official macro-averaged LAS F1 score. As expected, the system performs better on big treebanks, where there are more data instances. Each PUD treebank has a big treebank in the same language, so the results are close. The difference between our system and the best-performing ones is bigger for the small treebanks. The reason is that we only train on the language dataset itself and do not use data from other languages for the small treebanks. This causes sparsity issues with the supertagger and the parser. We try to use a model trained on a similar language for the surprise languages. However, this does not result in reasonable accuracy: the lexicons are usually very different even if the languages are from the same family, and the supertagging relies mostly on the lexical entries and features extracted from them, since we do not have the dependency information while decoding. POS tagging errors that propagate through the pipeline also affect the performance.
Table 12 shows the results of our system on TIRA (Potthast et al., 2014) evaluations of UD test data for each language (Nivre et al., 2017a).
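The macro-averaged LAS F1 score mentioned above can be sketched as follows: an F1 score is computed per treebank from the number of correctly attached and labeled words, and the per-treebank scores are then averaged with equal weight. The function names and treebank identifiers are hypothetical, the counts are invented purely to show the arithmetic, and the official evaluation script also handles word alignment between system and gold tokenization, which is omitted here.

```python
from typing import Dict


def las_f1(correct: int, system_words: int, gold_words: int) -> float:
    """LAS F1 (in percent) from counts of correct, system and gold words."""
    if system_words == 0 or gold_words == 0:
        return 0.0
    precision = correct / system_words
    recall = correct / gold_words
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over treebanks, so small treebanks count equally."""
    if not scores:
        raise ValueError("no treebank scores given")
    return sum(scores.values()) / len(scores)


# Invented counts for two hypothetical treebanks.
per_treebank = {
    "big_treebank": las_f1(correct=800, system_words=1000, gold_words=1000),
    "small_treebank": las_f1(correct=60, system_words=100, gold_words=100),
}
overall = macro_average(per_treebank)
```

Because the average is unweighted, a large accuracy gap on small treebanks lowers the overall score as much as the same gap on big treebanks would.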

Conclusion and Future Work
We experimented with the effects of introducing CCG-based supertags on multilingual universal dependency parsing by taking the radical approach of transferring CCG categories from English to other languages. We used the similarities in the dependency formalism and the universal POS tags to create CCG lexicons for each language included in the shared task. Since this is a diverse set of languages from different language families, with different structural and orthographic properties, this transfer is not ideal for many languages.
The existing CCG lexicons for languages such as Turkish were not used for the task, since the universal dependency release of Turkish and the dependency treebank from which the Turkish CCG lexicon was induced are not aligned at the word/tokenization level. Therefore we could not provide accuracy on that data.
We hypothesise that using CCG lexicons from a different language family, especially one with a different word order, may increase the performance of the supertaggers since English only covers a small subset of syntactic properties in a diverse set of languages.  After a manual alignment and tagging procedure, the Turkish data can be used in both training and evaluation. We can also group the languages and use similar families of languages to train a common system for them in the future.
One important addition to future work is to induce the CCG supertags in each language, including the smaller datasets, with approaches similar to those used in (Ambati et al., 2013b,a; Çakıcı, 2008), and to use these tags in our experiments. We believe adding the specific directional information in CCG categories will help in making more use of the information in the potentially very rich supertags.
Combining two or more CCG lexicons and tagging with the combined model might also be an interesting experiment.
For parsing, we plan to experiment with different word embedding sizes and tune the deep learning parameters. In addition, we will experiment with neural networks integrated into graph-based dependency parsers. In the future, we plan to use pre-trained POS and CCG tag embeddings in our experiments. If these embeddings can be extracted from corpora for all available tags, this will reduce training time and increase parsing accuracy.