Transition-based dependency parsing with topological fields

The topological field model is commonly used to describe the regularities in German word order. In this work, we show that topological fields can be predicted reliably using sequence labeling and that the predicted field labels can inform a transition-based dependency parser.


Introduction
The topological field model (Herling, 1821; Erdmann, 1886; Drach, 1937; Höhle, 1986) has traditionally been used to account for regularities in word order across different clause types of German. This model assumes that each clause type contains a left bracket (LK) and a right bracket (RK), which appear to the left and the right of the middle field (MF). Additionally, in a verb-second declarative clause, the LK is preceded by the initial field (VF), and the RK is optionally followed by the final field (NF). The abbreviations are derived from the German terms linke Klammer, rechte Klammer, Mittelfeld, Vorfeld, and Nachfeld. Table 1 gives examples of topological fields in verb-second declarative (MC) and verb-final relative (RC) clauses.
Certain syntactic restrictions can be described in terms of topological fields. For instance, only a single constituent is typically allowed in the VF, while multiple constituents are allowed in the MF and the NF. Many ordering preferences can also be stated using the model. For example, in a main clause, placing the subject in the VF and the direct object in the MF is preferred over the opposite order.
In parsing, topological field analysis is often seen as a task that is embedded in parsing itself. For instance, Kübler (2005), Maier (2006), and Cheung and Penn (2009) train PCFG parsers on treebanks that annotate topological fields as interior nodes. It is perhaps not surprising that this approach works effectively for phrase structure parsing, because topological fields favor annotations that do not rely on crossing or discontinuous dependencies (Telljohann et al., 2006).
However, the possible role of topological fields in statistical dependency parsing (Kübler et al., 2009) has not been explored much. We will show that statistical dependency parsing of German can benefit from knowledge of clause structure as provided by the topological field model.

Motivation and corpus analysis
Transition-based dependency parsers (Nivre, 2003; Kübler et al., 2009) typically use two transitions (LEFT ARC and RIGHT ARC) to introduce a dependency relation between the token on top of the processing stack and the next token on the buffer of unprocessed tokens. The decision to make an attachment, the direction of the attachment, and its label are all made by a classifier. Consequently, a good classifier must learn syntactic constraints, ordering preferences, and selectional preferences.
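To make the stack and buffer mechanics concrete, the following is a minimal sketch of one common formulation of such a transition system. The SHIFT transition, the `Configuration` class, the scripted oracle, and the `SUBJ` label are assumptions introduced for illustration; in a real parser, the oracle would be the trained classifier.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    """Parser state: a stack, a buffer, and the arcs built so far."""
    stack: list
    buffer: list
    arcs: set = field(default_factory=set)  # (head, label, dependent) triples


def shift(c):
    """Move the next buffer token onto the stack."""
    c.stack.append(c.buffer.pop(0))


def left_arc(c, label):
    """Attach the stack top as a dependent of the next buffer token."""
    dependent = c.stack.pop()
    c.arcs.add((c.buffer[0], label, dependent))


def right_arc(c, label):
    """Attach the next buffer token as a dependent of the stack top.

    The head takes its dependent's place at the front of the buffer, so
    it can still collect further dependents or be attached itself.
    """
    head = c.stack.pop()
    dependent = c.buffer.pop(0)
    c.arcs.add((head, label, dependent))
    c.buffer.insert(0, head)


def parse(tokens, oracle):
    """Apply the transitions chosen by `oracle` until the buffer is empty."""
    c = Configuration(stack=[], buffer=list(tokens))
    while c.buffer:
        action, label = oracle(c)
        if action == "LEFT_ARC" and c.stack:
            left_arc(c, label)
        elif action == "RIGHT_ARC" and c.stack:
            right_arc(c, label)
        else:
            shift(c)  # also the fallback when an arc is impossible
    return c.arcs


def scripted_oracle(actions):
    """Hypothetical stand-in for the classifier: replays a fixed action list."""
    steps = iter(actions)
    return lambda configuration: next(steps)


arcs = parse(
    ["Peter", "schläft"],
    scripted_oracle([("SHIFT", None), ("LEFT_ARC", "SUBJ"), ("SHIFT", None)]),
)
```

Every decision is taken with only the current stack and buffer in view, which is why the classifier has so little global information about the sentence.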
Since transition-based dependency parsers process sentences in one deterministic linear-time left-to-right sweep, the classifier typically has little global information. One popular approach for reducing the effect of early attachment errors is to retain some competition between alternative parses using a globally optimized model with beam search (Zhang and Clark, 2008). Beam search presents a trade-off between speed (smaller beam) and accuracy (larger beam). More recently, Dyer et al. (2015) have proposed using long short-term memory networks (LSTMs) to maintain (unbounded) representations of the buffer of unprocessed words, previous parsing actions, and constructed tree fragments. We believe that, in the case of German, the topological field model offers a linguistically motivated way of providing the parser with more global knowledge of the sentence structure. More concretely, if we give the transition classifier access to topological field annotations, it can learn regularities with respect to the fields in which the head and dependent of a particular dependency relation lie.
In the remainder of this section, we provide a short (data-driven) exploration of such regularities. Since there is a myriad of possible triples consisting of relation, head field, and dependent field, we will focus on dependency relations that virtually never cross a field and relations that nearly always cross a field. Table 2 lists the five dependency relations that cross fields the least often in the TüBa-D/Z treebank (Telljohann et al., 2006; Versley, 2005) of German newspaper text. Using these statistics, a classifier could learn hard constraints with regard to these dependency relations: they should never be used to attach heads and dependents that are in different fields.
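Statistics of this kind could be gathered along the following lines. This is a sketch under stated assumptions, not the procedure used for the tables: the token representation (dicts with `head`, `label`, and `field` keys) is hypothetical, and a relation counts as crossing whenever the field labels of head and dependent differ.

```python
from collections import Counter


def cross_field_rates(tokens):
    """Return, per dependency label, the percentage of arcs that cross a field.

    `tokens` is a list of dicts with hypothetical keys: 'head' (0-based
    index of the head token, or None for the root), 'label', and 'field'.
    """
    total = Counter()
    crossing = Counter()
    for token in tokens:
        if token["head"] is None:
            continue  # the root has no head to compare against
        total[token["label"]] += 1
        if token["field"] != tokens[token["head"]]["field"]:
            crossing[token["label"]] += 1
    return {label: 100.0 * crossing[label] / total[label] for label in total}


# Toy sentence "Der Mann schläft" with illustrative labels.
sentence = [
    {"head": 1, "label": "DET", "field": "VF"},
    {"head": 2, "label": "SUBJ", "field": "VF"},
    {"head": None, "label": "ROOT", "field": "LK"},
]
rates = cross_field_rates(sentence)
```

In the toy sentence, the determiner stays inside the VF while the subject attaches to the verb in the LK, mirroring the contrast between the two groups of relations.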

Dependency label              Cross-field (%)
Particles                     0.00
Determiner                    0.03
Adjective or attr. pronoun    0.04
Prepositional complement      0.04
Genitive attribute            0.07
Table 2: The five dependency relations that most rarely cross fields in the TüBa-D/Z.

Table 3 lists the five dependency relations that cross fields most frequently. These relations (virtually) always cross fields because they are verbal attachments and verbs typically form the LK and RK. This is somewhat informative, since a classifier should clearly avoid attaching tokens within the same field using one of these relations. However, we can gain more interesting insights by looking at the dependents' fields.

Dependency label         Cross-field (%)
Expletive es             100.00
Separated verb prefix    100.00
Subject                  100.00
Prepositional object     99.80
Direct object            99.51
Table 3: The five dependency relations that most frequently cross fields in the TüBa-D/Z.

Table 4 enumerates the three (where applicable) most frequent head and dependent field combinations of the five relations that always cross fields. As expected, the head is always in the LK or RK. Moreover, the dependents are in the VF or MF in the vast majority of cases. The actual distributions provide some insights with respect to these dependency relations. We will discuss the direct object, prepositional object, and separated verb prefix relations in some more detail.
Direct objects In German, direct objects can be put in the VF. However, we can see that direct object fronting happens only very rarely in the TüBa-D/Z. This is in line with earlier observations in corpus-based studies (cf. Weber and Müller, 2004). Since the probability of having a subject in the VF is much higher, the parser should attach the head of a noun phrase in the VF as a subject, unless there is overwhelming evidence to the contrary, such as case markers, verb agreement, or other cues (Uszkoreit, 1984; Müller, 1999).
Prepositional objects The dependency annotation scheme used by the TüBa-D/Z makes a distinction between prepositional phrases that are a required complement of a verb (prepositional objects) and other prepositional phrases. Since a statistical dependency parser does not typically have access to a valency dictionary, it has difficulty deciding whether a prepositional phrase is a prepositional object or not. Topological field information can complement verb-preposition co-occurrence statistics in deciding between these two different relations. The prepositional object mainly occurs in the MF, while a prepositional phrase headed by the LK is almost as likely to be in the VF as in the MF (42.12% and 55.70% respectively).
Separated verb prefixes Some verbs in German have separable prefixes. A complicating factor in parsing is that such prefixes are often words that can also be used by themselves. For example, in (1-a) fest is a separated prefix of bindet (present tense third person of festbinden), while in (1-b) fest is an optional adverbial modifier of gebunden (the past participle of binden).
(1) a. Sie
As with prepositional objects, a statistical parser is handicapped by not having an extensive lexicon. Again, topological fields can complement co-occurrence statistics. In (1-a), fest is in the RK. As we can see in Table 4, the separated verb prefix is always in the RK. In contrast, an adverbial modifier as in (1-b) is rarely in the RK (0.35% of the adverb cases in the TüBa-D/Z).

Predicting fields
As mentioned in Section 1, topological field annotation has often been performed as a part of phrase structure parsing. In order to test our hypothesis that topological field annotation could inform dependency parsing, it would be more appropriate to use a syntax-less approach. Several shallow approaches have been tried in the past. For instance, Veenstra et al. (2002) compare three different chunkers (finite state, PCFG, and classification using memory-based learning). Becker and Frank (2002) predict topological fields using a PCFG specifically tailored towards topological fields. Finally, Liepert (2003) proposes a chunker that uses support vector machines.
In the present work, we will treat the topological field annotation as a sequence labeling task. This is more useful in the context of dependency parsing because it allows us to treat the topological field as any other property of a token.
Topological field projection In order to obtain data for training, validation, and evaluation, we use the TüBa-D/Z treebank. Topological fields are only annotated in the constituency version of the TüBa-D/Z, where the fields are represented as special constituent nodes. To obtain token-level field annotations for the dependency version of the treebank, we project the topological fields of the constituency trees onto the tokens. The recursive projection function is provided in Appendix B. The function is initially called with the root of the tree and a special unknown field marker, so that tokens that are not dominated by a topological field node (typically punctuation) also receive the topological field feature.
We should point out that our current projection method results in a loss of information when a sentence contains multiple clauses. For instance, an embedded clause is in a topological field of the main clause, but also has its own topological structure. In our projection method, the topological field features of tokens in the embedded clause reflect the topological structure of the embedded clause.
Model Our topological field labeler uses a recurrent neural network. The inputs consist of concatenated word and part-of-speech embeddings. The embeddings are fed to a bidirectional LSTM (Graves and Schmidhuber, 2005), on which we stack a regular LSTM (Hochreiter and Schmidhuber, 1997), and finally an output layer with the softmax activation function. The use of a recurrent model is motivated by the necessity to have long-distance memory. For example, (2-a) consists of a main clause with the LK wird and RK begrünt and an embedded clause wie geplant with its own clausal structure. When the labeler encounters jetzt, it needs to 'remember' that it was in the MF of the main clause.

Parsing with topological fields
To evaluate the effectiveness of adding topological fields to the input, we use the publicly available neural network parser described by De Kok (2015). This parser uses an architecture that is similar to that of Chen and Manning (2014). However, it learns morphological analysis as an embedded task of parsing. Since most inflectional information that can be relevant for parsing German is available in the prefix or suffix, this parser learns morphological representations over character embeddings of prefixes and suffixes. We use the same parser configuration as that of De Kok (2015), with the addition of topological field annotations. We encode the topological fields as one-hot vectors in the input of the parser. This information is included for the four tokens on top of the stack and the next three tokens on the buffer.
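The one-hot encoding of fields for the four stack tokens and three buffer tokens can be sketched as follows. The field inventory, the `UNK` marker name, and the all-zero padding for missing positions are assumptions made for illustration; the actual parser input may be organized differently.

```python
# Illustrative field inventory; UNK stands for tokens outside any field.
FIELDS = ["VF", "LK", "MF", "RK", "NF", "UNK"]


def one_hot(field_label):
    """Return a one-hot vector for a field, or all zeros for None."""
    vector = [0.0] * len(FIELDS)
    if field_label is not None:
        vector[FIELDS.index(field_label)] = 1.0
    return vector


def field_features(stack, buffer, fields, n_stack=4, n_buffer=3):
    """Concatenate field vectors for the top stack and next buffer tokens.

    `stack` and `buffer` hold token indices (stack top last, next buffer
    token first); `fields` maps a token index to its field label.
    Positions without a token are padded with all-zero vectors.
    """
    top = list(reversed(stack[-n_stack:]))
    top += [None] * (n_stack - len(top))
    front = list(buffer[:n_buffer])
    front += [None] * (n_buffer - len(front))
    features = []
    for index in top + front:
        features.extend(one_hot(fields[index] if index is not None else None))
    return features


# Five tokens with hypothetical field labels; two are already on the stack.
fields = ["VF", "LK", "MF", "MF", "RK"]
features = field_features(stack=[0, 1], buffer=[2, 3, 4], fields=fields)
```

With six field labels and seven positions, the sketch adds 42 input dimensions to the classifier.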

Evaluation and results
To evaluate the proposed topological field model, we use the same partitioning of TüBa-D/Z and the word and tag embeddings as De Kok (2015). For training, validation, and evaluation of the parser, we use these splits as-is. Since we want to test the parser with non-gold topological field annotations as well, we swapped the training and validation data for training our topological field predictor.
The parser was trained using the same hyperparameters and embeddings as in De Kok (2015). Our topological field predictor is trained using Keras (Chollet, 2015). The hyperparameters that we use are summarized in Appendix A. The topological field predictor uses the same word and tag embeddings as the parser.
In Table 5, we show the accuracy of the topological field labeler. The use of a bidirectional LSTM is clearly justified, since it outperforms the stacked unidirectional LSTM by a wide margin.

Model                        Accuracy (%)
LSTM + LSTM                  93.33
Bidirectional LSTM + LSTM    97.24
Table 5: Topological field labeling accuracies. The addition of backward flowing information improves accuracy considerably.

Table 6 shows the labeled attachment scores (LAS) for parsing with topological fields. As we can see, adding gold topological field annotations provides a marked improvement over parsing without topological fields. Although the parser does not achieve quite the same performance with the output of the LSTM-based sequence labeler, it still yields a relatively large improvement over the parser of De Kok (2015). All differences are significant at p < 0.0001.

Parser                   LAS (%)
De Kok (2015)            89.49    91.88
Neural net + TFs         90.00    92.36
Neural net + gold TFs    90.42    92.76
Table 6: Parse results with topological fields and gold topological fields. Parsers that use topological field information outperform parsers without access to such information.
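For reference, the labeled attachment score is the percentage of tokens that receive both the correct head and the correct dependency label. The function below is a minimal sketch of that definition; the `(head, label)` pair representation is hypothetical, and whether tokens such as punctuation are excluded from scoring is not covered here.

```python
def labeled_attachment_score(gold, predicted):
    """Percentage of tokens whose predicted head and label both match gold.

    Each argument is a list of (head, label) pairs, one pair per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)


# Four tokens: one wrong label and one wrong head leave two correct.
gold = [(2, "SUBJ"), (0, "ROOT"), (2, "OBJ"), (2, "ADV")]
predicted = [(2, "SUBJ"), (0, "ROOT"), (2, "SUBJ"), (3, "ADV")]
las = labeled_attachment_score(gold, predicted)
```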

Result analysis
Our motivation for introducing topological fields in dependency parsing is to provide the parser with a more global view of sentence structure (Section 2). If this is indeed the case, we expect the parser to improve especially for longer-distance relations. Figure 1 shows the improvement in LAS as a result of adding gold-standard topological fields. We see a strong relation between the relation length and the improvement in accuracy. The introduction of topological fields clearly benefits the attachment of longer-distance dependents.
Since the introduction of topological fields has very little impact on short-distance relations, the differences in the attachment of relations that virtually never cross fields (Table 2) turn out to be negligible. However, for the relations that cross fields frequently, we see marked improvements (Table 7) for every relation except the prepositional object. In hindsight, this difference should not be surprising: the relations that never cross fields are usually very local, while those that almost always cross fields tend to have longer distances and/or are subject to relatively free ordering.
The ten dependency relations with the highest overall improvement in LAS are shown in Table 8. Many of these relations are special when it comes to topological field structure and were not discussed in Section 2. The relations parenthesis, dependent clause, and sentence link two clauses; the sentence root marks the root of the dependency tree; and the coordinating conjunction (clausal) relation attaches a token that is always in its own field. This confirms that the addition of topological fields also improves the analysis of the overall clausal structure.

Conclusion and outlook
In this paper, we have argued and shown that access to topological field information can improve the accuracy of transition-based dependency parsers. In the future, we plan to see how competitive the bidirectional LSTM-based sequence labeling approach is compared to existing approaches. Moreover, we plan to evaluate the use of topological fields in the architecture proposed by Dyer et al. (2015) to see how many of these regularities that approach captures.

A Hyperparameters
The topological field labeler was trained using Keras (Chollet, 2015). Here, we provide a short overview of the hyperparameters that we used:
• Solver: rmsprop. This solver is recommended by the Keras documentation for recurrent neural networks. The solver is used with its default parameters.
• Learning rate: the learning rate was determined by the function $0.01(1 + 0.02i)^{-2}$, where $i$ is the epoch. The intuition was to start with a few epochs at a high learning rate and then drop the learning rate quickly. The results were not drastically different when using a constant learning rate of 0.001.
• Epochs: the models were trained for 200 epochs, after which we picked the model of the epoch with the highest performance on the validation data (27 epochs for the unidirectional LSTM, 124 epochs for the bidirectional LSTM).
• LSTM layers: all LSTM layers were trained with 50 output dimensions. Increasing the number of output dimensions did not provide an improvement.
• Regularization: 10% dropout (Srivastava et al., 2014) was used after each LSTM layer for regularization. A stronger dropout did not provide better performance.
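The learning rate schedule listed above can be written out as a small function. This is a sketch that assumes epochs are counted from zero; the parameter names `initial` and `decay` are introduced here for illustration.

```python
def learning_rate(epoch, initial=0.01, decay=0.02):
    """Learning rate for an epoch: initial * (1 + decay * epoch) ** -2."""
    return initial * (1.0 + decay * epoch) ** -2


# One rate per epoch for the 200 training epochs.
schedule = [learning_rate(i) for i in range(200)]
```

Under this assumption, the rate starts at 0.01, falls to a quarter of that value by epoch 50, and is close to 0.001 around epoch 100. In Keras, a function of this shape could be passed to the `LearningRateScheduler` callback, although it is not stated whether the schedule was implemented that way.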

B Topological field projection algorithm
Algorithm 1 Topological field projection.
function PROJECT(node, field)
    if IS_TERMINAL_NODE(node) then
        node.field ← field
    else
        if IS_TOPO_NODE(node) then
            field ← node.field
        end if
        for child ∈ node do
            PROJECT(child, field)
        end for
    end if
end function
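A direct Python rendering of Algorithm 1 might look as follows. The `Node` class and the toy tree are hypothetical stand-ins for the treebank data structures: here, a topological field node is simply a non-terminal whose `field` attribute is preset, and `UNK` plays the role of the unknown field marker.

```python
class Node:
    """Hypothetical constituency tree node."""

    def __init__(self, label, children=(), field=None):
        self.label = label
        self.children = list(children)
        # Preset on topological field nodes, filled in on terminals.
        self.field = field


def is_terminal_node(node):
    return not node.children


def is_topo_node(node):
    return bool(node.children) and node.field is not None


def project(node, field):
    """Assign to every terminal the field of its closest topological ancestor."""
    if is_terminal_node(node):
        node.field = field
    else:
        if is_topo_node(node):
            field = node.field
        for child in node.children:
            project(child, field)


# Toy tree for "Sie schläft ." with illustrative node labels.
sie, schlaeft, punct = Node("Sie"), Node("schläft"), Node(".")
root = Node("SIMPX", [
    Node("VF", [Node("NX", [sie])], field="VF"),
    Node("LK", [Node("VXFIN", [schlaeft])], field="LK"),
    punct,
])
project(root, "UNK")
```

Because the field of the closest topological ancestor wins, tokens of an embedded clause receive the fields of that clause, and tokens outside any field node keep the marker passed in at the root.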