An Empirical Investigation of Structured Output Modeling for Graph-based Neural Dependency Parsing

In this paper, we investigate structured output modeling for the state-of-the-art graph-based neural dependency parser (Dozat and Manning, 2017). With evaluations on 14 treebanks, we empirically show that global output-structured models generally obtain better performance, especially on the sentence-level Complete Match metric. However, probably because neural models already learn good global views of the inputs, the improvement brought by structured output modeling is modest.

In their pioneering work, besides the neural architecture, Dozat and Manning (2017) adopt a simple head-selection training objective (Zhang et al., 2017), treating the original structured prediction task as a head-classification task during training. Although this simplification works well in practice, it has problems. Because the training objective is locally normalized (see §2.2), no global tree-structured information can be back-propagated during training. This can lead to a discrepancy between training and testing, since at test time the MST (Maximum Spanning Tree) algorithm (McDonald et al., 2005b) is used to ensure valid tree structures. This problem raises concerns about the structured output layer. Several previous neural graph parsers utilized structured techniques (Pei et al., 2015; Kiperwasser and Goldberg, 2016; Zhang et al., 2016; Wang and Chang, 2016; Ma and Hovy, 2017), but their neural architectures might not be competitive with the current state-of-the-art deep biaffine (BiAF) parsing model. In this paper, building upon the BiAF-based neural architecture, we empirically investigate the effectiveness of classical structured prediction techniques for output modeling in graph-based neural dependency parsing. We show that structured output modeling can obtain better performance, especially on the sentence-level metrics. However, the improvements are modest, probably because neural models make the problem easier to solve locally.

Output Modeling
In structured prediction tasks, a structured output y is predicted given an input x. We refer to the encoding of x as input modeling, and to the modeling of the structured output y as output modeling. Output modeling concerns modeling dependencies and interactions across multiple output components and assigning them proper scores. A common strategy for scoring a complex output structure is to factorize it into sub-structures, which is referred to as factorization. A further normalization step is needed to form the final score of an output structure. We explain these concepts in more detail in the context of graph-based dependency parsing.

Factorization
The output structure of dependency parsing is a collection of dependency edges forming a single-rooted tree. Graph-based dependency parsers factorize the outputs into specifically-shaped sub-trees (factors). Under the assumption that the sub-trees are independent of each other, the score of the output tree structure (T) is the combination of the scores of the individual sub-trees in the tree.
In the simplest case, the sub-trees are the individual dependency edges connecting each modifier to its head word, written (m, h). This is referred to as first-order factorization (Eisner, 1996; McDonald et al., 2005a), which is adopted by Dozat and Manning (2017) and by the neural parsing models in this work. There are further extensions to higher-order factors, which consider more complex sub-trees with multiple edges (McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010; Ma and Zhao, 2012). We leave the exploration of these higher-order graph models to future work.
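As an illustration of first-order factorization, the following minimal Python sketch scores a tree by summing the scores of its edges. The matrix layout (`edge_scores[m][h]`, with index 0 as an artificial root) and the name `tree_score` are assumptions made for this example, not details of the parser's implementation.

```python
def tree_score(edge_scores, heads):
    """First-order tree score: the sum of the scores of the edges (m, h).

    edge_scores[m][h] is the score of attaching modifier m to head h.
    Index 0 is an artificial root; heads[m] is the head of token m for
    m >= 1, and heads[0] is an unused placeholder.
    """
    return sum(edge_scores[m][heads[m]] for m in range(1, len(heads)))


# Toy three-token sentence (hypothetical scores).
scores = [
    [0.0, 0.0, 0.0, 0.0],  # row 0 is unused: the root is never a modifier
    [1.0, 0.0, 3.0, 0.5],  # token 1
    [4.0, 0.5, 0.0, 1.0],  # token 2
    [0.2, 0.1, 2.5, 0.0],  # token 3
]
heads = [None, 2, 0, 2]    # token 2 is the root word; tokens 1 and 3 attach to it
print(tree_score(scores, heads))  # 3.0 + 4.0 + 2.5 = 9.5
```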

Normalization
After obtaining the individual scores of the sub-structures, we need to compute the score of the whole output structure. The main question is at what scale to normalize the output scores. For graph-based parsing, there are three main options: Global, Local and Single, which follow different structured output constraints and correspond to different loss functions.
Global. Global models normalize directly at the level of whole tree structures, whose scores are obtained by summing the raw scores of the sub-trees without any local normalization. This is clearest under a probabilistic, CRF-like treatment, where a final normalization is performed over all possible trees: $p(T \mid x) = \exp\big(\sum_{(m,h) \in T} \mathrm{Score}(m,h)\big) \,/\, \sum_{T' \in \mathcal{T}_0} \exp\big(\sum_{(m,h) \in T'} \mathrm{Score}(m,h)\big)$. Here, the normalization is carried out in the exact output space of all legal trees ($\mathcal{T}_0$). Max-Margin (Hinge) loss (Taskar et al., 2004) adopts a similar idea, though there is no explicit normalization in its formulation. The output space can be further constrained by requiring the trees to be projective (Kübler et al., 2009). Several manual-feature-based (McDonald et al., 2005b; Koo and Collins, 2010) and neural dependency parsers (Pei et al., 2015; Kiperwasser and Goldberg, 2016; Zhang et al., 2016; Ma and Hovy, 2017) utilize global normalization.
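To make global normalization concrete, the sketch below computes the log-probability of a tree by enumerating every legal single-rooted tree and normalizing over them. This brute-force enumeration is only feasible for toy sentences and is not how a practical parser computes the normalizer. The function names and the `scores[m][h]` layout (index 0 as an artificial root) are assumptions made for this example.

```python
import math
from itertools import product


def is_tree(heads):
    """True if heads (heads[0] unused, 0 = root) is a single-rooted tree."""
    n = len(heads) - 1
    if sum(1 for m in range(1, n + 1) if heads[m] == 0) != 1:
        return False
    for m in range(1, n + 1):
        seen, node = set(), m
        while node != 0:          # walk up towards the root
            if node in seen:      # revisiting a node means there is a cycle
                return False
            seen.add(node)
            node = heads[node]
    return True


def all_trees(n):
    """Yield every legal single-rooted tree over n tokens (exponential cost)."""
    for assignment in product(range(n + 1), repeat=n):
        heads = [None, *assignment]
        if is_tree(heads):
            yield heads


def global_log_prob(scores, gold_heads):
    """log p(T | x) with normalization over all legal trees."""
    n = len(gold_heads) - 1

    def score(heads):
        return sum(scores[m][heads[m]] for m in range(1, n + 1))

    log_z = math.log(sum(math.exp(score(t)) for t in all_trees(n)))
    return score(gold_heads) - log_z
```

Because the normalizer ranges only over legal trees, cyclic or multi-rooted head assignments receive no probability mass at all.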
Local. Local models, in contrast, ignore the global tree constraints and view the problem as a head-selection classification problem (Fonseca and Aluísio, 2015; Zhang et al., 2017; Dozat and Manning, 2017). The structured constraint that local models follow is that each word is attached to one and only one head node. Based on this, the edge scores are locally normalized over all possible head nodes. Under a probabilistic treatment, this can be framed as a softmax output: $p(h \mid m, x) = \exp(\mathrm{Score}(m,h)) \,/\, \sum_{h'} \exp(\mathrm{Score}(m,h'))$. In this way, the model only sees and learns head-attaching decisions for each individual word. Therefore, the model is unaware of the global tree structure and may assign probability to cyclic, non-tree structures, which are illegal outputs for dependency parsing. Despite this defect, the local model has the merits of simplicity and efficiency in training.
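The local, head-selection view can be sketched in a few lines of Python: each token gets its own softmax over candidate heads, and the log-probabilities are summed. The function name, the `scores[m][h]` layout (index 0 as an artificial root) and the choice to exclude self-attachment from the candidates are assumptions made for this example.

```python
import math


def local_log_prob(scores, heads):
    """Sum of per-token softmax log-probabilities of the chosen heads.

    scores[m][h] is the score of attaching modifier m to head h; index 0 is
    an artificial root and heads[0] is an unused placeholder.
    """
    total = 0.0
    for m in range(1, len(heads)):
        # Candidate heads: the root and every other token (assumption:
        # a token cannot be its own head).
        candidates = [h for h in range(len(heads)) if h != m]
        log_z = math.log(sum(math.exp(scores[m][h]) for h in candidates))
        total += scores[m][heads[m]] - log_z
    return total


# With uniform scores, the two-token cycle 1 -> 2 -> 1 receives the same
# probability as a legal tree, because no tree constraint is ever checked.
zeros = [[0.0] * 3 for _ in range(3)]
cycle = [None, 2, 1]
legal = [None, 0, 1]
print(math.exp(local_log_prob(zeros, cycle)), math.exp(local_log_prob(zeros, legal)))
```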
Single (Binary). If we further remove the single-head constraint, we arrive at an even simpler binary-classification model for each single edge, referred to as the "Single" model, which predicts the presence or absence of a dependency relation for every pair of words. Eisner (1996) first used this model in syntactic dependency parsing, and Dozat and Manning (2018) applied it to semantic dependency parsing. Here, the score of each edge is normalized against a fixed score of zero, forming a sigmoid output: $p((m,h) \mid x) = \exp(\mathrm{Score}(m,h)) \,/\, (\exp(\mathrm{Score}(m,h)) + 1)$. We only show the scoring formula for brevity. In training, since this binary classification problem can be highly imbalanced, we sample only a subset of the negative instances (edges). In practice, we find that a ratio of 2:1 (negative to positive) strikes a good balance; that is, for each token, we use its correct head word as the positive instance and randomly sample two other tokens in the sentence as negative instances.
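The Single model with negative sampling can be sketched as a binary cross-entropy over one positive edge and a few sampled negative edges per token. This is a minimal illustration under stated assumptions: the names `log_sigmoid` and `single_loss`, the `scores[m][h]` layout, and the decision to let the artificial root (index 0) be sampled as a negative head are choices made for this example.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def single_loss(scores, heads, num_negatives=2, rng=random):
    """Binary cross-entropy over the gold edge and sampled negative edges."""
    n = len(heads) - 1
    loss = 0.0
    for m in range(1, n + 1):
        # Positive instance: the gold head of token m.
        loss -= log_sigmoid(scores[m][heads[m]])
        # Negative instances: a random sample of the remaining candidate heads.
        wrong = [h for h in range(n + 1) if h not in (m, heads[m])]
        for h in rng.sample(wrong, min(num_negatives, len(wrong))):
            loss -= log_sigmoid(-scores[m][h])
    return loss
```

Passing a seeded `random.Random` instance as `rng` makes the sampled negatives reproducible.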

Summary
The normalization methods described above indicate the structured output constraints that each model follows. For the loss function, we consider two options: probabilistic loss (Prob), which requires actual normalization over the output space, and Max-Margin Hinge loss (Hinge), which only requires loss-augmented decoding in the same output space. Table 1 summarizes the methods (normalization and loss function) that we investigate in our experiments. For global models, we consider both Projective (Proj) and Non-Projective (NProj) constraints. Specific algorithms are required for the probabilistic loss (a variation of the Inside-Outside algorithm for projective parsing (Paskin, 2001) and the Matrix-Tree Theorem for non-projective parsing (Koo et al., 2007; Smith and Smith, 2007; McDonald and Satta, 2007)) and for the hinge loss (Eisner's algorithm for projective parsing (Eisner, 1996) and the Chu-Liu-Edmonds algorithm for non-projective parsing (Chu and Liu, 1965; Edmonds, 1967; McDonald et al., 2005b)). For Single and Local models, we only utilize probabilistic loss, since in preliminary experiments we found that hinge loss performed worse. No special algorithms other than simple enumeration are needed for them in training. In testing, we adopt non-projective algorithms for the non-global models unless otherwise noted.
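For the non-projective probabilistic loss, the normalizer over all single-rooted trees can be obtained from a determinant via the Matrix-Tree Theorem. The sketch below follows the single-root Laplacian construction as commonly presented for this setting; it is a pure-Python illustration under stated assumptions (the names `determinant` and `log_partition`, the `scores[m][h]` layout with index 0 as the root, and the absence of any overflow protection), not the implementation used in the experiments.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                      # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def log_partition(scores):
    """log Z over all single-rooted (non-projective) dependency trees."""
    n = len(scores) - 1

    def weight(h, m):
        # Exponentiated score of attaching modifier m to head h.
        return math.exp(scores[m][h])

    lap = [[0.0] * n for _ in range(n)]
    for h in range(1, n + 1):
        for m in range(1, n + 1):
            if h == 1:
                lap[0][m - 1] = weight(0, m)        # first row: root scores
            elif h == m:
                lap[h - 1][m - 1] = sum(
                    weight(k, m) for k in range(1, n + 1) if k != m)
            else:
                lap[h - 1][m - 1] = -weight(h, m)
    return math.log(determinant(lap))
```

The probabilistic loss of a gold tree is then its summed edge score subtracted from `log_partition(scores)`. A practical implementation would work in a numerically safer way and on batched tensors.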

Settings
We evaluate the parsers on 14 treebanks: the English Penn Treebank (PTB), the Penn Chinese Treebank (CTB) and 12 selected treebanks from Universal Dependencies (v2.3). We follow the standard data preparation conventions of Ma et al. (2018). Please refer to the supplementary material for more details on data preparation.
For the neural architecture, we also follow the settings in Dozat and Manning (2017) and Ma et al. (2018) and utilize the deep BiAF model. For the input, we concatenate representations of words, part-of-speech (POS) tags and characters. Word embeddings are initialized with the pre-trained fastText word vectors for all languages. For POS tags and character information, we use POS embeddings and a character-level Convolutional Neural Network (CNN), respectively. For the encoder, we adopt three layers of bi-directional LSTM to obtain contextualized representations, while our decoder is the deep BiAF scorer of Dozat and Manning (2017). We tune hyperparameters only lightly, using the Local model on the PTB development set, and then use the same values for all models and datasets. More details of the hyperparameter settings are provided in the supplementary material. Note that our exploration only concerns the final output layer, which does not contain any trainable parameters, and all our comparisons are based on exactly the same neural architecture and hyperparameter settings. Only the output normalization methods and the loss functions differ.
We run all experiments with our own implementation, which is written in PyTorch. All experiments are run on one TITAN-X GPU. In training, global models take around twice the time of the local and single models, while in testing their decoding costs are similar.

Results
We run each model three times with different random initializations and report the averaged test-set results in Table 2. Due to space limitations, we report only LAS (Labeled Attachment Score) and LCM (Labeled Complete Match) in the main text; the unlabeled scores UAS (Unlabeled Attachment Score) and UCM (Unlabeled Complete Match) are included in the supplementary material. The evaluations on PTB and CTB exclude punctuation, while on UD we evaluate on all tokens (including punctuation), following the setting of the LAS metric in the CoNLL shared tasks (Zeman et al., 2017, 2018).
Overall, the global models (footnote 4) perform consistently better, especially on the Complete Match metrics, showing the effectiveness of being aware of global structures. However, the performance gaps between global and local models are small. More surprisingly, the single models, which ignore all structure, lag behind by only around 0.4 on average. In some way, this shows that input modeling, including the distributed input representations, contextual encoders and parts of the decoders, makes the structured decision problem easier to solve locally. Neural models seem to squeeze the room for improvement that structured output modeling can bring.
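The two reported metrics, LAS and LCM, can be computed as in the following minimal sketch. The input format (one list of `(head, label)` pairs per sentence) and the function name are assumptions made for this example, and punctuation filtering is left out.

```python
def las_and_lcm(gold, pred):
    """Return (LAS, LCM) as percentages.

    gold and pred are parallel lists of sentences; each sentence is a list of
    (head, label) pairs, one per token. A token counts as correct only if both
    its head and its label match; a sentence counts as a complete match only
    if every token in it is correct.
    """
    correct = total = complete = 0
    for gold_sent, pred_sent in zip(gold, pred):
        matches = sum(g == p for g, p in zip(gold_sent, pred_sent))
        correct += matches
        total += len(gold_sent)
        complete += matches == len(gold_sent)
    return 100.0 * correct / total, 100.0 * complete / len(gold)


gold = [[(2, "nsubj"), (0, "root")], [(0, "root"), (1, "obj")]]
pred = [[(2, "nsubj"), (0, "root")], [(0, "root"), (1, "nmod")]]
print(las_and_lcm(gold, pred))  # (75.0, 50.0)
```

The unlabeled variants (UAS, UCM) follow by comparing heads only.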

Analysis
We further analyze output constraints and input modeling. For brevity, we analyze only PTB and use only probabilistic models. Single models are excluded because of their poorer performance.
First, we study the influence of output constraint differences between training and testing. Here, we include a naive "Greedy" decoding algorithm which simply selects the most probable head for each token. This does not ensure that the outputs are trees and corresponds to the head-classification method adopted by local models. The results of different models and training/testing algorithms are shown in Figure 1. Interestingly, the discrepancies between training and testing are only detrimental when the output constraint in testing is looser than that in training (the left corner in the figure), as shown by the poorer results for the training-testing pairs "NProj-Greedy", "Proj-Greedy" and "Proj-NProj". Generally, projective decoding is the best choice since PTB contains mostly (99.9%) projective trees.
Footnote 4: Projective global models perform worse on average than non-projective ones, since some of the treebanks contain a non-negligible portion of non-projective trees (for example, only 88% of the trees in 'cs-pdt' are projective).
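The three test-time output constraints compared above (Greedy, NProj, Proj) differ in what they guarantee about the output. The sketch below shows greedy head selection together with checks for tree-ness and projectivity. All names and the `scores[m][h]` layout (index 0 as an artificial root) are assumptions made for this example, and no MST or Eisner decoder is implemented here.

```python
def greedy_decode(scores):
    """Pick the highest-scoring head for each token independently."""
    n = len(scores) - 1
    heads = [None]
    for m in range(1, n + 1):
        candidates = [h for h in range(n + 1) if h != m]
        heads.append(max(candidates, key=lambda h: scores[m][h]))
    return heads


def is_tree(heads):
    """True if heads (heads[0] unused, 0 = root) is a single-rooted tree."""
    n = len(heads) - 1
    if sum(1 for m in range(1, n + 1) if heads[m] == 0) != 1:
        return False
    for m in range(1, n + 1):
        seen, node = set(), m
        while node != 0:
            if node in seen:      # cycle: this token never reaches the root
                return False
            seen.add(node)
            node = heads[node]
    return True


def is_projective(heads):
    """True if no two dependency arcs cross when drawn above the sentence."""
    arcs = [(min(m, h), max(m, h)) for m, h in enumerate(heads) if m > 0]
    return not any(l1 < l2 < r1 < r2 for l1, r1 in arcs for l2, r2 in arcs)


# Greedy decoding can produce a cycle: token 1 prefers token 2 and vice versa.
scores = [[0.0, 0.0, 0.0], [0.1, 0.0, 2.0], [0.3, 1.5, 0.0]]
heads = greedy_decode(scores)
print(heads, is_tree(heads))
```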
Next, we study the interactions between "weaker" neural architectures (for input modeling) and output modeling. We consider three "weaker" models: (1) "No-Word" ignores all lexical inputs and is a purely delexicalized model; (2) "Simple-CNN" replaces the RNN encoder with a much simpler encoder, a single-layer CNN with a window size of three, for the purpose of studying weak models; (3) "No-Encoder" completely removes the encoder, leading to a model that does not use any contextual information. Here, since we are testing on PTB, which contains almost only projective trees, we use projective decoding for all models. As shown in Figure 2, when input modeling is weaker, the improvements brought by the global model generally get larger. The LCM for "No-Encoder" is an outlier, probably because this model is too weak to obtain reasonable complete matches. The results show that with weaker input modeling, the parser can generally benefit more from structured output modeling. In some way, this also indicates that better input modeling can make the problem depend less on the global structures, so that local models are able to obtain competitive performance.

Discussion and Conclusion
In this paper, we call the models that are aware of the whole output structures "global". In fact, with a neural architecture that can capture features from the whole input sentence, all the models we explore have a "global" view of the inputs. Our experiments show that with this kind of global input modeling, good results can be obtained even when ignoring certain output structures, and further enhancement from global output structures provides only small benefits. This might suggest that input and output modeling capture certain similar information and have overlapping functionality for the structured decisions.
There are various possible extensions for future work. We will further explore the interactions between input and output modeling for structured prediction tasks. It will also be interesting to adopt even stronger input models, especially those enhanced with contextualized representations from ELMo (Peters et al., 2018) or BERT (Devlin et al., 2018). A limitation of this work is that we only explore first-order graph-based parsers; that is, for the factorization part, we do not consider higher-order sub-tree structures. This part will surely be interesting and important to explore.