Dependency Parsing with Dilated Iterated Graph CNNs

Dependency parses are an effective way to inject linguistic knowledge into many downstream tasks, and many practitioners wish to efficiently parse sentences at scale. Although recent advances in GPU hardware have enabled neural networks to achieve significant gains over the previous best models, these models still fail to leverage GPUs’ capability for massive parallelism because they require sequential processing of the sentence. In response, we propose Dilated Iterated Graph Convolutional Neural Networks (DIG-CNNs) for graph-based dependency parsing, a graph convolutional architecture that allows for efficient end-to-end GPU parsing. In experiments on the English Penn TreeBank benchmark, we show that DIG-CNNs perform on par with some of the best neural network parsers.


Introduction
By vastly accelerating and parallelizing the core numeric operations for performing inference and computing gradients in neural networks, recent developments in GPU hardware have facilitated the emergence of deep neural networks as state-of-the-art models for many NLP tasks, such as syntactic dependency parsing. The best neural dependency parsers generally consist of two stages: First, they employ a recurrent neural network such as a bidirectional LSTM to encode each token in context; next, they compose these token representations into a parse tree. Transition-based dependency parsers (Nivre, 2009; Chen and Manning, 2014; Andor et al., 2016) produce a well-formed tree by predicting and executing a series of shift-reduce actions, whereas graph-based parsers (McDonald et al., 2005; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017) generally employ attention to produce marginals over each possible edge in the graph, followed by a dynamic programming algorithm to find the most likely tree given those marginals.
Figure caption: Darker cell indicates more layers include that cell's representation. Heads and labels corresponding to gold tree are indicated.
Because of their dependency on sequential processing of the sentence, none of these architectures fully exploit the massive parallel processing capability that GPUs possess. If we wish to make full use of GPU resources, graph-based dependency parsers are more desirable than their transition-based counterparts, since attention over the edge-factored graph can be parallelized across the entire sentence, unlike the transition-based parser, which must sequentially predict and perform each transition. By encoding token-level representations with an Iterated Dilated CNN (ID-CNN) (Strubell et al., 2017), we can also remove the sequential dependencies of the RNN layers. Unlike Strubell et al. (2017), who use 1-dimensional convolutions over the sentence to produce token representations, our network employs 2-dimensional convolutions over the adjacency matrix of the sentence's parse tree, modeling attention from the bottom up. By training with an objective that encourages our model to predict trees using only simple matrix operations, we also remove the computational cost of dynamic programming inference. Combining all of these ideas, we present Dilated Iterated Graph CNNs (DIG-CNNs): a combined convolutional neural network architecture and training objective for efficient, end-to-end GPU graph-based dependency parsing.
We demonstrate the efficacy of these models in experiments on the English Penn TreeBank, in which our models perform similarly to the state of the art.

Dilated Convolutions
Though common in other areas such as computer vision, 2-dimensional convolutions are rarely used in NLP since it is usually unclear how to process text as a 2-dimensional grid. However, 2-dimensional convolutional layers are a natural model for embedding the adjacency matrix of a sentence's parse.
A 2-dimensional convolutional neural network layer transforms each input element, in our case an edge in the dependency graph, as a linear function of the window of surrounding input elements (other possible edges in the dependency graph) of width $r_w$ and height $r_h$. In this work we assume square convolutional windows: $r_h = r_w$.
Dilated convolutions perform the same operation, except rather than transforming directly adjacent inputs, the convolution is defined over a wider input window by skipping over δ inputs at a time, where δ is the dilation width. A dilated convolution of width 1 is equivalent to a simple convolution. Using the same number of parameters as a simple convolution with the same radius, the δ > 1 dilated convolution incorporates broader context into the representation of a token than a simple convolution.
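As an illustration of the definition above, the following minimal sketch applies a single-channel 2-dimensional dilated convolution to a T × T grid in plain Python. The function name `dilated_conv2d`, the zero padding at the borders, and the single-channel setting are assumptions made for illustration, not details taken from our implementation.

```python
def dilated_conv2d(grid, kernel, dilation=1):
    """Single-channel 2-D dilated convolution with zero padding (illustrative).

    grid:   T x T list of floats, e.g. one feature per candidate edge (i, j)
    kernel: (2r+1) x (2r+1) list of floats, where r is the radius
    """
    T = len(grid)
    r = len(kernel) // 2
    out = [[0.0] * T for _ in range(T)]
    for i in range(T):
        for j in range(T):
            total = 0.0
            for a in range(-r, r + 1):
                for b in range(-r, r + 1):
                    # Inputs are sampled `dilation` cells apart in each direction.
                    ii, jj = i + a * dilation, j + b * dilation
                    if 0 <= ii < T and 0 <= jj < T:  # cells outside the grid count as zero
                        total += kernel[a + r][b + r] * grid[ii][jj]
            out[i][j] = total
    return out
```

With `dilation=1` this reduces to a simple convolution; larger values widen the window while the kernel keeps the same number of parameters.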

Iterated Dilated CNNs
Stacking many dilated CNN layers can easily incorporate information from a whole sentence. For example, with a radius of 1 and 4 layers of dilated convolutions, the effective input window size for each token is width 31, which exceeds the average sentence length (23) in the Penn TreeBank corpus. However, simply increasing the depth of the CNN can cause considerable over-fitting when data is sparse relative to the growth in model parameters.
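The window arithmetic can be checked with a short sketch. It assumes that the dilation width doubles at each layer starting from 1, which reproduces the width of 31 quoted above; the helper name `effective_window` is hypothetical.

```python
def effective_window(radius, num_layers):
    """Width of the input window seen by one output position.

    Assumes layer k (counting from 0) uses dilation 2**k, so each layer
    extends the window by radius * 2**k on both sides.
    """
    width = 1
    for k in range(num_layers):
        width += 2 * radius * (2 ** k)
    return width


# Radius 1, four layers: 1 + 2 * (1 + 2 + 4 + 8) = 31
print(effective_window(radius=1, num_layers=4))
```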
To address this, we employ Iterated Dilated CNNs (ID-CNNs) (Strubell et al., 2017), which instead apply the same small stack of dilated convolutions repeatedly, each time taking the result of the last stack as input to the current iteration. Applying the parameters recurrently in this way increases the size of the window of context incorporated into each token representation while allowing the model to generalize well. Their training objective additionally computes a loss for the output of each application, encouraging parameters that allow subsequent stacks to resolve dependency violations from their predecessors.

Dilated Iterated Graph CNNs
We describe how to extend ID-CNNs (Strubell et al., 2017) to 2-dimensional convolutions over the adjacency matrix of a sentence's parse tree, allowing us to model the parse tree through the whole network, incorporating evidence about nearby head-dependent relationships in every layer of the network, rather than modeling at the token level followed by a single layer of attention to produce head-dependent compatibilities between tokens. ID-CNNs allow us to efficiently incorporate evidence from the entire tree without sacrificing generalizability.

Model architecture
Let $x = [x_1, \ldots, x_T]$ be our input text. Let $y = [y_1, \ldots, y_T]$ be labels with domain size $D$ for the edge between each token $x_i$ and its head $x_j$. We predict the most likely $y$, given a conditional model $P(y \mid x)$ where the tags are conditionally independent given some features $F(x)$ for $x$:
$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid F(x)) \tag{1}$$
The local conditional distributions of Eqn. (1) come from a straightforward extension of ID-CNNs (Strubell et al., 2017) to 2-dimensional convolutions. This network takes as input a sequence of $T$ vectors $x_t$, and outputs a $T \times T$ matrix of per-class scores $h_{ij}$ for each pair of tokens in the sentence.
We denote the $k$th dilated convolutional layer of dilation width $\delta$ as $D_\delta^{(k)}$. The first layer in the network transforms the input to a graph by concatenating all pairs of vectors in the sequence $x_i$, $x_j$ and applying a 2-dimensional dilation-1 convolution $D_1^{(0)}$:
$$c_{ij}^{(0)} = D_1^{(0)} [x_i; x_j] \tag{2}$$
We denote vector concatenation with $[\cdot\,; \cdot]$. Next, $L_c$ layers of dilated convolutions of exponentially increasing dilation width are applied to $c_{ij}^{(0)}$, folding in increasingly broader context into the embedded representation of the edge $e_{ij}$ at each layer. Let $r()$ denote the ReLU activation function (Glorot et al., 2011). Beginning with $c_{ij}^{(0)}$, we define the stack of layers with the following recurrence:
$$c_{ij}^{(k)} = r\left(D_{2^{k-1}}^{(k)} c_{ij}^{(k-1)}\right), \quad 1 \le k \le L_c \tag{3}$$
and add a final dilation-1 layer to the stack:
$$c_{ij}^{(L_c+1)} = r\left(D_1^{(L_c+1)} c_{ij}^{(L_c)}\right) \tag{4}$$
We refer to this stack of dilated convolutions as a block $B(\cdot)$, which has output resolution equal to its input resolution. To incorporate even broader context without over-fitting, we avoid making $B$ deeper, and instead iteratively apply $B$ a total of $L_b$ times, introducing no extra parameters. Starting with $b_{ij}^{(1)} = B(c_{ij}^{(0)})$, we define the output of block $m$:
$$b_{ij}^{(m)} = B\left(b_{ij}^{(m-1)}\right) \tag{5}$$
We apply a simple affine transformation $W_o$ to this final representation to obtain label scores for each edge $e_{ij}$:
$$h_{ij}^{(L_b)} = W_o b_{ij}^{(L_b)} \tag{6}$$
We can obtain the most likely head (and its label) for each dependent by computing the argmax over all labels for all heads for each dependent:
$$\hat{y}_i = \operatorname*{argmax}_{j,\,d} \left[h_{ij}^{(L_b)}\right]_d \tag{7}$$
where $d$ ranges over the $D$ labels.
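The data flow described in this section can be sketched as a plain-Python skeleton. It is only an outline under stated assumptions: the convolutions themselves are left out, `block` stands for any callable that maps a T × T grid to a T × T grid, the names `pair_inputs`, `iterate_block` and `decode` are hypothetical, and nothing here prevents a token from selecting itself as head.

```python
def pair_inputs(x):
    """Build the T x T grid whose cell (i, j) holds the concatenation [x_i; x_j]."""
    return [[xi + xj for xj in x] for xi in x]


def iterate_block(block, c0, num_iters):
    """Apply the same block (shared parameters) num_iters times.

    Returns the output of every application, so that each one can be scored.
    """
    outputs = []
    b = c0
    for _ in range(num_iters):
        b = block(b)
        outputs.append(b)
    return outputs


def decode(scores):
    """Pick, for each dependent i, the (head j, label d) pair with the highest score.

    scores[i][j][d] is the score of label d for the edge from dependent i to head j.
    """
    parse = []
    for row in scores:
        candidates = (
            (j, d) for j, per_label in enumerate(row) for d in range(len(per_label))
        )
        parse.append(max(candidates, key=lambda jd: row[jd[0]][jd[1]]))
    return parse
```

Because `decode` treats every dependent independently, it needs no dynamic programming; the price is that its output is not guaranteed to be a well-formed tree.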

Training
Our main focus is to apply the DIG-CNN as feature extraction for the conditional model described in Sec. 3.1, where tags are conditionally independent given deep features, since this will enable prediction that is parallelizable across all possible edges. Here, maximum likelihood training is straightforward because the log-likelihood decouples into the sum of the log-likelihoods of independent logistic regression problems for every edge, with natural parameters given by Eqn. (6):
$$\sum_{i=1}^{T} \log P\left(y_i \mid h_{i1}^{(L_b)}, \ldots, h_{iT}^{(L_b)}\right)$$
We could also use the DIG-CNN as input features for an MST parser, where the partition function and its gradient are computed using Kirchhoff's Matrix-Tree Theorem (Tutte, 1984), but our aim is to approximate inference in a tree-structured graphical model using greedy inference and expressive features over the input in order to perform inference as efficiently as possible on a GPU.
To help bridge the gap between these two techniques, we use the training technique described in Strubell et al. (2017). The tree-structured graphical model has preferable sample complexity and accuracy since prediction directly reasons in the space of structured outputs. We instead compile some of this reasoning in output space into DIG-CNN feature extraction: rather than reasoning explicitly over output labels during inference, we train the network such that each block is predictive of output labels. Subsequent blocks learn to correct dependency violations of their predecessors, refining the final sequence prediction.
To do so, we first define predictions of the model after each of the $L_b$ applications of the block. Let $h_{ij}^{(m)}$ be the result of applying the matrix $W_o$ from (6) to $b_{ij}^{(m)}$, the output of block $m$. We minimize the average of the losses for each application of the block:
$$-\frac{1}{L_b} \sum_{m=1}^{L_b} \sum_{i=1}^{T} \log P\left(y_i \mid h_{i1}^{(m)}, \ldots, h_{iT}^{(m)}\right)$$
By rewarding accurate predictions after each application of the block, we learn a model where later blocks are used to refine initial predictions. The loss also helps reduce the vanishing gradient problem (Hochreiter, 1998) for deep architectures.
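A minimal sketch of this averaged objective follows. It assumes that each dependent's distribution is a softmax over all of its (head, label) candidates, which is one possible reading of the conditional model rather than a confirmed detail; the function names `dependent_nll` and `iterated_loss` are hypothetical.

```python
import math


def dependent_nll(scores_i, gold):
    """Negative log-probability of the gold (head, label) pair for one dependent.

    scores_i[j][d] is the score of label d with head j. The softmax over all
    (head, label) candidates is an assumption of this sketch.
    """
    flat = [s for per_label in scores_i for s in per_label]
    m = max(flat)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in flat))
    j, d = gold
    return log_z - scores_i[j][d]


def iterated_loss(block_scores, gold):
    """Average, over all applications of the block, of the summed per-dependent loss.

    block_scores[m][i][j][d] holds the scores predicted after block m;
    gold[i] is the gold (head, label) pair of dependent i.
    """
    per_block = [
        sum(dependent_nll(scores[i], gold[i]) for i in range(len(gold)))
        for scores in block_scores
    ]
    return sum(per_block) / len(per_block)
```

Since every block's predictions contribute to the loss, gradients reach the shared block parameters from each application, not only from the last one.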
We apply dropout (Srivastava et al., 2014) to the raw inputs $x_{ij}$ and to each block's output $b_{ij}^{(m)}$ to help prevent overfitting.

Related work
Currently, the most accurate parser in terms of labeled and unlabeled attachment scores is the neural network graph-based dependency parser of Dozat and Manning (2017). Their parser builds token representations with a bidirectional LSTM over word embeddings, followed by head and dependent MLPs. Compatibility between heads and dependents is then scored using a biaffine model, and the highest scoring head for each dependent is selected.
Previously, Chen and Manning (2014) pioneered neural network parsing with a transition-based dependency parser which used features from a fast feed-forward neural network over word, token and label embeddings. Many improved upon this work by increasing the size of the network and using a structured training objective (Weiss et al., 2015; Andor et al., 2016). Kiperwasser and Goldberg (2016) were the first to present a graph-based neural network parser, employing an MLP with bidirectional LSTM inputs to score arcs and labels. Cheng et al. (2016) propose a similar network, except with additional forward and backward encoders to allow for conditioning on previous predictions. Kuncoro et al. (2016) take a different approach, distilling a consensus of many LSTM-based transition-based parsers into one graph-based parser. Ma and Hovy (2017) employ a similar model, but add a CNN over characters as an additional word representation and perform structured training using the Matrix-Tree Theorem. Hashimoto et al. (2017) train a large network which performs many NLP tasks including part-of-speech tagging, chunking, graph-based parsing, and entailment, observing benefits from multitasking with these tasks.
Despite their success in the area of computer vision, in NLP convolutional neural networks have mainly been relegated to tasks such as sentence classification, where each input sequence is mapped to a single label rather than a label for each token (Kim, 2014; Kalchbrenner et al., 2014; Zhang et al., 2015; Toutanova et al., 2015). As described above, CNNs have also been used to encode token representations from embeddings of their characters, which similarly perform a pooling operation over characters. Lei et al. (2015) present a CNN variant where convolutions adaptively skip neighboring words. While the flexibility of this model is powerful, its adaptive behavior is not well-suited to GPU acceleration.
More recently, inspired by the success of deep dilated CNNs for image segmentation in computer vision (Yu and Koltun, 2016), convolutional neural networks have been employed as fast models for tagging, speech generation and machine translation. van den Oord et al. (2016) use dilated CNNs to efficiently generate speech, and Kalchbrenner et al. (2016) describe an encoder-decoder model for machine translation which uses dilated CNNs over bytes in both the encoder and decoder. Strubell et al. (2017) first described the one-dimensional ID-CNN architecture which is the basis for this work, demonstrating its success as a fast and accurate NER tagger. Gehring et al. (2017) report state-of-the-art results and much faster training from using many CNN layers with gated activations as encoders and decoders for a sequence-to-sequence model. While our architecture is similar to the encoder architecture of these models, ours is differentiated by (1) being tailored to smaller-data regimes such as parsing via our iterated architecture and loss, and (2) employing two-dimensional convolutions to model the adjacency matrix of the parse tree. We are the first to our knowledge to use dilated convolutions for parsing, or to use two-dimensional dilated convolutions for NLP.

English PTB Results
We compare our model's labeled and unlabeled attachment scores to those of the neural network graph-based dependency parsers described in Sec. 4. Without enforcing trees at test time, our model performs just under the LSTM-based parser of Kiperwasser and Goldberg (2016), and a few points lower than the state of the art. When we post-process our model's outputs into trees, like all the other models in our table, our results improve to slightly above those of Kiperwasser and Goldberg (2016). We believe our model's relatively poor performance compared to existing models is due to its limited incorporation of context from the entire sentence. While each bidirectional LSTM token representation observes all tokens in the sentence, our reported model observes a relatively small window of only 9 tokens. We hypothesize that this window is not sufficient for producing accurate parses. Still, we believe this is a promising architecture for graph-based parsing, and with further experimentation it could meet or exceed the state of the art while running faster by better leveraging GPU architecture.

Conclusion
We present DIG-CNNs, a fast, end-to-end convolutional architecture for graph-based dependency parsing. Future work will experiment with deeper CNN architectures which incorporate broader sentence context in order to increase accuracy without sacrificing speed.