Transition-based Parsing with Lighter Feed-Forward Networks

We explore whether it is possible to build lighter parsers that are statistically equivalent to their corresponding standard versions, for a wide set of languages showing different structures and morphologies. As a testbed, we use the Universal Dependencies and transition-based dependency parsers trained on feed-forward networks. For these, most existing research assumes a de facto standard set of embedded features and relies on pre-computation tricks to obtain speed-ups. We explore how these features and their size can be reduced, and whether this translates into speed-ups with a negligible impact on accuracy. The experiments show that grand-daughter features can be removed for the majority of treebanks without a significant (negative or positive) LAS difference. They also show how the size of the embeddings can be notably reduced.


Introduction
Transition-based models have achieved significant improvements in the last decade (Nivre et al., 2007; Chen and Manning, 2014; Rasooli and Tetreault, 2015; Shi et al., 2017). Some of them already achieve a level of agreement similar to that of human experts on English newswire texts (Berzak et al., 2016), although this does not generalize to other settings (e.g., lower-resource languages). These higher levels of accuracy often come at higher computational costs (Andor et al., 2016) and lower bandwidths, which can be a disadvantage in scenarios where speed is more relevant than accuracy. Furthermore, running neural models on small devices for tasks such as part-of-speech tagging or word segmentation has become a matter of study (Botha et al., 2017), showing that small feed-forward networks are suitable for these challenges. However, for parsers trained with neural networks, little exploration has been done beyond the application of pre-computation tricks, initially intended for fast neural machine translation (Devlin et al., 2014), at the cost of an affordable increase in memory.
Contribution We explore efficient and light dependency parsers for languages with a variety of structures and morphologies. We rely on neural feed-forward dependency parsers, since their architecture offers a competitive accuracy vs bandwidth ratio and they are also the inspiration for more complex parsers, which rely on similar embedded features, but first processed by bidirectional LSTMs (Kiperwasser and Goldberg, 2016). In particular, we study whether the de facto standard embedded features and their sizes can be reduced without a significant impact on accuracy. Building such models is of help in downstream applications of natural language processing, such as those running on small devices, and also of interest for syntactic parsing itself, as it makes it possible to explore how the same configuration affects different languages. This study is conducted on the Universal Dependencies v2.1, a testbed that allows us to compare a variety of languages annotated following common guidelines. This also makes it possible to extract a robust and fair comparative analysis.

Computational efficiency
The usefulness of dependency parsing is partially due to the efficiency of existing transition-based algorithms, although to date it is an open question which algorithms suit which languages best. To predict projective structures, a number of algorithms that run in O(n) with respect to the length of the input string are available. Broadly speaking, these parsers keep two structures: a stack (containing the words that are waiting for some arcs to be created) and a buffer (containing the words waiting to be processed). The ARC-STANDARD parser (Nivre, 2004) follows a strictly bottom-up strategy, where a word can only be assigned a head (and removed from the stack) once every daughter node has been processed. The ARC-EAGER parser avoids this restriction by including a specific transition for the reduce action. The ARC-HYBRID algorithm (Kuhlmann et al., 2011) mixes characteristics of both. More recent algorithms, such as ARC-SWIFT (Qi and Manning, 2017), have focused on the ability to manage non-local transitions, reducing the limitations of transition-based parsers with respect to graph-based ones (McDonald et al., 2005; Dozat and Manning, 2017), which consider a more global context. To manage non-projective structures, different options are also available. The Covington (2001) algorithm runs in O(n²) in the worst case, comparing the word at the top of the buffer with a subset of the words that have already been processed and deciding whether or not to create a link with each of them. More efficient algorithms such as SWAP (Nivre, 2009) manage non-projectivity by learning when to swap pairs of words involved in a crossing arc, transforming the problem into a projective one, with expected linear running time. The 2-PLANAR algorithm (Gómez-Rodríguez and Nivre, 2010) decomposes trees into at most two planar graphs, which can be used to implement a parser that runs in linear time.
The NON-LOCAL COVINGTON algorithm (Fernández-González and Gómez-Rodríguez, 2018) combines the wide coverage of the Covington (2001) algorithm with the non-local capabilities of the Qi and Manning (2017) transition system, running in quadratic time in the worst case.

Fast dependency parsing strategies
Despite the advances in transition-based algorithms, dependency parsing is still the bottleneck for many applications. This is due to collateral issues such as the time it takes to extract features and the multiple calls to the classifier that need to be made. In traditional dependency parsing systems, such as MaltParser (Nivre et al., 2007), the oracles are trained with machine learning algorithms, such as support vector machines, and hand-crafted (Huang et al., 2009; Zhang and Nivre, 2011) or automatically optimized sets of features (Ballesteros and Nivre, 2012). The goal is usually to maximize accuracy, which often comes at the cost of bandwidth. In this sense, various efforts have been made to obtain speed-ups. Using linear classifiers might lead to faster parsers, at the cost of accuracy and larger memory usage (Nivre and Hall, 2010). Bohnet (2010) illustrates that mapping the features into weights for a support vector machine is the major issue for the execution time, and introduces a hash-kernel approach to mitigate it. Volokh (2013) focused on optimizing the feature extraction time for the Covington (2001) algorithm, defining the concept of static features, which can be reused through different configuration steps. The concept by itself does not imply an efficiency gain, but it is often employed in conjunction with the reduction of non-static features, which causes a drop in accuracy.
In more modern parsers, the oracles are trained using feed-forward networks (Titov and Henderson, 2007; Chen and Manning, 2014; Straka et al., 2015) and sequential models (Kiperwasser and Goldberg, 2016). To obtain significant speed improvements, it is common to use the pre-computation trick from Devlin et al. (2014), initially intended for machine translation. Broadly speaking, it precomputes the output of the hidden layer for each individual feature and each position in the input vector where it might occur, saving computation time during the test phase at an affordable memory cost. Vacariu (2017) proposes an optimized parser and also includes a brief evaluation of removing features that have a high extraction cost, but the analysis is limited to English and three treebanks. Little analysis has been made, however, on determining whether these features are relevant across a wide variety of languages that show different particularities. Our work follows this line of research. In particular, we focus on feed-forward transition-based parsers, which already offer a very competitive accuracy vs bandwidth ratio. The models used in this work do not use any pre-computation trick, but it is worth pointing out that the insights of this paper could be used in conjunction with it to obtain further bandwidth improvements.
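The pre-computation trick can be sketched in a few lines of NumPy. All shapes below (18 input slots, embedding size 20, hidden size 64, a 100-word cache) are illustrative assumptions, not the configuration of any specific parser:

```python
import numpy as np

# For each slot s of the input and each cached embedding e, the slot's
# contribution to the first hidden layer, W1[:, s*d:(s+1)*d] @ e, is
# computed once offline; at parse time the hidden pre-activation is
# just a sum of cached vectors plus the bias.
rng = np.random.default_rng(0)
d, slots, hidden, vocab = 20, 18, 64, 100
E = rng.standard_normal((vocab, d))            # embedding table
W1 = rng.standard_normal((hidden, slots * d))  # first-layer weights
b1 = rng.standard_normal(hidden)

# Offline: cache[s, w] = contribution of embedding w placed in slot s.
cache = np.einsum('hsd,vd->svh', W1.reshape(hidden, slots, d), E)

def hidden_preactivation(feature_ids):
    # Online: one lookup and sum per slot instead of a full matmul.
    return b1 + sum(cache[s, w] for s, w in enumerate(feature_ids))

# Sanity check against the direct matrix-vector product.
feats = rng.integers(0, vocab, slots)
direct = W1 @ np.concatenate([E[w] for w in feats]) + b1
assert np.allclose(hidden_preactivation(feats), direct)
```

The memory cost is the cache itself (slots × vocabulary × hidden floats), which is why implementations such as Chen and Manning's restrict it to the most frequent words.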
Transition-based dependency parsers whose oracles are trained using feed-forward neural networks have adopted as the de facto standard set of features the one proposed by Chen and Manning (2014) to parse the English and Chinese Penn Treebanks (Marcus et al., 1993;Xue et al., 2005).
We hypothesize that this de facto standard set of features and the size of the embeddings used to represent them can be reduced for a wide variety of languages, obtaining significant speed-ups at the cost of a marginal impact on performance. To test this hypothesis, we perform an evaluation over the Universal Dependencies v2.1, a wide multilingual testbed, to approximate the relevant features over a wide variety of languages from different families.

Methods and Materials
This section describes the parsing algorithms ( §4.1), the architecture of the feed-forward network ( §4.2) and the treebanks ( §4.3).

Transition-based algorithms
Let w = [w_1, w_2, ..., w_|w|] be an input sentence. A dependency tree for w is an edge-labeled directed tree T = (V, A), where V = {0, 1, 2, ..., |w|} is the set of nodes and A ⊆ V × D × V is the set of labeled arcs. Each arc a ∈ A, of the form (i, d, j), corresponds to a syntactic dependency between the words w_i and w_j, where i is the index of the head word, j is the index of the child word and d is the dependency type representing the kind of syntactic relation between them. Each transition configuration is represented as a 3-tuple c = (σ, β, A) where:
• σ is a stack containing the words that are awaiting remaining arcs to be created. In σ|i, i represents the topmost word of the stack.
• β is a buffer containing the words that have not been processed yet (awaiting to be moved to σ). In i|β, i denotes the first word of the buffer.
• A is the set of arcs that have been created.
We rely on two transition-based algorithms: the stack-based ARC-STANDARD (Nivre, 2008) algorithm for projective parsing and its corresponding version with the SWAP operation (Nivre, 2009) to manage non-projective structures. The choice of algorithms is based on their computational complexity, as both run in O(n) time empirically. The set of transitions is shown in Table 1. Let c_i = ([0], β, {}) be the initial configuration; the parser applies transitions until a final configuration is reached.
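A minimal unlabeled sketch of the ARC-STANDARD transitions may help make the configurations concrete. The function names are ours, and dependency labels and the SWAP transition are omitted for brevity:

```python
# Configurations are (stack, buffer, arcs); node 0 is the artificial
# root, and arcs are stored as (head, child) pairs.
def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs):
    # The second-topmost stack word becomes a child of the topmost.
    child = stack.pop(-2)
    arcs.add((stack[-1], child))

def right_arc(stack, buffer, arcs):
    # The topmost stack word becomes a child of the second-topmost.
    child = stack.pop()
    arcs.add((stack[-1], child))

# Parsing "the(1) dog(2) barks(3)" with a hand-written gold sequence:
stack, buffer, arcs = [0], [1, 2, 3], set()
for transition in (shift, shift, left_arc, shift, left_arc, right_arc):
    transition(stack, buffer, arcs)
assert arcs == {(2, 1), (3, 2), (0, 3)} and stack == [0] and not buffer
```

Note that a sentence of n words is parsed in exactly 2n transitions (one SHIFT and one arc action per word), which is where the linear running time comes from.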

Feed-forward neural network
We reproduce the Chen and Manning (2014) architecture and, more particularly, the Straka et al. (2015) version. These two parsers are among the fastest reported architectures for transition-based dependency parsing (using the pre-computation trick from Devlin et al. (2014)) and obtain results close to the state of the art. Let MLP_θ(v) be an abstraction of our multilayer perceptron parametrized by θ. The output for an input v (in this paper, a concatenation of embeddings, as described in §5) is computed as:

MLP_θ(v) = softmax(W_2 · relu(W_1 · v + b_1) + b_2)    (1)

where W_i and b_i are the weight and bias tensors to be learned at the i-th layer, and softmax and relu correspond to the activation functions in their standard form.
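The computation above can be written out as a plain NumPy forward pass. The hidden size (200) and the number of transition classes (3) below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

def mlp(v, W1, b1, W2, b2):
    # softmax(W2 · relu(W1 · v + b1) + b2)
    return softmax(W2 @ relu(W1 @ v + b1) + b2)

rng = np.random.default_rng(0)
v = rng.standard_normal(1860)   # concatenated feature embeddings
W1, b1 = rng.standard_normal((200, 1860)), rng.standard_normal(200)
W2, b2 = rng.standard_normal((3, 200)), rng.standard_normal(3)
p = mlp(v, W1, b1, W2, b2)      # a distribution over transitions
assert np.isclose(p.sum(), 1.0) and (p >= 0).all()
```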

Universal Dependencies v2.1
Universal Dependencies (UD) v2.1 is a set of 101 dependency treebanks for up to 60 different languages. They are labeled in the CoNLL-U format, heavily inspired by the CoNLL format (Buchholz and Marsi, 2006). For each word in a sentence, the following information is available: ID, FORM (the word form), LEMMA, UPOSTAG (universal POS tag, available for all languages), XPOSTAG (language-specific POS tag, available for some languages), FEATS (additional morphosyntactic information, available for some languages), HEAD, DEPREL and other optional columns with additional information.
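As an illustration, a single made-up CoNLL-U token line (the values below are hypothetical, not from any treebank) splits into the ten tab-separated columns just listed:

```python
# ID  FORM  LEMMA  UPOSTAG  XPOSTAG  FEATS  HEAD  DEPREL  DEPS  MISC
line = "2\tdog\tdog\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_"
ID, FORM, LEMMA, UPOSTAG, XPOSTAG, FEATS, HEAD, DEPREL, DEPS, MISC = \
    line.split("\t")
# Unused fields are marked with "_" in the format.
assert (UPOSTAG, HEAD, DEPREL) == ("NOUN", "3", "nsubj")
```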
In this paper we only consider experiments on the unsuffixed treebanks (e.g., UD English is an unsuffixed treebank, while UD English-PUD is a suffixed one). The motivation owes to practical issues and the legibility of tables and discussions.

Experiments
We followed the training configuration proposed by Straka et al. (2015). All models were trained using mini-batches (size = 10) and stochastic gradient descent (SGD) with exponential decay (lr = 0.02, decay computed as lr × e^(−0.2×epoch)). Dropout was set to 50%; with our implementation, dropout was observed to work better than regularization, with less tuning effort. We used internal embeddings, initialized according to a Glorot uniform distribution (Glorot and Bengio, 2010), which are learned together with the oracle during the training phase. Following the same criteria as Straka et al. (2015), we use no external embeddings in the experiments. The aim was to evaluate all parsers under a homogeneous configuration, and high-quality external embeddings may be difficult to obtain for some languages.
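The decay schedule above can be written as a one-line helper (the function name and defaults are ours):

```python
import math

def learning_rate(epoch, base_lr=0.02, k=0.2):
    # lr at epoch e is base_lr * exp(-k * e), as described above.
    return base_lr * math.exp(-k * epoch)

assert learning_rate(0) == 0.02
assert learning_rate(1) < learning_rate(0)   # monotonically decaying
```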
The experiments explore two paths: (1) is it possible to reduce the number of features without a significant loss in terms of accuracy? and (2) is it possible to reduce the size of the embeddings representing those features, also without causing significant loss in terms of accuracy? To evaluate this, we used as baseline the following configuration.

Baseline configuration
This configuration reproduces that of Straka et al. (2015), which is basically a version of the Chen and Manning (2014) parser whose features were specifically adapted to the UD treebanks.

De facto standard features The initial set of features, which we call the de facto standard features, is composed of: the FORM, UPOSTAG and FEATS of the first 3 words in β and the first 3 words in σ; the FORM, UPOSTAG, FEATS and DEPREL (once it has been assigned) of the 2 leftmost and 2 rightmost children of the first 2 words in σ; and the FORM, UPOSTAG, FEATS and DEPREL of the leftmost child of the leftmost child and the rightmost child of the rightmost child of the first 2 words in σ. This makes a total of 18 elements and 66 different features. In the case of the UD treebanks, it is worth noting that for some languages the FEATS features are not available. We considered two strategies for this situation: (1) not to feed any FEATS vector as input, or (2) to feed a dummy input vector representing the FEATS of an element of the tree. The former would be more realistic in a real environment, but we believe the latter offers a fairer comparison of speeds and memory costs, as the input vector is homogeneous across all languages; thus, this is the option we implemented. The dummy vector is expected to be given no relevance by the neural network during the training phase. We also rely on gold UPOSTAGs and FEATS to measure the impact of the reduced features and their reduced size in an isolated environment.

Size of the embeddings The embedding size for the FORM features is set to 50, and for the UPOSTAG, FEATS and DEPREL features it is set to 20. Given an input configuration, the final dimension of the input vector is 1860: 540 dimensions from directly accessible nodes in σ and β, 880 dimensions corresponding to daughter nodes and 440 dimensions corresponding to grand-daughter nodes.
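The input-size arithmetic above can be checked directly, grouping the 18 elements into direct, daughter and grand-daughter nodes:

```python
FORM, OTHER = 50, 20   # embedding sizes (UPOSTAG, FEATS, DEPREL = 20)

# 3 stack + 3 buffer words, each with FORM, UPOSTAG, FEATS (no DEPREL).
direct = 6 * (FORM + 2 * OTHER)
# 2 leftmost + 2 rightmost children of the first 2 stack words,
# each with FORM, UPOSTAG, FEATS and DEPREL.
daughters = 8 * (FORM + 3 * OTHER)
# Leftmost-of-leftmost and rightmost-of-rightmost of the same 2 words.
grand_daughters = 4 * (FORM + 3 * OTHER)

assert (direct, daughters, grand_daughters) == (540, 880, 440)
assert direct + daughters + grand_daughters == 1860   # total input size
```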
Metrics We use LAS (Labeled Attachment Score) to measure performance. To determine whether a gain or loss with respect to the de facto standard features is significant, we used Bikel's randomized parsing evaluation comparator (p < 0.05), a stratified shuffling significance test. The null hypothesis is that the two outputs are produced by equivalent models, so their scores are equally likely. The test first measures the difference obtained for a metric by the two models; it then repeatedly shuffles the scores of individual sentences between the two models and recomputes the metric, measuring how often the new difference is as large as the original one. If this rarely happens, it is an indicator that the outputs are significantly different. Thousands of tokens parsed per second is the metric used to compare the speed of different feature sets. To diminish the impact of running-time outliers, this is averaged across five runs.
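A generic paired randomization test in the spirit of the stratified shuffling just described can be sketched as follows. This is an illustrative re-implementation under our own naming, not Bikel's comparator itself:

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Approximate p-value for the difference between two systems,
    given one score per sentence per system (paired)."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    hits = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the labels are exchangeable,
            # so swap each sentence's scores with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)   # add-one smoothed p-value

# Identical outputs: the difference is never exceeded, p is ~1.
assert randomization_test([0.5] * 20, [0.5] * 20) > 0.9
# Systematically different outputs: p is small.
assert randomization_test([1.0] * 20, [0.0] * 20) < 0.05
```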
Hardware All models were run on the test set on a single thread on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz.
No precomputation trick None of the parsers proposed in this work uses the precomputation trick from Devlin et al. (2014). There is no major reason for this, beyond measuring the impact of the strategies in a simple scenario. We would like to remark that the speed-ups obtained here by reducing the number of features could also be applied to parsers implementing this precomputation trick, in the sense that the feature extraction time would be lower; no further time would be gained in the computation of the hidden activation values, however. In addition, at least in the case of the Chen and Manning (2014) parser, the pre-computation trick is only applied to the 10 000 most common words, so the experiments proposed here are also useful to save memory resources even if the trick is used.

Reducing the number of features
Table 2 shows the impact of ignoring the features that have a larger extraction cost, i.e., daughter and grand-daughter nodes, for both the ARC-STANDARD and SWAP algorithms. It compares three sets of features in terms of performance and speed: (1) the de facto standard features, (2) no grand-daughter (NO-GD) features (excluding every leftmost-of-leftmost and rightmost-of-rightmost feature) and (3) no daughter (NO-GD/D) features (excluding every daughter and grand-daughter feature from nodes of σ).

Impact of using the NO-GD feature set The results show that these features can be removed without causing a significant difference in most cases. In the case of the ARC-STANDARD algorithm, for 48 out of 52 treebanks there is no significant accuracy loss with respect to the de facto standard features. In fact, for 22 treebanks there was a gain with respect to the original set of features, 5 of which were statistically significant. With respect to SWAP, we observe similar tendencies: for 39 out of 52 treebanks there is no loss (or the loss is again not statistically significant). There is, however, a larger number of statistically significant differences, both gains (11) and losses (13). On average, the ARC-STANDARD models trained with these features lost 0.1 LAS points with respect to the original models, while the average speed-up was ∼23%. The models trained with SWAP instead gained 0.15 points, and the bandwidth increased by ∼28%.
Impact of the NO-GD/D features As expected, the results show that removing daughter features in conjunction with grand-daughter features causes a large drop in performance in the vast majority of cases (most of them statistically significant). Due to this, and despite the (also expected) larger speed-ups, we do not consider this set of features in the next section.

Reducing the embedding size of the selected features
We now explore whether, by reducing the size of the embeddings for the FORM, UPOSTAG, FEATS and DEPREL features, the models can achieve better bandwidths without a significant loss of accuracy. We run separate experiments for the ARC-STANDARD and SWAP algorithms, using as starting point the NO-GD feature set, which had a negligible impact on accuracy, as shown in Table 2. Table 3 summarizes the experiments when reducing the size of each embedding from 10% to 50%, at a step of 10 percentage points, for ARC-STANDARD. The results indicate whether the difference in performance is statistically significant with respect to the de facto standard set. In general terms, reducing the size of the embeddings causes a small but consistent drop in performance; for the vast majority of languages, however, this drop is not statistically significant. Reducing the size of the embeddings by a factor of 0.2 was the configuration with the minimum number of significant losses (6), and reducing them by a factor of 0.5 the one with the largest (14). On average, the lightest models lost 0.45 LAS points to obtain a speed-up of ∼40%. Similar tendencies were observed for the non-projective algorithm, whose results when reducing the size of the embeddings by factors of 0.1 and 0.5 can be found in Table 4.
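Under the NO-GD feature set (14 elements: 6 directly accessible nodes plus 8 daughter nodes), the input dimension for a given reduction factor works out as follows (the rounding of the scaled sizes is our assumption):

```python
def no_gd_input_dim(scale=1.0):
    """Input vector size of the NO-GD feature set when the base
    embedding sizes (FORM=50, others=20) are scaled by `scale`."""
    form, other = round(50 * scale), round(20 * scale)
    # 6 direct nodes (FORM, UPOSTAG, FEATS) +
    # 8 daughter nodes (FORM, UPOSTAG, FEATS, DEPREL).
    return 6 * (form + 2 * other) + 8 * (form + 3 * other)

assert no_gd_input_dim(1.0) == 1420   # full-size NO-GD input
assert no_gd_input_dim(0.5) == 710    # halved embeddings
```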

Discussion
Different deep learning frameworks might present implementation details that cause the empirically obtained speed to differ from theoretical expectations. From a theoretical point of view, both tested approaches (§5.2, §5.3) should have a similar impact, as both directly reduce the size of the input to the neural network: the smaller the input size, the lighter and faster the parser. As a side note, with respect to the case of reducing the number of features (§5.2), an additional speed improvement is expected, as fewer features need to be collected. But broadly speaking, the speed obtained by skipping half of the features should be in line with that obtained by reducing the size of the embeddings of the original features by a factor of 0.5.

Table 2: Performance for the (1) de facto standard, (2) NO-GD/D and (3) NO-GD sets of features, when used to train oracles with the ARC-STANDARD and SWAP algorithms. Red cells indicate a significant loss (--) with respect to the baseline, yellow ones a non-significant gain (+)/loss (-) and green ones a significant gain (++).

Table 3: ARC-STANDARD baseline configuration versus different runs with the NO-GD feature set and embedding size reductions from 10% to 50%. See Table 2 for the color scheme definition.
From a practical point of view, in this work we relied on Keras (Chollet et al., 2015). With respect to the part reported in §5.2, the experiments went as expected. Taking the results for the ARC-STANDARD algorithm as an example, using no grand-daughter features reduces the dimension of the input vector from 1860 to 1420, a reduction of ∼23%. The average speed with the de facto standard features was 3.0 thousand tokens parsed per second, while without grand-daughter features it was 3.7, a gain of ∼20%. If we also skip daughter features and reduce the size of the input vector by ∼71%, the speed increases by a factor of 2.5. Similar tendencies were observed for the SWAP algorithm. When reducing the size of the embeddings (§5.3), the obtained speed-ups were, however, lower than those expected in theory. In this sense, an alternative implementation or a different framework could bring these times closer to the theoretical expectation.
Trying other neural architectures is also of high interest, but this is left as an open question for future research. In particular, in the popular BIST-based parsers (Kiperwasser and Goldberg, 2016; de Lhoneux et al., 2017), the input is first processed by a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) that computes an embedding for each token, taking into account its left and right context. These embeddings are then used to extract the features for transition-based algorithms, including the head of different elements and their leftmost/rightmost children. Those features are then fed to a feed-forward network similar to the one evaluated in this work. Thus, the results of this work might be of future interest for this type of parsers too, as the output of the LSTM can be seen as improved and better-contextualized word embeddings.

Conclusion
We explored whether it is possible to reduce the number and size of the embedded features assumed as de facto standard by feed-forward transition-based dependency parsers. The aim was to train efficient and light parsers for a wide variety of languages showing a rich variety of structures and morphologies.
To test the hypothesis we used a multilingual testbed: the Universal Dependencies v2.1. The study considered two transition-based algorithms to train the oracles: the stack-based ARC-STANDARD and its non-projective version obtained by adding the SWAP operation. We first evaluated three sets of features, clustered according to their extraction costs: (1) the de facto standard features that are usually fed as input to feed-forward parsers and include daughter and grand-daughter features, (2) a no grand-daughter feature set and (3) a no grand-daughter/daughter feature set. For the majority of the treebanks we found that feature set (2) did not cause a significant loss, both for the stack-based ARC-STANDARD and the SWAP algorithms. We then took that set of features and reduced the size of the embeddings used to represent each feature, by up to a factor of 0.5. The experiments show that, for both the ARC-STANDARD and the SWAP algorithms, these reductions did not cause, in general terms, a significant loss. As a result, we obtained a set of lighter and faster transition-based parsers that achieve a better accuracy vs bandwidth ratio than the original ones. We observed that these improvements were not restricted to a particular language family or specific morphology.
As future work, it would be interesting to run alternative experiments to see whether reducing the size of embeddings works the same for words as for other features. Moreover, our results are compatible with existing optimizations and can be combined with them to obtain further speed-ups. Related to this, quantized word vectors (Lam, 2018) can save memory and be used to outperform traditional embeddings.