Long-Distance Dependencies Don’t Have to Be Long: Simplifying through Provably (Approximately) Optimal Permutations

Neural models at the sentence level often operate on the constituent words/tokens in a way that encodes the inductive bias of processing the input in a similar fashion to how humans do. However, there is no guarantee that the standard ordering of words is computationally efficient or optimal. To help mitigate this, we consider a dependency parse as a proxy for the inter-word dependencies in a sentence and simplify the sentence with respect to combinatorial objectives imposed on the sentence-parse pair. The associated optimization results in permuted sentences that are provably (approximately) optimal with respect to minimizing dependency parse lengths and that are demonstrably simpler. We evaluate our general-purpose permutations within a fine-tuning schema for the downstream task of subjectivity analysis. Our fine-tuned baselines reflect a new state of the art for the SUBJ dataset and the permutations we introduce lead to further improvements with a 2.0% increase in classification accuracy (absolute) and a 45% reduction in classification error (relative) over the previous state of the art.


Introduction
Natural language processing systems that operate at the sentence level often need to model the interaction between different words in a sentence. This kind of modelling has been shown to be necessary not only in explicit settings where we consider the relationships between words (GuoDong et al., 2005; Fundel et al., 2006) but also in opinion mining (Joshi and Penstein-Rosé, 2009), question answering (Cui et al., 2005), and semantic role labelling (Hacioglu, 2004). A standard roadblock in this process has been trying to model long-distance dependencies between words. Neural models for sentence-level tasks, for example, have leveraged recurrent neural networks (Sutskever et al., 2014) and attention mechanisms (Bahdanau et al., 2015; Luong et al., 2015) as improvements in addressing this challenge. LSTMs (Hochreiter and Schmidhuber, 1997) in particular have been touted as being well-suited for capturing these kinds of dependencies, but recent work suggests that the modelling may be insufficient to various extents (Linzen et al., 2016; Liu et al., 2018; Dangovski et al., 2019). Fundamentally, these neural components do not restructure the challenge of learning long-distance dependencies but instead introduce computational expressiveness as a means to represent and retain inter-word relationships efficiently (Kuncoro et al., 2018).
Models that operate at the sentence level in natural language processing generally process sentences word-by-word in a left-to-right fashion. Some models, especially recurrent models, consider the right-to-left traversal (Sutskever et al., 2014) or a bidirectional traversal that combines both the left-to-right and right-to-left traversals (Schuster and Paliwal, 1997). Other models weaken the requirement of sequential processing by incorporating position embeddings to retain the sequential nature of the data and then use self-attentive mechanisms that do not explicitly model the sequential nature of the input (Vaswani et al., 2017). All such approaches encode the prior that computational processing of sentences should appeal to a cognitively plausible ordering of words.
Nevertheless, in machine translation, reorderings of both the input and output sequences have been considered for the purpose of improving alignment between the source and target languages. Specifically, preorders (permutations of the input sequence) and postorders (permutations of the output sequence) have been well-studied in statistical machine translation (Xia and McCord, 2004; Goto et al., 2012) and have recently been integrated into fully neural machine translation (De Gispert et al., 2015; Kawara et al., 2018). In general, these re-ordering methods assume some degree of supervision (Neubig et al., 2012) and have tried to implicitly maintain the original structure of the considered sequence despite modifying it to improve alignment. Similar approaches have also been considered for cross-lingual transfer in dependency parsing (Wang and Eisner, 2018) based on the same underlying idea of improving alignment.
In this work, we propose a general approach for permuting the words in an input sentence based on the notion of simplification, i.e. reducing the length of inter-word dependencies in the input. In particular, we appeal to graph-based combinatorial optimization as an unsupervised approach for producing permutations that are provably optimal, or provably approximately optimal, in minimizing inter-word dependency parse lengths.
Ultimately, we hypothesize that our simplification-based permutations over input sentences can be incorporated as a lightweight, drop-in preprocessing step for neural models to improve performance for a number of standard sentence-level NLP problems. As an initial case study, we consider the task of sentence-level subjectivity classification. Using the SUBJ dataset (Pang and Lee, 2004), we first introduce baselines that achieve a state-of-the-art 95.8% accuracy and then improve on these baselines with our permutations to reach a new state of the art of 97.5% accuracy.

Limitations
This work considers simplifying inter-word dependencies for neural models. However, we measure inter-word dependencies using dependency parses and therefore rely on an incomplete description of inter-word dependencies in general. Further, we assume the existence of a strong dependency parser, which is reasonably well-founded for English, the language studied in this work. This assumption is required for providing theoretical guarantees regarding the optimality of sentence permutations with respect to the gold-standard dependency parse. In spite of these assumptions, it is still possible for the subsequent neural models to recover from errors in the initial sentence permutations.
Beyond this, we consider dependency parses which graph theoretically are edge-labelled directed trees. However, in constructing optimal sentence permutations, we simplify the graph structure by neglecting edge labels and edge directions. Both of these are crucial aspects of a dependency parse tree and in §6 we discuss possible future directions to help address these challenges.
Most concerningly, this approach alters the order of words in a sentence for the purpose of simplifying one class of dependencies: binary inter-word dependencies marked by dependency parses. However, in doing so, it is likely that other crucial aspects of the syntax and semantics of a sentence that are a function of word order are obscured. We believe the mechanism proposed in §3.3 helps to alleviate this by making use of powerful initial word representations that are made available through recent advances in pretrained contextual representations and transfer learning (Peters et al., 2018; Devlin et al., 2018; Liu et al., 2019).

Model
Our goal is to take a dependency parse of a sentence and use it as a scaffold for permuting the words in the sentence. We begin by describing two combinatorial measures on graphs that we can use to rank permutations of the words in a sentence, and therefore optimize with respect to, in order to find the optimal permutation for each measure. Given the permutation, we then train a model end-to-end on a downstream task and exploit pretrained contextual word embeddings to initialize the word representations for our model.

Input Structure
Given a sentence as an input for some downstream task, we begin by computing a dependency parse for the sentence using an off-the-shelf dependency parser. This endows the sentence with a graph structure corresponding to an edge-labelled directed tree $G^* = (V^*, E^*)$ where the vertices correspond to tokens in the sentence ($V^* = \{w_1, w_2, \ldots, w_n\}$) and the edges correspond to dependency arcs. We then consider the undirected tree $G = (V, E)$ where $V = V^*$ and $E$ is $E^*$ without the edge labels and edge directions.
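The construction of the undirected tree can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the parse is supplied as a hypothetical list `heads`, where `heads[i]` is the index of the head of token `i` and the root is its own head. The function name `undirected_tree` and the toy parse are illustrative and do not come from the paper.

```python
def undirected_tree(heads):
    """Return the undirected, unlabelled edge list of a dependency tree.

    heads[i] is the index of the head of token i; the root is its own head
    (an assumed input convention for this sketch).
    """
    edges = set()
    for child, head in enumerate(heads):
        if child != head:  # skip the root's self-loop
            # Store each edge with the smaller index first to drop direction.
            edges.add((min(child, head), max(child, head)))
    return sorted(edges)


# Hypothetical parse of a three-token sentence: token 0 attaches to token 1,
# token 1 attaches to token 2, and token 2 is the root.
heads = [1, 2, 2]
edges = undirected_tree(heads)
```

A tree over n tokens always yields n - 1 edges, which gives a quick sanity check on parser output before computing any layout.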

Combinatorial Objectives
We begin by defining a linear layout on $G$ to be a bijective, i.e. one-to-one and onto, ordering on the vertices $\pi : V \to \{1, 2, \ldots, n\}$. For a graph associated with a sentence, we consider the identity linear layout $\pi_I$ to be given by $\pi_I(w_i) = i$: the linear layout of vertices is based on the word order in the input sentence. For any given linear layout $\pi$ we can further associate each edge $(u, v) \in E$ with an edge distance $d_{u,v} = |\pi(u) - \pi(v)|$.² By considering the modified dependency parse $G$ alongside the sentence, we recognize that a computational model of the sentence may need to model any given dependency arc $(w_i, w_j) \in E$. As a result, for a model that processes sentences word-by-word, information regarding this arc must be stored for a number of time-steps given by $d_{w_i,w_j} = |\pi_I(w_i) - \pi_I(w_j)| = |i - j|$. This implies that a model may need to store a dependency for a large number of time-steps (a long-distance dependency). We instead consider finding an optimal linear layout $\pi^*$ (that is likely not to be the identity) to minimize these edge distances with respect to two well-studied objectives on linear layouts.
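To make the edge distance concrete, the following sketch computes it for every edge under two layouts. The five-token tree and the permuted layout are hypothetical, and layouts are represented as dictionaries from 0-indexed token indices to 1-indexed positions, which is an assumption of this sketch and not the paper's notation.

```python
def edge_distances(edges, layout):
    """Map each undirected edge (u, v) to |layout[u] - layout[v]|."""
    return {(u, v): abs(layout[u] - layout[v]) for u, v in edges}


# Hypothetical tree over five tokens: token 4 is linked to tokens 0, 1 and 3,
# and token 3 is linked to token 2.
edges = [(0, 4), (1, 4), (2, 3), (3, 4)]

# Identity layout: token i sits at position i + 1 (the original word order).
identity = {i: i + 1 for i in range(5)}

# A permuted layout that moves token 4 to the middle of the sequence.
permuted = {0: 1, 1: 2, 4: 3, 3: 4, 2: 5}

d_identity = edge_distances(edges, identity)
d_permuted = edge_distances(edges, permuted)
```

In this toy case the arc between tokens 0 and 4 spans four time-steps in the original order and two after permutation, which is the kind of shortening the two objectives below are meant to capture.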

Bandwidth Problem
The bandwidth problem on graphs corresponds to finding an optimal linear layout $\pi^*$ under the objective:
$$\pi^* = \operatorname*{argmin}_{\pi} \; \max_{(u,v) \in E} d_{u,v} \tag{1}$$
The bandwidth problem is a well-known problem dealing with linear layouts, with applications in sparse matrix computation (Gibbs et al., 1976) and information retrieval (Botafogo, 1993), and it has been posed in equivalent ways for graphs and matrices (Chinn et al., 1982). For dependency parses, it corresponds to finding a linear layout that minimizes the length of the longest dependency. Papadimitriou (1976) proved the problem was NP-hard, and the problem was further shown to remain NP-hard for trees and even restricted classes of trees (Unger, 1998; Garey et al., 1978). In this work, we use the better linear layout of those produced by the guaranteed $O(\log n)$ approximation of Haralambides and Makedon (1997) and the heuristic of Cuthill and McKee (1969) and refer to the resulting linear layout as $\pi^*_b$.
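As an illustration of the heuristic side of this procedure, the sketch below implements a simplified breadth-first ordering in the spirit of Cuthill and McKee (1969). It is not the paper's implementation: it omits the approximation algorithm of Haralambides and Makedon (1997), assumes a connected graph, and picks the start vertex by minimum degree, which is a common simplification. The function names and the toy tree are hypothetical.

```python
from collections import deque


def bandwidth(edges, layout):
    """Length of the longest edge under a layout (the bandwidth objective)."""
    return max(abs(layout[u] - layout[v]) for u, v in edges)


def cuthill_mckee_layout(n, edges):
    """Simplified Cuthill-McKee ordering for a connected graph on n vertices.

    Breadth-first search from a minimum-degree vertex, visiting unvisited
    neighbours in order of increasing degree.
    """
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    start = min(range(n), key=lambda v: (len(adj[v]), v))
    order, seen, queue = [], {start}, deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in sorted(adj[u], key=lambda w: (len(adj[w]), w)):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    # The k-th visited vertex is assigned position k + 1.
    return {v: k + 1 for k, v in enumerate(order)}


# Hypothetical tree: token 4 is linked to tokens 0, 1 and 3; 3 is linked to 2.
edges = [(0, 4), (1, 4), (2, 3), (3, 4)]
identity = {i: i + 1 for i in range(5)}
layout = cuthill_mckee_layout(5, edges)
```

On this toy tree the heuristic layout halves the longest edge relative to the identity layout. The heuristic carries no optimality guarantee in general, which is why the paper pairs it with an approximation algorithm and keeps the better of the two layouts.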
2 Refer to Díaz et al. (2002) for a survey of linear layouts, related problems, and their applications.

Minimum Linear Arrangement Problem
Similar to the bandwidth problem, the minimum linear arrangement (minLA) problem considers finding a linear layout given by:
$$\pi^* = \operatorname*{argmin}_{\pi} \; \sum_{(u,v) \in E} d_{u,v} \tag{2}$$
While less studied than the bandwidth problem, the minimum linear arrangement problem minimizes the sum of the lengths of the dependency arcs. This may more closely reflect what models require: they need to handle not only the longest dependency well, as in the bandwidth problem, but also all of the other dependencies. Although the problem is NP-hard for general graphs (Garey et al., 1974), it admits polynomial time exact solutions for trees (Shiloach, 1979). We use the algorithm of Chung (1984), which runs in $O(n^{1.585})$, to find the optimal layout $\pi^*_m$.
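The minLA objective can be checked on small inputs with exhaustive search. The sketch below is a brute-force reference and is explicitly not the algorithm of Chung (1984): it enumerates all n! layouts, so it is only usable for very short sentences, for example to validate a faster implementation. The function names and the toy tree are hypothetical.

```python
from itertools import permutations


def minla_cost(edges, layout):
    """Sum of edge lengths under a layout (the minLA objective)."""
    return sum(abs(layout[u] - layout[v]) for u, v in edges)


def brute_force_minla(n, edges):
    """Exhaustively search all n! layouts; feasible only for small n."""
    best_layout, best_cost = None, float("inf")
    for order in permutations(range(n)):
        layout = {v: k + 1 for k, v in enumerate(order)}
        cost = minla_cost(edges, layout)
        if cost < best_cost:
            best_layout, best_cost = layout, cost
    return best_layout, best_cost


# Hypothetical tree: token 4 is linked to tokens 0, 1 and 3; 3 is linked to 2.
edges = [(0, 4), (1, 4), (2, 3), (3, 4)]
identity = {i: i + 1 for i in range(5)}
best_layout, best_cost = brute_force_minla(5, edges)
```

For this tree the optimum is 5 and not 4 because a vertex with three neighbours cannot have all of them adjacent to it in a linear order, so at least one edge must have length 2.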

Downstream Integration
Given a linear layout $\pi$, we can define the associated permuted sentence $s'$ of the original sentence $s = w_1 w_2 \ldots w_n$ where the position of $w_i$ in $s'$ is given by $\pi(w_i)$. We can then train models end-to-end taking the permuted sentences as direct replacements for the original input sentences. However, this approach suffers from the facts that (a) the resulting sentences may have lost syntactic/semantic qualities of the original sentences due to the permutations and (b) existing pretrained embedding methods would need to be re-trained with these new word orders, which is computationally expensive, and pretraining objectives like language modelling may be less sensible given the problems noted in (a). To reconcile this, we leverage a recent three-step pattern for many NLP tasks:
1. Pretrained Word Representations: Represent each word in the sentence using pretrained contextual word representations.
2. Fine-tuned Sentence Representation: Learn a task-specific encoding of the sentence using a task-specific encoder as a fine-tuning step on top of the pretrained word representations.
3. Task Predictions: Generate a prediction for the task using the fine-tuned representation.
As a result, we can introduce the permutation between steps 1 and 2. This means the initial pretrained representations model the sentence using the standard ordering of words and therefore have access to the unchanged syntactic/semantic properties. These properties are diffused into the word-level representations, and therefore the fine-tuning encoder may retrieve them even if they are not observable after the permutation. This allows the focus of the task-specific encoder to shift towards modelling useful dependencies specific to the task.
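The placement of the permutation between the pretrained representations and the task-specific encoder can be sketched as a simple reordering of per-token vectors. This is a minimal sketch under stated assumptions: the vectors are stand-ins for contextual embeddings already computed on the original word order, the layout maps 0-indexed token indices to 1-indexed positions, and `permute_representations` is a hypothetical helper, not a function from the paper or from any library.

```python
def permute_representations(vectors, layout):
    """Reorder per-token items so token i lands at position layout[i].

    Positions in the layout are 1-indexed, matching the linear layouts above.
    """
    permuted = [None] * len(vectors)
    for token_index, position in layout.items():
        permuted[position - 1] = vectors[token_index]
    return permuted


tokens = ["w1", "w2", "w3", "w4"]

# Stand-ins for contextual embeddings computed on the ORIGINAL order (step 1).
vectors = [[0.1, 0.0], [0.2, 0.1], [0.3, 0.2], [0.4, 0.3]]

# A hypothetical layout that swaps the first two and the last two tokens.
layout = {0: 2, 1: 1, 2: 4, 3: 3}

# The permuted sequence is what the task-specific encoder would read (step 2).
encoder_input = permute_representations(vectors, layout)
permuted_tokens = permute_representations(tokens, layout)
```

Because each vector was computed before reordering, it still carries context from the original sentence, which is the property the paragraph above relies on.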

Experiments
Using our approach, we begin by studying how optimizing for these combinatorial objectives affects the complexity of the input sentence as measured using these objective functions. We then evaluate performance on the downstream task of subjectivity analysis and find our baseline model achieves a new state of the art for the dataset which is improved further by the permutations we introduce.
For all experiments, we use the spaCy dependency parser (Honnibal and Montani, 2017) to find the dependency parse. In studying properties of the bandwidth optimal permutation $\pi^*_b$ and the minLA optimal permutation $\pi^*_m$, we compare to two baselines: the unpermuted sentence, i.e. the identity permutation $\pi_I$, and a sentence whose words are ordered using a random permutation $\pi_R$. A complete description of experimental and implementation details is provided in Appendix A.
Our permutations do not change the size or runtime of existing models while implicitly providing models with dependency parse information. The entire preprocessing process, including the computation of permutations for both objectives, takes 21 minutes in aggregate for the 10000 examples in the SUBJ dataset. A complete description of changes in model size, runtime, and convergence speed is provided in Appendix B.

Data and Evaluation
To evaluate the direct effects of our permutations on input sentence simplification, we use 100000 sentences from Wikipedia; to evaluate the downstream impacts we consider the SUBJ dataset (Pang and Lee, 2004) for subjectivity analysis. The subjectivity analysis task requires deciding whether a given sentence is subjective or objective and the dataset is balanced with 5000 subjective and 5000 objective examples. We consider this task as a starting point as it is well-studied and dependency features have been shown to be useful for similar opinion mining problems (Wu et al., 2009).
Figure 1: Example of the sentence permutation along with overlaid dependency parses. Blue indicates the standard ordering, green indicates the bandwidth optimal ordering, and red indicates the minLA optimal ordering. Black indicates the longest dependency arc in the original ordering.
Examples In Figure 1, we present an example sentence and its permutations under $\pi_I$, $\pi^*_b$ and $\pi^*_m$. Under the standard ordering, the sentence has bandwidth 8 and minLA score 22. The bandwidth optimal permutation reduces these to 3 and 17 respectively, and the minLA optimal permutation also improves on both objectives, with scores of 6 and 16 respectively. A model processing the sequence word-by-word may have struggled to retain the long dependency arc linking 'reject' and 'won' and therefore incorrectly deemed that 'actor' was the subject of the verb 'won' as it is the only other viable candidate and is closer to the verb. If this had occurred, it would lead to an incorrect interpretation (here the opposite meaning). While both of the introduced permutations still have 'actor' closer to the verb, the distance between 'reject' and 'won' shrinks (denoted by the black arcs) and similarly the distance between 'unlike' and 'actor' shrinks. These combined effects may help to mitigate this issue and allow for improved modelling. Across the Wikipedia data, we see a similar pattern for the minLA optimal permutations in that they yield improvements on both objectives, but we find the bandwidth optimal permutations on average increase the minLA score, as is shown in Table 1. We believe this is natural given the relationship of the objectives; the longest arc is accounted for in the minLA objective whereas the other arcs do not contribute to the bandwidth cost. We also find the comparison of the standard and random orderings to be evidence that human orderings of words to form sentences (at least in English) are correlated with these objectives, as they are significantly better with respect to these objectives as compared to random orderings. Refer to Figure 3 for a larger example.
Table 2: Accuracy on the SUBJ dataset using the specified ordering of pretrained representations for the fine-tuning LSTM. † indicates prior models that were evaluated using 10-fold cross validation instead of a held-out test set.

Downstream Performance
In Table 2, we present the results on the downstream task. Despite the fact that the random permutation LSTM encoder cannot learn from the word order and is implicitly restricted to permutation-invariant features, the associated model performs comparably with previous state-of-the-art systems, indicating the potency of current pretrained embeddings and specifically ELMo. When there is a deterministic ordering, we find that the standard ordering is the least helpful of the three orderings considered. We see a particularly significant spike in performance when using permutations that are minLA optimal. We conjecture that this may be because minLA permutations improve on both objectives on average and, empirically, we observe they better maintain the order of the original sentence (as can be seen in Figure 1).

Related Work
This work draws upon inspiration from the literature on psycholinguistics and cognitive science. Specifically, dependency lengths and their minimization in natural language have been studied under the dependency length minimization (DLM) hypothesis (Liu, 2008), which posits a bias in human languages towards constructions with shorter dependency lengths. In particular, the relationship between random and natural language orderings of words described in Table 1 has been studied more broadly across 37 natural languages in Futrell et al. (2015). This work, alongside Gildea and Temperley (2010) and work in NLP that has tried to probe for dependency-oriented understanding in neural networks (primarily RNNs), does indicate relationships between specific dependency types and RNN understanding. This includes research considering specific dependency types (Wilcox et al., 2018, 2019a), word-order effects (Futrell and Levy, 2019), and structural supervision (Wilcox et al., 2019b).
Prompted by this, the permutations considered in this work can alternatively be seen as linearizations (Langkilde and Knight, 1998;Filippova and Strube, 2009;Futrell and Gibson, 2015; Puzikov and Gurevych, 2018) of a dependency parse in a minimal fashion which is closely related to Gildea and Temperley (2007); Temperley and Gildea (2018). While such linearizations have not been well-studied for downstream impacts, the usage of dependency lengths as a constraint has been studied for dependency parsing itself. Towards this end, Eisner and Smith (2010) showed that using dependency length can be a powerful heuristic tool in dependency parsing (by either enforcing a strict preference or favoring a soft preference for shorter dependencies).

Future Directions
Graph Structure Motivated by recent work on graph convolutional networks that began with undirected unlabelled graphs (Kipf and Welling, 2016; Zhang et al., 2018) and was extended to include edge direction and edge labels (Marcheggiani and Titov, 2017), we consider whether these features of a dependency parse can also be leveraged in computing an optimal permutation. We argue that bidirectionally traversing the permuted sequence may be sufficient to address edge direction. A natural approach to encode edge labels would be to define a mapping (either learned on an auxiliary objective or tuned as a hyperparameter) from categorical edge labels to numerical edge weights and then consider the weighted analogues of the objectives in Equation 1 and Equation 2.

Improved Objective The objectives introduced in Equation 1 and Equation 2 can be unified by considering the family of cost functions:
$$f_p(\pi) = \Big( \sum_{(u,v) \in E} d_{u,v}^{\,p} \Big)^{1/p} \tag{3}$$
Here, minLA corresponds to $p = 1$ and the bandwidth problem corresponds to $p = \infty$. We can then propose a generalized objective that is the convex combination of the individual objectives, i.e. finding a permutation that minimizes:
$$f^{\alpha}_{1,\infty}(\pi) = \alpha f_1(\pi) + (1 - \alpha) f_\infty(\pi) \tag{4}$$
Setting $\alpha$ to 1 or 0 recovers the minLA and bandwidth objectives respectively. This form of the new objective is reminiscent of Elastic Net regularization in statistical optimization (Zou and Hastie, 2005). Inspired by this parallel, a Lagrangian relaxation of one of the objectives as a constraint may be an approach towards (approximate) optimization.
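The generalized objective can be evaluated directly for any candidate layout. The sketch below assumes the cost family is the p-norm of the edge distances, which is consistent with minLA at p = 1 and bandwidth at p = ∞ as stated above. The function names, the toy tree, and the dictionary representation of a layout are hypothetical.

```python
import math


def f_p(edges, layout, p):
    """p-norm of the edge distances under a layout (assumed form of f_p)."""
    dists = [abs(layout[u] - layout[v]) for u, v in edges]
    if math.isinf(p):
        return max(dists)  # limit of the p-norm as p grows: the longest edge
    return sum(d ** p for d in dists) ** (1.0 / p)


def f_combined(edges, layout, alpha):
    """Convex combination of the minLA (p = 1) and bandwidth (p = inf) costs."""
    return alpha * f_p(edges, layout, 1) + (1 - alpha) * f_p(edges, layout, math.inf)


# Hypothetical tree: token 4 is linked to tokens 0, 1 and 3; 3 is linked to 2.
edges = [(0, 4), (1, 4), (2, 3), (3, 4)]
identity = {i: i + 1 for i in range(5)}

minla_score = f_p(edges, identity, 1)
bandwidth_score = f_p(edges, identity, math.inf)
mixed_score = f_combined(edges, identity, 0.5)
```

This only scores a given layout; finding the layout that minimizes the combined cost is the open optimization question raised in the text.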
Task-specific Permutations The permutations produced by these models are invariant with respect to the downstream task. However, different tasks may benefit from different sentence orders that go beyond task-agnostic simplification. A natural way to model this in neural models is to learn the permutation in a differentiable fashion and train the permutation model end-to-end within the overarching model for the task. Refer to Appendix C for further discussion.

Conclusion
In this work, we propose approaches that permute the words in a sentence to provably minimize combinatorial objectives related to the length of dependency arcs. We find that this is a lightweight procedure that helps to simplify input sentences for downstream models and that it leads to improved performance and state of the art results (97.5% classification accuracy) for subjectivity analysis using the SUBJ dataset.
Acknowledgements I thank Bobby Kleinberg for his tremendous insight into the present and future algorithmic challenges of layout-based optimization problems, Tianze Shi for his pointers towards the dependency length minimization literature in psycholinguistics and cognitive science, and Arzoo Katiyar for her advice on comparing against random sequence orderings and considering future work towards soft/differentiable analogues. I also thank the anonymous reviewers for their perceptive and constructive feedback. Finally, I thank Claire Cardie for her articulate concerns and actionable suggestions with regards to this work and, more broadly, for her overarching guidance and warm mentoring as an adviser.

16 with the test set results reported being from the model checkpoint after epoch 13. We also experimented with changing the LSTM task-specific encoder to be unidirectional but found the results were strictly worse.

B Efficiency Analysis
Model Size The changes we introduce only impact the initial preprocessing and ordering of the pretrained representations for the model. As a result, we make no changes to the number of model parameters and the only contribution to the model footprint is we need to store the permutation on a per example basis. This can actually be avoided in the case where we have frozen pretrained embeddings as the permutation can be computed in advance. Therefore, for the results in this paper, the model size is entirely unchanged.
Runtime The wall-clock training time, i.e. the wall-clock time for a fixed number of epochs, and inference time are unchanged as we do not change the underlying model in any way and the permutations can be precomputed. As noted in the paper, on a single CPU it takes 21 minutes to complete the entire preprocessing process and 25% of this time is a result of computing bandwidth optimal permutations and 70% of this time is a result of computing minLA optimal permutations. The preprocessing time scales linearly in the number of examples and we verify this as it takes 10 minutes to process only the subjective examples (and the dataset is balanced). Figure 2 shows the development set performance for each of the permutation types over the course of the fine-tuning process.

C End-to-End Permutations
In order to approach differentiable optimization for permutations, we must specify a representation. A standard choice that is well-suited for linear algebraic manipulation is a permutation matrix, i.e. $P_\pi \in \mathbb{R}^{n \times n}$, where $P_\pi[i, j] = 1$ if $\pi(i) = j$ and $0$ otherwise. As a result, permutation matrices are discrete, and therefore sparse, in the space of real matrices. As such they are poorly suited for the gradient-based optimization that supports most neural models. A recent approach from vision has considered a generalization of permutation matrices to the associated class of doubly stochastic matrices and then considered optimization with respect to the manifold they define (the Sinkhorn Manifold) to find a discrete permutation (Santa Cruz et al., 2017). This approach cannot be immediately applied to neural models for sentences since the algorithm exploits the fact that images, and therefore permutations of the pixels in an image, are of fixed size between examples. That being said, we ultimately see this as being an important direction of study given the shift from discrete optimization to soft/differentiable alternatives for similar problems in areas such as structured prediction.
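The relationship between permutation matrices and doubly stochastic matrices can be illustrated with a short sketch. It builds a permutation matrix from a layout and applies plain Sinkhorn normalization (alternating row and column rescaling) to a positive matrix. This is a generic illustration and not the method of Santa Cruz et al. (2017); the function names are hypothetical, indices are 0-based here for convenience, and the iteration count is an arbitrary choice.

```python
def permutation_matrix(layout):
    """Build P with P[i][j] = 1.0 if layout[i] == j and 0.0 otherwise."""
    n = len(layout)
    P = [[0.0] * n for _ in range(n)]
    for i, j in layout.items():
        P[i][j] = 1.0
    return P


def sinkhorn(M, iterations=50):
    """Alternately normalise the rows and columns of a positive square matrix.

    The result approaches a doubly stochastic matrix, a continuous relaxation
    of a permutation matrix.
    """
    M = [list(map(float, row)) for row in M]
    n = len(M)
    for _ in range(iterations):
        for i in range(n):  # make every row sum to one
            s = sum(M[i])
            M[i] = [x / s for x in M[i]]
        for j in range(n):  # make every column sum to one
            s = sum(M[i][j] for i in range(n))
            for i in range(n):
                M[i][j] /= s
    return M


P = permutation_matrix({0: 1, 1: 0, 2: 2})
S = sinkhorn([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
row_sums = [sum(row) for row in S]
col_sums = [sum(S[i][j] for i in range(3)) for j in range(3)]
```

Every permutation matrix is already doubly stochastic, so the relaxed matrices form a continuous set that contains the discrete permutations. The matrix dimension still depends on sentence length, which is the variable-size obstacle noted above.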
Figure 3: Additional example sentence with sentence permutations and overlaid dependency parses. Blue indicates the standard ordering, green indicates the bandwidth optimal ordering, and red indicates the minLA optimal ordering. Black indicates the longest dependency arc in the original ordering. Example sentence (standard ordering): "She , among others excentricities , talks to a small rock , Gertrude , like if she was alive".