Query-focused Sentence Compression in Linear Time

Search applications often display shortened sentences which must contain certain query terms and must fit within the space constraints of a user interface. This work introduces a new transition-based sentence compression technique developed for such settings. Our query-focused method constructs length- and lexically-constrained compressions in linear time, by growing a subgraph in the dependency parse of a sentence. This theoretically efficient approach achieves an 11x empirical speedup over baseline ILP methods, while better reconstructing gold constrained shortenings. Such speedups help query-focused applications, because users are measurably hindered by interface lags. Additionally, our technique does not require an ILP solver or a GPU.


Introduction
Traditional study of extractive sentence compression seeks to create short, readable, single-sentence summaries which retain the most "important" information from source sentences. But search user interfaces often require compressions which must include a user's query terms and must not exceed the maximum length permitted by screen space. Figure 1 shows an example.
This study examines the English-language compression problem with such length and lexical requirements. In our constrained compression setting, a source sentence S is shortened to a compression C which (1) must include all tokens in a set of query terms Q and (2) must be no longer than a maximum budgeted character length, b ∈ Z+. Formally, constrained compression maps (S, Q, b) → C, such that C respects Q and b. We describe this task as query-focused compression because Q places a hard requirement on words from S which must be included in C.
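To make the two hard requirements concrete, a candidate compression can be checked mechanically. Below is a minimal Python sketch (the function and variable names are ours, for illustration only):

```python
def respects_constraints(compression_tokens, query_terms, budget):
    """Return True iff a compression C satisfies both hard constraints:
    (1) C includes every token in the query set Q, and
    (2) the detokenized C is at most b characters long."""
    contains_query = all(q in compression_tokens for q in query_terms)
    fits_budget = len(" ".join(compression_tokens)) <= budget
    return contains_query and fits_budget
```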
Existing techniques are poorly suited to constrained compression. While methods based on integer linear programming (ILP) can trivially accommodate such length and lexical restrictions (Clarke and Lapata, 2008; Filippova and Altun, 2013; Wang et al., 2017), these approaches rely on slow third-party solvers to optimize an NP-hard objective, causing user wait time. An alternative LSTM tagging approach (Filippova et al., 2015) does not allow practitioners to specify length or lexical constraints, and requires an expensive graphics processing unit (GPU) to achieve low runtime latency (access to GPUs is a barrier in fields like social science and journalism). These deficits prevent application of existing compression techniques in search user interfaces (Marchionini, 2006; Hearst, 2009), where length, lexical and latency requirements are paramount. We thus present a new stateful method for query-focused compression.

Our approach is theoretically and empirically faster than ILP-based techniques, and more accurately reconstructs gold standard compressions (Knight and Marcu, 2000; Clarke and Lapata, 2008; Filippova et al., 2015; Wang et al., 2017). To our knowledge, this work is the first to consider extractive compression under hard length and lexical constraints. Some methods compress via generation instead of deletion (Rush et al., 2015; Mallinson et al., 2018); our extractive method is intended for practical, interpretable and trustworthy search systems (Chuang et al., 2012), since users might not trust abstractive summaries (Zhang and Cranshaw, 2018), particularly in cases with semantic error. We compare our VERTEX ADDITION approach to ILP-based compression methods (Clarke and Lapata, 2008; Filippova and Altun, 2013; Wang et al., 2017), which shorten sentences using an integer linear programming objective. ILP methods can easily accommodate lexical and budget restrictions via additional optimization constraints, but require worst-case exponential computation. Finally, compression methods based on LSTM taggers (Filippova et al., 2015) cannot currently enforce lexical or length requirements. Future work might address this limitation by applying and modifying constrained generation techniques (Kikuchi et al., 2016; Post and Vilar, 2018; Gehrmann et al., 2018).

Compression via VERTEX ADDITION
We present a new transition-based method for shortening sentences under lexical and length constraints, inspired by similar approaches in transition-based parsing (Nivre, 2003). We describe our technique as VERTEX ADDITION because it constructs a shortening by growing a (possibly disconnected) subgraph in the dependency parse of a sentence, one vertex at a time. This approach can construct constrained compressions with a linear algorithm, leading to 11x lower latency than ILP techniques (§4). To our knowledge, our method is also the first to construct compressions by adding vertexes rather than pruning subtrees in a parse (Knight and Marcu, 2000; Almeida and Martins, 2013; Filippova and Alfonseca, 2015). We assume a boolean relevance model: S must contain Q. We leave more sophisticated relevance models for future work.

Formal description
VERTEX ADDITION builds a compression by maintaining a state (C_i, P_i), where C_i ⊆ S is the set of vertexes added to the compression so far, P_i ⊆ S is a priority queue of candidate vertexes, and i indexes a timestep during compression. Figure 2 shows a step-by-step example.
During initialization, we set C_0 ← Q and P_0 ← S \ Q. Then, at each timestep, we pop some candidate v_i = h(P_i) from the head of P_i and evaluate v_i for inclusion in C_i. (Neighbors of C_i in P_i get higher priority than non-neighbors; we break ties in left-to-right order, by sentence position.) If we accept v_i, we add it to the compression, setting C_{i+1} ← C_i ∪ {v_i}; if we reject v_i, we discard it. (Acceptance decisions are detailed in §4.3.) We continue adding vertexes to C until either P_i is empty or C_i is b characters long. The appendix includes a formal algorithm.
VERTEX ADDITION is linear in the token length of S because we pop and evaluate some vertex from P_i at each timestep, after P_0 ← S \ Q. Additionally, because (1) we never accept v_i if the length of C_i ∪ {v_i} is more than b, and (2) we set C_0 ← Q, our method respects Q and b.
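The appendix's formal algorithm is not reproduced here, but the following Python sketch captures the transitions described above (the data structures and helper names are our assumptions; for clarity, the sketch re-sorts the queue at every step, whereas the paper's priority-queue bookkeeping is what yields the linear-time guarantee):

```python
def vertex_addition(sentence, edges, query, budget, accept):
    """Sketch of VERTEX ADDITION.
    sentence: list of (index, token) pairs in left-to-right order
    edges:    set of frozenset({u, v}) dependency edges over token indexes
    query:    set of token indexes which must appear in the compression
    accept:   callable(v, C, sentence) -> bool, the learned inclusion
              decision p(y_i = 1 | v_i, C_i, P_i, S) > 0.5
    """
    C = set(query)                                  # C_0 <- Q
    P = [i for i, _ in sentence if i not in C]      # P_0 <- S \ Q

    def char_len(vertexes):
        # character length of the compression, detokenized with spaces
        return len(" ".join(t for i, t in sentence if i in vertexes))

    while P and char_len(C) < budget:
        # head of the queue: neighbors of C_i outrank non-neighbors;
        # ties break left-to-right by sentence position
        P.sort(key=lambda v: (not any(frozenset((u, v)) in edges
                                      for u in C), v))
        v = P.pop(0)                                # pop v_i = h(P_i)
        if char_len(C | {v}) <= budget and accept(v, C, sentence):
            C.add(v)                                # C_{i+1} <- C_i + {v_i}
    return [t for i, t in sentence if i in C]
```

Each iteration pops exactly one vertex from P, so the number of inclusion decisions is at most the token length of S.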
Figure 2: A dependency parse of a sentence S, shown across five timesteps of VERTEX ADDITION (from left to right). Each node in the parse is a vertex in S. Our stateful method produces the final compression {A, C, B, E} (rightmost). At each timestep, each candidate v_i is boxed; rejected candidates ¬C_i are unshaded.

Evaluation
We observe the latency, readability and token-level F1 score of VERTEX ADDITION, using a standard dataset (Filippova and Altun, 2013). We compare our method to an ILP baseline (§2) because ILP methods are the only known technique for constrained compression. All methods have similar compression ratios (shown in appendix), a well-known evaluation requirement (Napoles et al., 2011). We evaluate the significance of differences between VERTEX ADDITION_LR and the ILP with bootstrap sampling (Berg-Kirkpatrick et al., 2012). All differences are significant (p < .01).

Constrained compression experiment
In order to evaluate different approaches to constrained compression, we require a dataset of sentences, constraints, and known-good shortenings which respect the constraints. This means we need tuples (S, Q, b, C_g), where C_g is a known-good compression of S which respects Q and b (§1).
To support large-scale automatic evaluation, we reinterpret a standard compression corpus (Filippova and Altun, 2013) as a collection of input sentences and constrained compressions. The original dataset contains pairs of sentences S and compressions C_g, generated using news headlines. For our experiment, we set b equal to the character length of the gold compression C_g. We then sample a small number of nouns from C_g to form a query set Q, approximating both the observed number of tokens and observed parts of speech in real-world search (Jansen et al., 2000; Barr et al., 2008). Sampled Q include reasonable queries like "police, Syracuse", "NHS" and "Hughes, manager, QPR".
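A sketch of this simulation follows (the helper names and the exact query-size range are our assumptions; the paper samples so as to approximate observed query statistics):

```python
import random

def simulate_example(source_tokens, gold_tokens, gold_pos_tags):
    """Turn a standard (S, C_g) compression pair into a constrained
    tuple (S, Q, b, C_g) for evaluation."""
    # budget b = character length of the gold compression C_g
    b = len(" ".join(gold_tokens))
    # query Q = a small sample of nouns from C_g, mimicking the short,
    # noun-heavy queries observed in real-world search logs
    nouns = [t for t, p in zip(gold_tokens, gold_pos_tags)
             if p.startswith("NN")]
    k = min(len(nouns), random.randint(1, 3))   # assumed size range
    Q = set(random.sample(nouns, k))
    return source_tokens, Q, b, gold_tokens
```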
By sampling queries and defining budgets in this manner, we create 198,570 training tuples and 9,949 test tuples, each of the form (S, Q, b, C_g). Filippova and Altun (2013) define the train/test split. We re-tokenize, parse and tag with CoreNLP v3.8.0 (Manning et al., 2014). We reserve 25,000 training tuples as a validation set.

Model: ILP
We compare our system to a baseline ILP method, presented in Filippova and Altun (2013). This approach represents each edge in a syntax tree with a vector of real-valued features, then learns feature weights using a structured perceptron trained on a corpus of (S, C_g) pairs. Learned weights are used to compute a global compression objective, subject to structural constraints which ensure C is a valid tree. This baseline can easily perform constrained compression: at test time, we add optimization constraints specifying that C must include Q and must not exceed length b.
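For illustration, the two added test-time constraints might look as follows in gurobipy (a sketch under assumed variable names; the learned edge objective and the tree-structure constraints of Filippova and Altun (2013) are omitted):

```python
import gurobipy as gp

def add_hard_constraints(model, x, tokens, Q, b):
    """x[i] is a binary Gurobi variable: keep token i in C or not."""
    # (1) C must include every query term in Q
    for i, tok in enumerate(tokens):
        if tok in Q:
            model.addConstr(x[i] == 1)
    # (2) C must be at most b characters long; we approximate the
    # detokenized length as token characters plus one space per token
    model.addConstr(
        gp.quicksum((len(tok) + 1) * x[i]
                    for i, tok in enumerate(tokens)) <= b + 1)
```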
To our knowledge, a public implementation of this method does not exist. We reimplement it from scratch using Gurobi Optimization (2018), achieving a test-time, token-level F1 score of 0.76 on the unconstrained compression task, lower than the result (F1 = 0.843) reported by the original authors. There are some important differences between our reimplementation and the original approach (described in detail in the appendix). Since VERTEX ADDITION requires Q and b, we can only compare it to the ILP on the constrained (rather than traditional, unconstrained) compression task.

Models: VERTEX ADDITION
VERTEX ADDITION accepts or rejects some candidate vertex v_i at each timestep i. We learn such decisions y_i ∈ {0, 1} using a corpus of tuples (S, Q, b, C_g) (§4.1). Given such a tuple, we can always execute an oracle path shortening S to C_g by first initializing VERTEX ADDITION and then, at each timestep, (1) choosing v_i = h(P_i) and (2) accepting v_i if and only if v_i ∈ C_g. We then use decisions from oracle paths to train two models of inclusion decisions, p(y_i = 1 | v_i, C_i, P_i, S). At test time, we accept v_i if p(y_i = 1) > 0.5.
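Concretely, oracle decisions reduce to gold membership checks. A sketch reusing the vertex_addition function above (the naming is ours):

```python
def oracle_examples(sentence, edges, Q, b, gold_vertexes):
    """Run VERTEX ADDITION along the oracle path: accept v_i iff
    v_i is in the gold compression C_g. Each popped candidate yields
    one training example (state, v_i, y_i) for the decision models."""
    examples = []

    def oracle_accept(v, C, sent):
        y = 1 if v in gold_vertexes else 0
        examples.append((set(C), v, y))
        return y == 1

    vertex_addition(sentence, edges, Q, b, oracle_accept)
    return examples
```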
Model One. Our VERTEX ADDITION_NN model broadly follows neural approaches to transition-based parsing (e.g., Chen and Manning (2014)): we predict y_i using an LSTM classifier with a standard max-pooling architecture (Conneau et al., 2017), implemented with a common neural framework (Gardner et al., 2017). Our classifier maintains four vocabulary embedding matrices, corresponding to the four disjoint subsets C_i ∪ ¬C_i ∪ P_i ∪ {v_i} = V. Each LSTM input vector x_t comes from the appropriate embedding for v_t ∈ V, depending on the state of the compression system at timestep i. The appendix details network tuning and optimization.
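A hedged PyTorch sketch of such a classifier follows (layer sizes, bidirectionality and other hyperparameters are our guesses, not the paper's tuned values):

```python
import torch
import torch.nn as nn

class VertexAdditionNN(nn.Module):
    # the four disjoint token sets at timestep i:
    # accepted (C_i), rejected (not C_i), queued (P_i), candidate ({v_i})
    N_SETS = 4

    def __init__(self, vocab_size, emb_dim=100, hidden=128):
        super().__init__()
        # one vocabulary embedding matrix per token set
        self.embeddings = nn.ModuleList(
            nn.Embedding(vocab_size, emb_dim) for _ in range(self.N_SETS))
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, token_ids, set_ids):
        # token_ids, set_ids: 1-D tensors over the sentence; each token
        # is embedded with the matrix of the set it occupies at timestep i
        vecs = torch.stack([self.embeddings[int(s)](t)
                            for t, s in zip(token_ids, set_ids)])
        h, _ = self.lstm(vecs.unsqueeze(0))        # (1, len, 2 * hidden)
        pooled = h.max(dim=1).values               # standard max-pooling
        return torch.sigmoid(self.out(pooled))     # p(y_i = 1 | state)
```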
Model Two. Our VERTEX ADDITION_LR model uses binary logistic regression, with three classes of features.
Edge features describe the properties of the edge (u, v_i) between v_i ∈ P_i and u ∈ C_i. We use the edge-based feature function from Filippova and Altun (2013), described in detail in the appendix. This allows us to compare the performance of a vertex addition method based on local decisions with an ILP method that optimizes a global objective (§4.5), using the same feature set.
Stateful features represent the relationship between v_i and the compression C_i at timestep i. Stateful features include information such as the position of v_i in the sentence, relative to the right-most and left-most vertexes in C_i, as well as history-based information such as the fraction of the character budget used so far. Such features allow the model to reason about which sort of v_i should be added, given Q, S and C_i.
Interaction features are formed by crossing all stateful features with the type of the dependency edge governing v_i, as well as with indicators identifying whether u governs v_i, whether v_i governs u, or whether there is no edge (u, v_i) in the parse.
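A simplified sketch of the latter two feature classes above (the feature names are illustrative, not the exact feature set):

```python
def stateful_features(v, C, budget, chars_used):
    """Stateful features: where v_i sits relative to the current
    compression C_i, plus history such as budget consumption."""
    return {
        "right_of_C": float(v > max(C)),     # past the right-most vertex
        "left_of_C": float(v < min(C)),      # before the left-most vertex
        "budget_used": chars_used / budget,  # fraction of b used so far
    }

def interaction_features(stateful, dep_type, u_governs_v, v_governs_u):
    """Interaction features: cross each stateful feature with the
    dependency type governing v_i and with attachment indicators."""
    crossed = {}
    for name, val in stateful.items():
        crossed[f"{name}&dep={dep_type}"] = val
        crossed[f"{name}&u_gov_v={u_governs_v}"] = val
        crossed[f"{name}&v_gov_u={v_governs_u}"] = val
    return crossed
```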

Metrics: F1, Latency and SLOR
We measure the token-level F1 score of each compression method against gold compressions in the test set. F1 is the standard automatic evaluation metric for extractive compression (Filippova et al., 2015; Klerke et al., 2016; Wang et al., 2017).
In addition to measuring F1, researchers often evaluate compression systems with human importance and readability judgements (Knight and Marcu, 2000; Filippova et al., 2015). In our setting, Q determines the "important" information from S, so importance evaluations are inappropriate. To check readability, we use the automated readability metric SLOR (Lau et al., 2015), which correlates with human judgements (Kann et al., 2018).
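For reference, SLOR normalizes a sentence's language model log-probability by its unigram log-probability and length, so that rare words are not unduly penalized; a minimal sketch:

```python
def slor(lm_logprob, unigram_logprob, n_tokens):
    """SLOR (Lau et al., 2015): length-normalized language model
    log-probability, corrected for unigram (word rarity) effects.
    Higher values indicate more fluent, readable text."""
    return (lm_logprob - unigram_logprob) / n_tokens
```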
We evaluate theoretical gains from VERTEX ADDITION (Table 1) by measuring empirical latency. For each compression method, we sample and compress N = 300,000 sentences, and record the runtime (in milliseconds per sentence). We observe that runtimes are distributed log-normally (Figure 3), and we thus summarize each sample using the geometric mean. ILP and VERTEX ADDITION_LR share edge feature extraction code to support fair comparison. We test VERTEX ADDITION_NN using a CPU: the method is too slow for use in search applications in areas without access to specialized hardware (Table 2). The appendix further details latency and SLOR experiments.
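Because runtimes are log-normal, the geometric mean (rather than the arithmetic mean) is the natural summary statistic; a one-line sketch:

```python
import numpy as np

def geometric_mean(runtimes_ms):
    """exp of the mean log runtime: the standard summary statistic
    for log-normally distributed latencies."""
    return float(np.exp(np.mean(np.log(runtimes_ms))))
```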

Analysis: ABLATED & RANDOM
For comparison, we implement an ABLATED vertex addition method, which learns inclusion decisions using only edge features from Filippova and Altun (2013). ABLATED has a lower F1 score than ILP, which uses the same edge-level information to optimize a global objective: adding stateful and interaction features (i.e., VERTEX ADDITION_LR) improves F1 score. Nonetheless, strong performance from ABLATED hints that edge-level information alone (e.g., dependency type) can mostly guide acceptance decisions.
We also evaluate a RANDOM baseline, which accepts each v_i randomly, in proportion to the empirical rate of p(y_i = 1) across training data. RANDOM achieves reasonable F1 because (1) C_0 = Q ⊆ C_g and (2) F1 correlates with compression rate (Napoles et al., 2011), and b is set to the length of C_g.

Future work: practical compression
This work presents a new method for fast query-focused sentence compression, motivated by the need for query-biased snippets in search engines (Tombros and Sanderson, 1998; Marchionini, 2006). While our approach shows promise in simulated experiments, we expect that further work will be required before the method can be employed for practical, user-facing search. To begin, both our technique and our evaluation ignore the conventions of search user interfaces, which typically display missing words using ellipses. This convention is important, because it allows snippet systems to transparently show users which words have been removed from a sentence. However, we observe that some well-formed compressions are difficult to read when displayed in this format. For instance, the sentence "Aristide quickly fled Haiti in September 1991" can be shortened to the well-formed compression "Aristide fled in 1991." But this compression does not read fluidly when using ellipses ("Aristide...fled...in...1991"). Human experiments aimed at enumerating the desirable and undesirable properties of compressions displayed in ellipse format (e.g., should compressions minimize the number of ellipses?) could help guide user-focused snippet algorithms in future work.
Our method also assumes access to a reliable dependency parse, and ignores any latency penalties incurred from parsing. In practical settings, both assumptions are unreasonable. Like other NLP tools, dependency parsers often perform poorly on out-of-domain text (Bamman, 2017), and users looking to quickly investigate a new corpus might not wish to wait for a parser. Faster approaches based on low-latency part-of-speech tagging, or more cautious approaches based on syntactic uncertainty (Keith et al., 2018), each offer exciting possibilities for additional research.
Our approach also assumes that a user already knows a reasonable b and a reasonable Q for a given sentence S. (Recall that we simulate b and Q based on the well-formed shortening C_g; see §4.1.) However, in some cases, there is no well-formed shortening of S which respects the requirements. For instance, if Q = "Kennedy" and b = 15, there is no reasonable shortening of the toy sentence "Kennedy kept running", because the compressions "Kennedy kept" and "Kennedy running" are not well-formed. We look forward to investigating which (Q, S, b) triples will never return well-formed compressions in later work.
Finally, some shortenings will modify the meaning of the source sentence, but we ignore this important complication in this initial study. In the future, we hope to apply ongoing research into textual entailment (Bowman et al., 2015; Pavlick and Callison-Burch, 2016; McCoy and Linzen, 2018) to develop semantically-informed approaches to the task.

Conclusion
We introduce a query-focused VERTEX ADDITION_LR method for search user interfaces, with much lower theoretical complexity (and empirical runtimes) than baseline techniques. In search applications, such gains are non-trivial: real users are measurably hindered by interface lags (Nielsen, 1993; Liu and Heer, 2014). We hope that our fast, query-focused method better enables snippet creation at the "pace of human thought" (Heer and Shneiderman, 2012).
Figure 1: A search user interface (boxed, top) returns a snippet consisting of three compressions which must contain a user's query Q (bold) and must not exceed b = 75 characters in length. The third compression C was derived from source sentence S (italics, bottom).

[Figure 1 content: the snippet's compressions are "announced increase in the price of gas sold to Ukraine", "Ukraine's dependence on Gazprom left the country vulnerable", and "Kremlin-backed Gazprom transports gas to Europe through Ukraine"; the source sentence S is "Gazprom the Russian state gas giant announced a 40 percent increase in the price of natural gas sold to Ukraine which is heavily dependent on Russia for its gas supply."]

Table 2: Test results for constrained compression. Latency is the geometric mean of observed runtimes (in milliseconds per sentence). VERTEX ADDITION_LR achieves the highest F1, and also runs 10.73 times faster than the ILP. Differences between all scores for VERTEX ADDITION_LR and ILP are significant (p < .01).