SCAR: Sentence Compression using Autoencoders for Reconstruction

Sentence compression is the task of shortening a sentence while retaining its meaning. Most methods proposed for this task rely on labeled or paired corpora (containing pairs of verbose and compressed sentences), which is often expensive to collect. To overcome this limitation, we present a novel unsupervised deep learning framework (SCAR) for deletion-based sentence compression. SCAR is primarily composed of two encoder-decoder pairs: a compressor and a reconstructor. The compressor masks the input, and the reconstructor tries to regenerate it. The model is entirely trained on unlabeled data and does not require additional inputs such as explicit syntactic information or optimal compression length. SCAR’s merit lies in the novel Linkage Loss function, which correlates the compressor and its effect on reconstruction, guiding it to drop inferable tokens. SCAR achieves higher ROUGE scores on benchmark datasets than the existing state-of-the-art methods and baselines. We also conduct a user study to demonstrate the application of our model as a text highlighting system. Using our model to underscore salient information facilitates speed-reading and reduces the time required to skim a document.


Manish Shrivastava
IIIT Hyderabad
m.shrivastava@iiit.ac.in

Introduction
Our fast-paced lifestyle precludes us from reading verbose and lengthy documents. How about a system that highlights the salient content for us (as shown in Fig. 1)? We model this problem as the well-known sentence compression task. Sentence compression aims to generate a shorter representation of the input that captures its gist and preserves its intent. Compression algorithms are broadly classified as abstractive and extractive. Extractive compression or deletion-based algorithms only select relevant words from the input, whereas abstractive compression algorithms also allow paraphrasing.

Figure 1: An example of a system that highlights the salient content, allowing the user to skim through the document quickly.
In the past, compression approaches have revolved around statistical methods (Knight and Marcu, 2000) and syntactic rules (McDonald, 2006). Current state-of-the-art methods model the problem as a sequence-to-sequence learning task (Filippova et al., 2015). Although these methods perform well, they require massive parallel training datasets that are difficult to collect (Filippova and Altun, 2013). Recently, unsupervised approaches have been explored to overcome this limitation. Fevry and Phang (2018) model compression as a denoising task but barely reach the baselines. Baziotis et al. (2019) propose SEQ3, an autoencoder which uses a Gumbel-softmax to represent the distribution over summaries. However, a qualitative analysis of their outputs shows that SEQ3 mimics the lead baseline.
In this work, we present an unsupervised deep learning framework (SCAR) for deletion-based sentence compression. SCAR is composed of a compressor and a reconstructor. For each word in the input, the compressor determines whether or not to include it in the compression. A length loss restricts the compression length. The reconstructor tries to regenerate the input using the words retained by the compressor. A reconstruction loss motivates the compressor to include words that aid in reconstruction. However, without an additional loss to govern word masking, the network fails to converge. We introduce a novel linkage loss that ties together the compressor and the reconstructor. It penalizes the network if a) it decides to drop a word but is unable to reconstruct it, or b) it decides to include a word which it could reconstruct easily.

Related Work
Early compression algorithms were formulated using strong linguistic priors and language heuristics (Jing, 2000; Knight and Marcu, 2002; Dorr et al., 2003; Cohn and Lapata, 2008). McDonald (2006) uses syntactic evidence to condition the output of the model. Berg-Kirkpatrick et al. (2011) prune dependency edges to remove constituents for compression.
Deep learning-based approaches have gained popularity owing to their success in core NLP tasks such as machine translation (Bahdanau et al., 2014). Filippova et al. (2015) propose an RNN-based encoder-decoder network for deletion-based compression. Although this approach achieves superior performance over metric-based approaches, a large number of paired sentences is needed to train the network.
The first attempt to reduce the dependence on paired corpora for deletion-based deep learning compression models was made by Miao and Blunsom (2016). They train separate compressor and reconstruction models to allow for both supervised and unsupervised training. The compressor consists of a discrete variational autoencoder. The model is trained end-to-end using the REINFORCE algorithm. However, the reported results still use a sizeable amount of labeled data.
Recent approaches have sought completely unsupervised solutions. Fevry and Phang (2018) use a denoising autoencoder (DAE) for sentence compression. The input sentence is shuffled and extended to add noise. DAE tries to reconstruct the original denoised sentence from the noisy input. An additional signal is needed to specify the output length. At test time, the sentence is fed to the model without any noise. In an attempt to denoise the input, the network generates a compressed output. However, the model often fails to capture the information present in the input and is barely able to reach the baselines.
SEQ3 (Baziotis et al., 2019) is an autoencoder that uses a Gumbel-softmax to represent the distribution over summaries. A compressor generates a summary, and a reconstructor tries to reconstruct the input using the summary. A pre-trained language model acts as a prior to incentivize the compressor to produce human-readable summaries. An additional topic loss is required to ensure that the summary contains relevant words, making the model non-generic and fine-tuned to the domain. A qualitative analysis of the outputs shows that SEQ3 merely mimics the lead baseline and generates compressions by blindly copying a prefix of the input.

SCAR
SCAR is composed of two encoder-decoder pairs: compressor C and reconstructor R, as shown in Fig. 2. Given an input sentence s = (w_1, w_2, ..., w_k) containing k words, C generates an indicator vector I_v = (I_v1, I_v2, ..., I_vk) which indicates the presence/absence of each word in the summary. The summary is represented as s' = s ⊙ I_v, where ⊙ denotes element-wise multiplication. Therefore, words corresponding to I_vi ≈ 0 are effectively skipped. The network tries to reconstruct the input sentence from s'.
Formally, the network tries to find an I_v* such that the probability p(s | s ⊙ I_v) is maximized and Σ_{t=1}^{k} I_vt is minimized, jointly. The probability p(s | s ⊙ I_v) can be decomposed further as shown in Eq. (1):

p(s | s ⊙ I_v) = ∏_{t=1}^{k} p(w_t | w_{<t}, s ⊙ I_v)    (1)

For every word in the sentence, we learn a 300-dimensional embedding initialized with GloVe (Pennington et al., 2014). These embeddings are sequentially fed as input to the Sentence Encoder (E_s), composed of a bi-LSTM. The input is fed forwards and backwards. The hidden states are a concatenation of the forward and backward states. The sentence representation is obtained from the final hidden state of E_s (i.e., h_e1). The Indicator Extraction Module (IEM), a bi-LSTM decoder, is initialized using h_e1. The output of this decoder at each time step is passed through a network of two fully connected layers to generate a single indicator value. We intend this value to be close to either one or zero, denoting the presence/absence of each word from the summary.
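The masking step described above can be sketched in plain Python. This is an illustrative toy (list-based vectors and the function name `mask_sentence` are our own, not the paper's code): each word embedding is scaled by its indicator value, so words with I_vi ≈ 0 contribute nothing to the summary encoding.

```python
# A minimal sketch of the masking step s' = s (element-wise) I_v.
# `embeddings` is a list of word vectors; `indicator` is the IEM output I_v.
def mask_sentence(embeddings, indicator):
    """Scale each word vector by its indicator value; I_vi ~ 0 erases word i."""
    return [[x * iv for x in vec] for vec, iv in zip(embeddings, indicator)]

# Keep words 1 and 3 of a 4-word sentence with toy 2-d embeddings
sentence = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
indicator = [1.0, 0.0, 1.0, 0.0]
masked = mask_sentence(sentence, indicator)  # words 2 and 4 become zero vectors
```

Because the indicator values are continuous during training, masking stays differentiable; hard 0/1 decisions are only taken at test time.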
The masked sentence, s' = s ⊙ I_v, is encoded using the Summary Encoder (E_s'), composed of a bi-LSTM. The Summary Decoder (D_s), also a bi-LSTM, is initialized using the final hidden state of E_s' (h_e2). This decoder aims to regenerate the input sentence s from s'. This motivates the IEM to generate I_v such that s can be easily reconstructed. The output at each time step in D_s is fed to a dense layer, W_s, which computes a distribution over the vocabulary from the decoder's hidden states.

Loss functions
Compression Length loss (L_len) is used to constrain the summary length. It is calculated from the output of the IEM as shown in Eq. (2). Len(s') is the sum of the elements of I_v. We set r = 0.4 in our experiments.
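Since Eq. (2) is not reproduced here, the following is only one plausible instantiation of the length constraint: a hinge penalty that fires when the retained fraction of the input exceeds the target ratio r. The hinge form is our assumption; the definition of Len(s') as the sum of I_v follows the text.

```python
def length_loss(indicator, r=0.4):
    # Len(s') is the sum of the elements of I_v (per the text).
    # The hinge form below -- penalizing only compressions that retain more
    # than a fraction r of the input -- is an assumed form of Eq. (2).
    k = len(indicator)
    return max(0.0, sum(indicator) / k - r)
```

With r = 0.4, a summary that keeps half the words incurs a small penalty, while one that keeps a quarter incurs none.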
Sentence Reconstruction loss (L_rec) is applied to ensure that s' contains enough information to reconstruct s. It is calculated from the output of D_s as shown in Eq. (3).
To ease reconstruction, L_rec steers the network to keep larger summaries, whereas L_len forces it to cut them down. This tension makes it hard for the model to converge optimally. We introduce a novel Linkage loss (L_lnk), which correlates the indicator vector and its effect on reconstruction. It penalizes the network if a) it decides to mask a word but is unable to reconstruct it, or b) it decides to include a word which it could reconstruct easily. It is applied to the outputs of the IEM and D_s, as shown in Eq. (4).
The variable χ_i ∈ [0, 1], in Eq. (5), is the normalized value of a word's logit in a sentence. It denotes the relative difficulty of decoding word w_i, given w_<i and h_e2. L_lnk is minimized when either a) χ_i = 0 and I_vi = 0 (signifying that w_i is easy to decode and should be dropped) or b) χ_i = 1 and I_vi = 1 (signifying that hard-to-decode words should be retained). The effect of L_lnk can be seen in Fig. 3. The model retains words with a higher χ_i (dark green), whereas words with a lower χ_i (light green) can be inferred during reconstruction and are therefore dropped.
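The two conditions above can be encoded in several ways; the sketch below is one assumed instantiation (the exact form of Eq. (4) is not reproduced in this text). It charges χ_i for every dropped word and (1 − χ_i) for every retained word, which vanishes exactly in the two desired cases.

```python
def linkage_loss(chi, indicator):
    # chi[i] in [0, 1] is the normalized decoding difficulty of word i.
    # Penalize (a) dropping a hard-to-decode word: chi_i * (1 - I_vi), and
    # (b) retaining an easy-to-decode word: (1 - chi_i) * I_vi.
    # The exact form of Eq. (4) is an assumption; this sketch only encodes
    # the two conditions described in the text.
    k = len(chi)
    return sum(c * (1 - iv) + (1 - c) * iv for c, iv in zip(chi, indicator)) / k
```

The loss is zero when the indicator agrees with the decoding difficulty everywhere, and maximal when every decision contradicts it.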
Binarization loss (L_bin) is applied to the output of the IEM, as shown in Eq. (6), to push the values of I_v close to 0 and 1 (since setting them to these hard values directly would introduce non-differentiability). In our experiments, b is set to 5, and a is chosen such that L_bin is always non-negative. At test time, only the words with I_vi > 0.5 are included in the compression.
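One concave penalty that matches the stated constraints (peaked at I_vi = 0.5, zero at 0 and 1, non-negative given a suitable a) is sketched below. The quadratic form and the choice a = b/4 are our assumptions, consistent with "a is such that L_bin is always non-negative"; Eq. (6) itself is not reproduced in this text.

```python
def binarization_loss(indicator, b=5.0):
    # An assumed instantiation of Eq. (6): a concave penalty that peaks at
    # I_vi = 0.5 and vanishes at I_vi in {0, 1}. Choosing a = b/4 makes the
    # loss non-negative everywhere on [0, 1], matching the text's constraint.
    a = b / 4.0
    return sum(a - b * (iv - 0.5) ** 2 for iv in indicator)
```

Minimizing this term drives each indicator value toward a hard 0/1 decision while keeping the objective differentiable.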

Re-weighting Vocabulary Distribution
Due to the nature of Zipf's law (Zipf, 1949), most of the probability mass in the vocabulary distribution output by the Summary Decoder is retained by stopwords. As a result, χ i corresponding to stopwords is much lower compared to content words. This causes the network to blindly drop stopwords and retain most content words. In this case, many content words that may be inferable are not dropped. To remedy this, we introduce Stop Predictor (D stop ), which assigns a score to the next word based on whether it is a stopword or not. When the network believes that the next word is not a stopword, it re-distributes the probability mass from stopwords proportionally among content words and vice-versa.
The word embeddings of s are sequentially fed as input to D_stop, a bi-LSTM decoder. The output of D_stop at each time step is passed through a network of two fully connected layers to generate a single score, y_stop,i ∈ [0, 1]. To train D_stop, we apply L_stp (a mean-squared-error loss against the ground truth) as shown in Eq. (7). The ground truth is obtained from the stopword-list, defined as the collection of the 50 most frequent words (0.25% of the vocabulary size) found in the dataset.
We re-weight the vocabulary distribution using y_stop,i, similar to p_gen in the pointer-generator network (See et al., 2017), as shown in Eq. (8). I_s is a vocabulary-sized vector with the 50 elements of the stopword-list set to 1 and the rest to 0.
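A possible reading of this re-weighting, sketched below, assigns total mass y_stop to the stopword entries and (1 − y_stop) to the content-word entries, renormalizing within each group. This is a hypothetical instantiation of Eq. (8) (function name and the per-group renormalization are our assumptions), meant only to illustrate how probability mass shifts between the two groups.

```python
def reweight(probs, stop_mask, y_stop):
    # probs: vocabulary distribution from the Summary Decoder.
    # stop_mask: I_s (1 for the 50 stopwords, 0 for content words).
    # y_stop: Stop Predictor score for the next word.
    # Hypothetical reading of Eq. (8): assign total mass y_stop to stopwords
    # and (1 - y_stop) to content words, renormalizing within each group.
    stop_mass = sum(p for p, m in zip(probs, stop_mask) if m == 1)
    cont_mass = sum(p for p, m in zip(probs, stop_mask) if m == 0)
    out = []
    for p, m in zip(probs, stop_mask):
        if m == 1:
            out.append(y_stop * p / stop_mass if stop_mass > 0 else 0.0)
        else:
            out.append((1 - y_stop) * p / cont_mass if cont_mass > 0 else 0.0)
    return out
```

When y_stop is low (the next word is likely a content word), stopword probabilities shrink and content-word probabilities grow, which raises χ_i for inferable content words and lets the compressor drop them.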
This re-weighted distribution is plugged into Eq. (5) and used to calculate L_lnk.
The final loss function (L) is a linear combination of the above losses. Since this is an unsupervised approach, the weights are determined experimentally. Initial weights for each loss were selected to normalize the output range of all loss functions. We then performed a grid search in the neighborhood of these initial values to find the weights that maximized the ROUGE scores on the validation set. The weights have been set to 8 (L_len), 1 (L_rec), 5 (L_lnk), 100 (L_bin), and 10 (L_stp) in our experiments.
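The combination is a straightforward weighted sum; the sketch below just fixes the reported weights (the dict keys and function name are illustrative).

```python
# Weighted combination of the five losses, using the weights reported above.
WEIGHTS = {"len": 8, "rec": 1, "lnk": 5, "bin": 100, "stp": 10}

def total_loss(losses):
    # losses: dict mapping a loss name to its scalar value for the batch
    return sum(WEIGHTS[name] * value for name, value in losses.items())
```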

Training
In our experiments, we used the annotated Gigaword corpus (Rush et al., 2015). The model is trained only on the reference section. We only considered sentences between 15 and 40 words in length (3.5M samples). A small portion of the training set (200k samples) was held out for validation. The batch size is set to 128. The vocabulary is restricted to the 20,000 most frequent words in the dataset. All bi-LSTM cells are of size 600, and weights are initialized normally, N(µ = 0, σ = 0.1). The output from the IEM and D_stop is passed through a hidden layer (150 units) and an output layer with ReLU and sigmoid activations, respectively. We use the Adam optimizer (Kingma and Ba, 2014) with lr = 0.001, β1 = 0.9, and β2 = 0.999. Gradients larger than 1.0 are clipped. The model is trained for 5 epochs with early stopping, monitoring performance on the validation set.

We report average ROUGE (1, 2, L) F1 scores (Lin, 2004) obtained by all the models in Table 1. We compare our model with three standard baselines: Prefix (first 8 words for Gigaword / first 75 bytes for DUC), Lead50 (50% of tokens), and All-Text (the entire input). To compare with supervised approaches, we train a baseline Seq2Seq model, similar to Fevry and Phang (2018). Finally, we compare our model with the recent unsupervised approaches, DAE (Fevry and Phang, 2018) and SEQ3 (Baziotis et al., 2019).

Pitfalls of SEQ3
Lead50 achieves the highest ROUGE scores, but it does not make for a viable compression method, as it blindly drops the latter half of the sentence. The scores obtained by SEQ3 are strikingly similar to Lead50. The authors of SEQ3 note that "the model tends to copy the first words of the input sentence in the compressed text". We observed that SEQ3 introduces very little abstractiveness (only 0.001% of the words differ from the input) and copies the first half of the sentence.
To corroborate our findings, we introduce the notion of summary coverage: a measure of how well each position of the input is represented in the compression. We divide the input sentence into equal-sized segments and measure how often each segment is included in the compression. We plot the summary coverage for Lead50, SEQ3, and SCAR, as shown in Fig. 4; a visualization is shown in Fig. 5. Lead50 and SEQ3 only cover the first half (initial segments) of the input, leading to incomplete/incorrect compressions. SCAR has more uniform coverage and represents all segments of the input well, leading to more informative compressions. (Code: DAE, https://github.com/zphang/usc_dae; SEQ3, https://github.com/cbaziotis/seq3.git)

Figure 4: We divide the input sentence into equal-sized segments and measure how often each segment (x-axis) is included in the compression (y-axis).
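The coverage analysis is simple to sketch. The function below (name and segment count are our assumptions; the text does not specify the number of segments) computes the fraction of retained tokens per equal-sized segment of the input, which is what the coverage plot aggregates over the corpus.

```python
def summary_coverage(sentence_len, kept_positions, n_segments=5):
    # Fraction of tokens retained in each equal-sized segment of the input;
    # a sketch of the analysis behind Fig. 4 (the segment count is assumed).
    seg_size = sentence_len / n_segments
    kept = [0] * n_segments
    total = [0] * n_segments
    for pos in range(sentence_len):
        seg = min(int(pos / seg_size), n_segments - 1)
        total[seg] += 1
        if pos in kept_positions:
            kept[seg] += 1
    return [k / t for k, t in zip(kept, total)]

# A Lead50-style compression keeps only the first half of the tokens,
# so its coverage collapses to zero for the later segments.
lead50 = summary_coverage(10, set(range(5)))
```

A lead-biased system shows high coverage only in the initial segments, while a system with uniform coverage retains tokens from all parts of the input.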

Quantitative evaluation
Given the pitfalls of SEQ3, SCAR achieves state-of-the-art performance in unsupervised sentence compression on the Gigaword and DUC datasets. SCAR's R-2 scores on both benchmark sets are low because it tends to drop the inferable portion of bi-grams.

Figure 5: Visualization of summary coverage, obtained by overlaying the Lead50, SEQ3, and SCAR compressions onto the reference: "malaysia 's government on monday announced an immediate ##-million dollar plan to expand roads , build underground bypasses and overhead bridges to ease kuala lumpur 's traffic jams ." (Headline: "malaysia announces ##-million dollar plan to ease kuala lumpur traffic woes")

Figure 6: Example input highlighted by SCAR: "president bill clinton this week unveils a budget proposal offering nearly ### billion dollars in tax relief over the next six years and calling for the elimination of the federal deficit by #### ."

Qualitative evaluation
ROUGE only measures content overlap and does not account for coherence. We conduct a qualitative study to address the known issues with ROUGE (Schluter, 2017) and to evaluate SCAR's effectiveness as a speed-reading system. Human evaluators are asked to match the reference/compression they are shown with the correct headline from a set of 5 options. Three incorrect options are generated by selecting Gigaword headlines that share tokens with the reference. The fifth option is "unsure." Fifteen English-speaking participants were divided into 5 sets. They were shown the reference (1), the reference with SCAR highlighting (2), and compressions generated by SCAR (3), SEQ3 (4), and DAE (5), respectively. Each user was asked to match 10 samples.
An example is shown in Fig. 6. Compressions generated by DAE fail to preserve the meaning and intent of the reference. SEQ3 habitually retains the first half of the input, and the evaluators fail to match the headline if it corresponds to the latter half. Due to collocation, SCAR tends to drop the inferable portion of a bi-gram; for example, "Bill" is retained and "Clinton" is dropped. The average correctness and time scores are reported in Table 2. Compared to the other compressions, SCAR has the highest correctness score. Using SCAR to highlight reduces reading time by 25%.

Conclusion and Future Work
SCAR addresses a significant limitation of sentence compression: the unavailability of labeled data. It outperforms the existing state-of-the-art unsupervised models. Since SCAR learns to drop inferable components of the input and thereby reduces noise, it can be used as a preprocessing step for machine translation and other information retrieval tasks.