A Language Model based Evaluator for Sentence Compression

We present a language-model-based evaluator for deletion-based sentence compression and view this task as a series of deletion-and-evaluation operations using the evaluator. More specifically, the evaluator is a syntactic neural language model that is first built by learning the syntactic and structural collocation among words. Subsequently, a series of trial-and-error deletion operations are conducted on the source sentences via a reinforcement learning framework to obtain the best target compression. An empirical study shows that the proposed model can effectively generate more readable compressions, comparable or superior to several strong baselines. Furthermore, we introduce a 200-sentence test set for a large-scale dataset, setting a new baseline for future research.


Introduction
Deletion-based sentence compression aims to delete unnecessary words from a source sentence to form a short sentence (compression) that remains grammatical and faithful to the underlying meaning of the source sentence. Previous works used either machine-learning-based approaches or syntactic-tree-based approaches to yield the most readable and informative compression (Jing, 2000; Knight and Marcu, 2000; Clarke and Lapata, 2006; McDonald, 2006; Clarke and Lapata, 2008; Filippova and Strube, 2008; Berg-Kirkpatrick et al., 2011; Filippova et al., 2015; Bingel and Søgaard, 2016; Andor et al., 2016; Zhao et al., 2017; Wang et al., 2017). For example, Clarke and Lapata (2008) proposed a syntactic-tree-based method that treats the sentence compression task as an optimization problem solved with integer linear programming, whereas Filippova et al. (2015) viewed the task as a sequence labeling problem and used a recurrent neural network (RNN) with maximum likelihood as the objective function. The latter sets a relatively strong baseline by training the model on a large-scale parallel corpus.
Although an RNN (e.g., a long short-term memory network) can implicitly model syntactic information, it still produces ungrammatical sentences. We argue that this is because (i) the labels (or compressions) are produced automatically by a syntactic-tree-pruning method and thus contain errors caused by parsing mistakes, and (ii) more importantly, the optimization objective of an RNN is a likelihood function based on individual words rather than on the readability (or informativeness) of the whole compressed sentence. A gap therefore exists between the optimization objective and the evaluation. We are thus interested in two questions: (i) can we take the readability of the whole compressed sentence as a learning objective, and (ii) can grammatical errors be corrected through a language-model-based evaluator to yield compressions of better quality?
To answer the above questions, a syntax-based neural language model is trained on large-scale datasets as a readability evaluator. The neural language model is expected to learn correct word collocations in terms of both syntax and semantics. Subsequently, we formulate deletion-based sentence compression as a series of trial-and-error deletion operations through a reinforcement learning framework. The policy network performs either a RETAIN or a REMOVE action on each word to form a compression, and receives a reward (e.g., a readability score) that is used to update the network.
The empirical study shows that the proposed method can produce more readable sentences that preserve the meaning of the source sentences, comparable or superior to several strong baselines. In short, our contributions are two-fold: (i) an effective syntax-based evaluator is built as a post-hoc checker, yielding compressions of better quality according to the evaluation metrics; (ii) a large-scale news dataset with 1.02 million sentence compression pairs is compiled for this task, in addition to 200 manually created compressions. We made it publicly available.

Task and Framework
Formally, deletion-based sentence compression translates word tokens, $(w_1, w_2, \ldots, w_n)$, into a series of ones and zeros, $(l_1, l_2, \ldots, l_n)$, where $n$ refers to the length of the original sentence and $l_i \in \{0, 1\}$. Here, "1" refers to RETAIN and "0" refers to REMOVE. We first converted the word sequence into a dense vector representation through the parameter matrix $E$. In addition to the word embeddings, $(e(w_1), e(w_2), \ldots, e(w_n))$, we also considered the part-of-speech tag of $w_i$ and the dependency relation between $w_i$ and its head word as extra features.
Each part-of-speech tag was mapped into a vector representation, $(p(w_1), p(w_2), \ldots, p(w_n))$, through the parameter matrix $P$, while each dependency relation was mapped into a vector representation, $(d(w_1), d(w_2), \ldots, d(w_n))$, through the parameter matrix $D$. The three vector representations are concatenated, $[e(w_i); p(w_i); d(w_i)]$, as the input to the next part, the policy network. Figure 1 shows a graphical illustration of our model. The policy network is a bi-directional RNN that takes the input $[e(w_i); p(w_i); d(w_i)]$ and yields the hidden states in the forward direction, $(h^f_1, h^f_2, \ldots, h^f_n)$, and the hidden states in the backward direction, $(h^b_1, h^b_2, \ldots, h^b_n)$. Then, the concatenation of the hidden states in both directions, $[h^f_i; h^b_i]$, is followed by a nonlinear layer that turns the output into a binary probability distribution, $y_i = \sigma(W[h^f_i; h^b_i])$, where $\sigma$ is the sigmoid function and $W$ is a parameter matrix.
The policy network continues to sample actions from the binary probability distribution above until the whole action sequence is yielded. In this task, the binary action space is {RETAIN, REMOVE}. We turn the action sequence into the predicted compression, $(w_1, w_2, \ldots, w_m)$, by deleting the words whose action is REMOVE. The predicted compression $(w_1, w_2, \ldots, w_m)$ is then fed into a pre-trained evaluator, which is described in the next section.
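The sampling step described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the retain probabilities are hypothetical values standing in for the output of the policy network, and the function names are chosen here for illustration only.

```python
import random

RETAIN, REMOVE = 1, 0

def sample_actions(retain_probs, rng):
    """Sample one binary action per token from Bernoulli(p_retain)."""
    return [RETAIN if rng.random() < p else REMOVE for p in retain_probs]

def apply_actions(tokens, actions):
    """Keep the tokens whose action is RETAIN, preserving word order."""
    return [w for w, a in zip(tokens, actions) if a == RETAIN]

tokens = "The Dalian shipyard has built two new huge ships".split()
# Hypothetical retain probabilities (assumption): in the model these would
# come from the sigmoid layer on top of the bi-directional RNN.
retain_probs = [0.9, 0.8, 0.95, 0.9, 0.97, 0.6, 0.2, 0.1, 0.96]

actions = sample_actions(retain_probs, random.Random(0))
compression = apply_actions(tokens, actions)
print(" ".join(compression))
```

Because the actions are sampled rather than chosen greedily, repeated calls with different random states explore different candidate compressions, which is what allows the evaluator's reward to distinguish good deletions from bad ones.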

Syntax-based Evaluator
The syntax-based evaluator should assess the degree to which the compressed sentence is grammatical, and it is used as a reward function during the reinforcement learning phase. It needs to satisfy three conditions: (i) grammatical compressions should obtain a higher score than ungrammatical compressions; (ii) given two ungrammatical compressions, it should still be able to discriminate between them through the score; (iii) compressions that lack important parts of the original sentence (such as the primary subject or verb) should receive a greater penalty.
We therefore considered an ad-hoc evaluator, i.e., the syntax-based language model (evaluator-SLM), to meet these requirements. It integrates the part-of-speech tags and the dependency relations in the input, while the output to be predicted is the next word token. We observed that the prediction of the next word can be based not only on the previous word but also on the syntactic components; e.g., in terms of part-of-speech tags, a noun is often followed by a verb rather than an adjective or adverb, and integrating the part-of-speech tag allows the model to learn such correct word collocations. Figure 2 shows a graphical illustration of the evaluator-SLM, where the input is $x_i = [e(w_i); p(w_i); d(w_i)]$, followed by a bi-directional RNN whose last layer is a softmax layer that represents the word probability distribution. Similar to Mousa and Schuller (2017), we added two special tokens, <S> and </S>, to the input so as to stagger the hidden vectors, thus avoiding self-prediction. Finally, we have the following formula as one part of the reward functions in the learning framework.

Figure 2: Graphical illustration of bi-directional recurrent neural network language model.
where $R_{SLM} \in [0, 1]$ and $Y$ is the compression predicted by the policy network.
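As an illustration of a language-model reward with the stated range, the sketch below uses the geometric mean of the per-token probabilities, i.e. the exponential of the average log-probability. This particular form is an assumption made here for illustration; the text only states that $R_{SLM} \in [0, 1]$. The function names and the probability values are hypothetical.

```python
import math

def slm_reward(token_probs):
    """Geometric mean of per-token probabilities: exp(mean(log p)).

    Assumed form, not taken from the text. Each probability must be in (0, 1].
    """
    if not token_probs:
        return 0.0
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_log_prob)

def sentence_perplexity(token_probs):
    """Perplexity-like score exp(-log R) = 1 / R; higher means worse."""
    return 1.0 / slm_reward(token_probs)

# Hypothetical per-token probabilities from a language model (assumption).
fluent = [0.6, 0.5, 0.7, 0.6]
broken = [0.6, 0.05, 0.7, 0.02]
print(slm_reward(fluent), slm_reward(broken))
```

Normalizing by length keeps the score in $[0, 1]$ regardless of how many words are retained, so a single very unlikely transition lowers the reward without short outputs being favored merely for having fewer factors.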
Further, it is noteworthy that the performance comparison should be based on a similar compression rate 1 (CR) (Napoles et al., 2011). A smooth reward function $R_{CR}$ of the compression rate, parameterized by positive integers $a$ and $b$ (e.g., $a = 2$, $b = 2$ could lead the compression rate to 0.5), is therefore also used to attain compressed sentences of similar length.
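One smooth function that behaves as described is $x^a (1 - x)^b$ with $x$ the compression rate, which peaks at $x = a / (a + b)$ and hence at 0.5 for $a = b = 2$. The sketch below rescales it so that the maximum equals 1. The functional form and the rescaling are assumptions made for illustration, not the formula from the text.

```python
def cr_reward(x, a=2, b=2):
    """Assumed smooth length reward x^a * (1 - x)^b, rescaled to peak at 1.

    x is the compression rate in [0, 1]; a and b are positive integers.
    The maximum is reached at x = a / (a + b).
    """
    peak = a / (a + b)
    max_value = (peak ** a) * ((1 - peak) ** b)
    return (x ** a) * ((1 - x) ** b) / max_value

for rate in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(rate, round(cr_reward(rate), 3))
```

Changing the ratio of `a` to `b` moves the preferred compression rate, for example `a=1, b=3` would favor compressions that keep about a quarter of the words.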
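The total reward is $R = R_{SLM} + R_{CR}$. By using policy gradient methods (Sutton et al., 2000), the policy network is updated with the following gradient: $\nabla_\theta J(\theta) = \mathbb{E}\left[ R \cdot \nabla_\theta \log \pi_\theta(a_t \mid S_t) \right]$, where $\theta$ denotes the parameters of the policy network, $a_t \in \{\text{RETAIN}, \text{REMOVE}\}$ is the action taken by the policy network, and $S_t$ refers to the hidden state of the network, $[h^f_t; h^b_t]$ (Section 2.1).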
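For a sigmoid output, the policy-gradient update has a simple closed form: the derivative of $\log \pi(a_t)$ with respect to the pre-sigmoid logit $z_t$ is $a_t - \sigma(z_t)$. The sketch below applies this to the logits only, as a hedged illustration of a REINFORCE-style step; the logits, the reward value and the learning rate are hypothetical, and a real implementation would backpropagate through the whole network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reinforce_logit_grads(logits, actions, reward):
    """Gradient of reward * sum_t log pi(a_t) with respect to each logit.

    For a Bernoulli policy with p = sigmoid(z), d log pi(a) / dz = a - p.
    """
    return [reward * (a - sigmoid(z)) for z, a in zip(logits, actions)]

def ascent_step(logits, grads, lr):
    """One gradient-ascent step on the logits."""
    return [z + lr * g for z, g in zip(logits, grads)]

logits = [0.0, 0.0, 0.0]   # hypothetical pre-sigmoid scores (assumption)
actions = [1, 0, 1]        # sampled RETAIN / REMOVE actions
reward = 1.4               # hypothetical total reward R = R_SLM + R_CR

grads = reinforce_logit_grads(logits, actions, reward)
logits = ascent_step(logits, grads, lr=0.1)
print(logits)
```

A positive reward pushes the probability of each sampled action up, so retained words become more likely to be retained and removed words more likely to be removed; a larger reward produces a proportionally larger push.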

Data
As neural network-based methods require a large amount of training data, we considered, for the first time, using Gigaword 2 , a news domain corpus. More specifically, the first sentence and the headline of each article were extracted. After data cleansing, we compiled 1.02 million sentence and headline pairs (see details here 3 ). It is noteworthy that the headline is not an extractive compression. Further, we asked two near-native English speakers to create extractive compressions for the first 200 sentences of this dataset, which we used as the testing set; the first 1,000 sentences after the testing set are the development set, and the remainder is the training set. To assess the inter-annotator agreement, we computed Cohen's unweighted κ. The computed unweighted κ was 0.423, reaching a moderate agreement level 4 .
The second dataset we used was the Google dataset, which contains 200,000 sentence compression pairs (Filippova et al., 2015). For the purpose of comparison, we used the very first 1,000 sentences as the testing set, the next 1,000 sentences as the development set, and the remainder as the training set.
1 Compression rate is the length of the compression divided by the length of the sentence.
2 https://catalog.ldc.upenn.edu/ldc2011t07
3 https://github.com/code4conference/Data

Comparison Methods
We chose several strong baselines. The first one is the dependency-tree-based method that treats the sentence compression task as an optimization problem solved with integer linear programming 5 . Inspired by Filippova and Strube (2008), Clarke and Lapata (2008), and Wang et al. (2017), we defined the following constraints:
(1) if a word is retained in the compression, its parent should also be retained.
(2) whether a word $w_i$ is retained should partly depend on the word importance score, which is the product of the TF-IDF score and the headline score, $\text{tf-idf}(w_i) \cdot h(w_i)$, where $h(w_i)$ indicates whether a word (limited to nouns and verbs) is also in the headline: $h(w_i) = 5$ if $w_i$ is in the headline; $h(w_i) = 1$ otherwise.
(3) words with the dependency relations ROOT, dobj, nsubj, and pobj should be retained, as they form the skeleton of a sentence.
(4) the sentence length should be greater than $\alpha$ but less than $\beta$.
(5) the depth of the node (word) in the dependency tree, $\lambda \, dep(w_i)$, is also considered.
(6) a word with the dependency relation amod is to be removed.
It is noteworthy that the method is unsupervised.
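The word importance score in constraint (2) can be sketched as follows. The TF-IDF values, the part-of-speech tag names and the function names are hypothetical and chosen for illustration; only the product form and the headline scores of 5 and 1 come from the description above.

```python
def headline_score(word, pos, headline_words):
    """h(w): 5 if a noun or verb also appears in the headline, else 1."""
    if pos in {"NOUN", "VERB"} and word.lower() in headline_words:
        return 5.0
    return 1.0

def importance(word, pos, tfidf, headline_words):
    """Word importance: tf-idf(w) * h(w)."""
    return tfidf.get(word.lower(), 0.0) * headline_score(word, pos, headline_words)

# Hypothetical inputs (assumptions): TF-IDF scores and a tokenized headline.
tfidf = {"shipyard": 0.8, "built": 0.4, "huge": 0.3}
headline = {"shipyard", "ships"}

print(importance("shipyard", "NOUN", tfidf, headline))
print(importance("huge", "ADJ", tfidf, headline))
```

In the full method this score would enter the objective of the integer linear program, alongside the structural constraints listed above.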
The second method is long short-term memory networks (LSTMs), which showed strong promise in sentence compression (Filippova et al., 2015). In that work, the labels were obtained using the dependency tree pruning method (Filippova and Altun, 2013) and the LSTMs were applied in a supervised manner. Following their work, we also treat the labels yielded by our dependency-tree-based method as pseudo labels and employ LSTMs as a baseline.
Furthermore, for a comprehensive comparison, we applied the sequence-to-sequence with attention method, widely used in abstractive text summarization, to sentence compression. Previous works such as Rush et al. (2015) and Chopra et al. (2016) have shown promising results with this framework, although their focus was generation-based summarization rather than extractive summarization. More specifically, the source sequence of this framework is the original sentence, while the target sequence is a series of zeros and ones (zeros represent REMOVE and ones represent RETAIN). Further, we incorporated dependency labels and part-of-speech tag features on the source side of the sequence-to-sequence method.

Training
The embedding size for the word, the part-of-speech tag, and the dependency relation is 128. We employed the vanilla RNN with a hidden size of 512 for both the policy network and the neural language model. The mini-batch size was chosen from [5, 50, 100]. The vocabulary size was 50,000. The learning rate is 2.5e-4 for the neural language model and 1e-5 for the policy network. For policy learning, we used the REINFORCE algorithm (Williams, 1992) to update the parameters of the policy network and find a policy that maximizes the reward. Because starting from a random policy is impractical owing to the high variance, we pre-trained the policy network using pseudo labels in a supervised manner. For the comparison methods, the hyperparameters $\alpha$ and $\beta$ were set to 0.4 and 0.7, respectively, and $\lambda$ was set to 0.5. For reproduction, we released the source code here 6 .
6 https://github.com/code4conference/code4sc

Result and Discussion
This section presents the experimental results on both datasets. As the Gigaword dataset has no ground truth, we evaluated the baselines and our method on the 200-sentence test sets created by two human annotators. For the automatic evaluation, we employed F1 and RASP-F1 (Briscoe and Carroll, 2002) to measure the performance. The latter compares grammatical relations (such as ncsubj and dobj) found in the system compressions with those found in the gold standard, providing a means to measure the semantic aspects of the compression quality. For the human evaluation, we asked two near-native English speakers to assess the quality of 50 compressed sentences out of the 200-sentence test set in terms of readability and informativeness.
Table 3: F1 and RASP-F1 results for the Google dataset.
Our observations are as follows:
(1) As shown in Table 1, our Evaluator-SLM-based method yields a large improvement over the baselines, demonstrating that the language-model-based evaluator is effective as a post-hoc grammar checker for the compressed sentences. This is also validated by the significant improvement in the readability score in the human evaluation.
(2) By comparing annotator 1 with annotator 2 in Table 1, we observed different performances on the two annotated test sets, showing that compressing a text while preserving the meaning of the original sentence is subjective across annotators.
(3) As for the Google news dataset, LSTMs (LSTM+pos+dep) (&3) is a relatively strong baseline, suggesting that incorporating dependency relations and part-of-speech tags may help the model learn the syntactic relations and thus make better predictions. When further applying Evaluator-SLM, only a tiny improvement is observed (&3 vs &4), not comparable to the improvement between #3 and #5. This may be due to the difference in perplexity of our Evaluator-SLM. For the Gigaword dataset with 1.02 million instances, the perplexity of the language model is 20.3, while for the Google news dataset with 0.2 million instances, the perplexity is 76.5.
(4) To further explore the degree to which syntactic knowledge (dependency relations and part-of-speech tags) is helpful to the evaluator (language model), we implemented a naive language model, i.e., Evaluator-LM, which did not include dependency relations and part-of-speech tags as input features. The results show that small improvements are observed on the two datasets (#4 vs #5; &4 vs &5), suggesting that incorporating syntactic knowledge may help the evaluator encourage more unseen but reasonable word collocations.
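The F1 metric used in the automatic evaluation above can be computed over retained tokens, as sketched below. Treating F1 as the harmonic mean of precision and recall over the words kept by the system and by the gold compression is an assumption about the exact protocol, and the label sequences are hypothetical.

```python
def token_f1(pred_labels, gold_labels):
    """F1 over retained tokens; labels are 1 (RETAIN) or 0 (REMOVE)."""
    true_pos = sum(1 for p, g in zip(pred_labels, gold_labels) if p == 1 and g == 1)
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred_labels)
    recall = true_pos / sum(gold_labels)
    return 2 * precision * recall / (precision + recall)

def compression_rate(labels):
    """Length of the compression divided by the length of the sentence."""
    return sum(labels) / len(labels)

# Hypothetical label sequences for one sentence (assumption).
pred = [1, 1, 0, 1]
gold = [1, 0, 0, 1]
print(token_f1(pred, gold), compression_rate(pred))
```

Reporting the compression rate next to F1 matters because a system that retains more words can raise recall without producing a better compression.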

Evaluator Analysis
To further analyze the Evaluator-SLM performance, we used an example sentence, "The Dalian shipyard has built two new huge ships", to observe how the language model scores different word deletion operations. We converted the reward function $R_{SLM}$ to $e^{-\log R_{SLM}}$ for a better observation (similar to "sentence perplexity": the higher the score, the worse the sentence). As shown in Figure 3, deleting the object (#2), verb (#3), or subject (#4) results in a significant increase in "sentence perplexity", implying that the syntax-based language model is highly sensitive to the lack of such syntactic components. Interestingly, when deleting words such as "new" and/or "huge", the score becomes lower, suggesting that the model may prefer short sentences in which unnecessary parts such as amod have been removed. This property makes it well suited to the sentence compression task, which aims to shorten sentences by removing unnecessary words.

Conclusion
We presented a syntax-based language model for the sentence compression task. We employed unsupervised methods to yield labels and used them to train a policy network in a supervised manner. The experimental results demonstrate that the compression can be further improved by a post-hoc language-model-based evaluator, and that our evaluator-enhanced model performs better than or comparably to the baselines on the evaluation metrics on two large-scale datasets.