Higher-Order Syntactic Attention Network for Longer Sentence Compression

A sentence compression method based on LSTMs can generate fluent compressed sentences. However, the performance of this method is significantly degraded when compressing longer sentences, since it does not explicitly handle syntactic features. To solve this problem, we propose a higher-order syntactic attention network (HiSAN) that can handle higher-order dependency features as an attention distribution over LSTM hidden states. Furthermore, to avoid the influence of incorrect parse results, we trained HiSAN by jointly maximizing the probability of a correct output and the attention distribution. Experimental results on the Google sentence compression dataset showed that our method achieved the best performance in F1 as well as ROUGE-1, 2 and L scores (83.2, 82.9, 75.8 and 82.7, respectively). In human evaluation, our method also outperformed the baseline methods in both readability and informativeness.


Introduction
Sentence compression is the task of compressing long sentences into short, concise ones by deleting words. To generate grammatical compressed sentences, many researchers (Jing, 2000; Knight and Marcu, 2000; Berg-Kirkpatrick et al., 2011; Filippova and Altun, 2013) have adopted tree trimming methods. Even though Filippova and Altun (2013) reported the best results on this task, automatic parse errors greatly degrade the performance of these tree trimming methods. Recently, Filippova et al. (2015) proposed an LSTM sequence-to-sequence (Seq2Seq) sentence compression method that can generate fluent sentences without utilizing any syntactic features. Seq2Seq-based sentence compression is therefore a promising alternative to tree trimming.
However, as reported for machine translation (Pouget-Abadie et al., 2014; Koehn and Knowles, 2017), the longer the input sentences are, the worse Seq2Seq performance becomes. We also observed this problem in the sentence compression task. As shown in Figure 1, the performance of Seq2Seq degrades when compressing long sentences (here, Seq2Seq denotes the LSTM-based sentence compression method of Filippova et al. (2015) in the evaluation setting described in Section 4.1). In particular, performance falls significantly once sentence length exceeds 26 words. This is an important problem, because sentences longer than the average sentence length (28 words) account for 42% of the Google sentence compression dataset.
[Figure 3: An example compressed sentence and its dependency tree. The words colored gray represent deleted words.]

As shown in Figure 2, long sentences have deep dependency trees, in which the distance from the root node to words at leaf nodes is long. Therefore, improving compression performance for sentences with such deep dependency trees can help to compress longer sentences.
To deal with sentences that have deep dependency trees, we focus on chains of dependency relationships. Figure 3 shows an example of a compressed sentence with its dependency tree. The topic of this sentence is an import agreement related to electricity. Thus, to generate an informative compression, the compressed sentence must retain the country names; in this example, it should keep the phrase "from Kyrgyz Republic and Tajikistan". Consequently, the compressed sentence must also keep the dependency chain "import"-"resolution"-"signed", because the phrase is a child of this chain. By considering such higher-order dependency chains, a system can produce informative compressions. As the example in Figure 3 shows, tracking a higher-order dependency chain for each word can help to compress long sentences. This paper refers to such dependency relationships as "d-length dependency chains".
To handle d-length dependency chains in LSTM-based sentence compression, we propose the higher-order syntactic attention network (HiSAN). HiSAN computes the deletion probability of a given word based on the d-length dependency chain starting from that word. The d-length dependency chain is represented as an attention distribution learned from automatic parse trees. To alleviate the influence of parse errors in these trees, we learn the attention distribution jointly with the deletion probability.
Evaluation results on the Google sentence compression dataset (Filippova and Altun, 2013) show that HiSAN achieved the best F1, ROUGE-1, 2 and L scores (83.2, 82.9, 75.8 and 82.7, respectively). In particular, HiSAN attained remarkable compression performance on long sentences. In human evaluations, HiSAN also outperformed the baseline methods.

Baseline Sequence-to-Sequence Method
Sentence compression can be regarded as a tagging task: given a sequence of input tokens x = (x_0, ..., x_n), a system assigns an output label y_t, which is one of three types of labels ("keep", "delete", or "end of sentence"), to each input token x_t (1 ≤ t ≤ n).
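As a minimal illustration of this formulation, the following Python sketch (tokens taken from the example input in Table 3, labels purely illustrative) reads a compression off a keep/delete label sequence:

```python
# Minimal illustration of sentence compression as tagging (labels are
# illustrative, not model output): each token gets "keep" or "delete",
# and the compression is the subsequence of kept tokens.
tokens = ["Pakistan", "signed", "a", "resolution", "on", "Monday", "."]
labels = ["keep", "keep", "keep", "keep", "delete", "delete", "keep"]

compressed = " ".join(tok for tok, lab in zip(tokens, labels) if lab == "keep")
print(compressed)  # Pakistan signed a resolution .
```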
The LSTM-based approaches to sentence compression are mostly based on either a bi-LSTM tagging method (Tagger) (Klerke et al., 2016; Wang et al., 2017; Chen and Pan, 2017) or Seq2Seq (Filippova et al., 2015; Tran et al., 2016). Tagger predicts each label independently in a point-estimation manner, whereas Seq2Seq predicts labels by considering the previously predicted labels. Since Seq2Seq is more expressive than Tagger, we built HiSAN on a baseline Seq2Seq model.
Our baseline Seq2Seq is a version of Filippova et al. (2015) extended with bi-LSTM, an input-feeding approach (Luong et al., 2015), and a monotonic hard attention method (Yao and Zweig, 2015; Tran et al., 2016). As described in the evaluation section, this baseline achieved scores comparable to or even better than the state-of-the-art scores reported by Filippova et al. (2015). The baseline Seq2Seq model consists of embedding, encoder, decoder, and output layers.
In the embedding layer, the input tokens x are converted to embeddings e. As reported by Wang et al. (2017), syntactic features are important for learning generalizable embeddings for sentence compression. Following their results, we also introduce syntactic features into the embedding layer. Specifically, we combine the surface token embedding w_i, the POS embedding p_i, and the dependency relation label embedding r_i into a single vector:

e_i = [w_i; p_i; r_i],    (1)

where [] represents vector concatenation and e_i is the embedding of token x_i. The encoder layer converts the embeddings e into a sequence of hidden states h = (h_0, ..., h_n) using a stacked bidirectional LSTM (bi-LSTM):

→h_i = LSTM_{→θ}(e_i, →h_{i−1}),  ←h_i = LSTM_{←θ}(e_i, ←h_{i+1}),  h_i = [→h_i; ←h_i],

where LSTM_{→θ} and LSTM_{←θ} represent the forward and backward LSTM, respectively. The final state of the backward LSTM, ←h_0, is inherited by the decoder as its initial state.
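A rough Python/PyTorch sketch of these two layers follows; the POS and dependency-label inventory sizes are placeholder assumptions, and only the shapes are meant to be instructive:

```python
import torch
import torch.nn as nn

# Sketch of the embedding layer (Eq. 1) and the stacked bi-LSTM encoder.
word_emb = nn.Embedding(23168, 100)  # vocabulary size after <UNK> filtering
pos_emb = nn.Embedding(50, 40)       # hypothetical POS tagset size
rel_emb = nn.Embedding(45, 40)       # hypothetical dependency-label set size

def embed(word_ids, pos_ids, rel_ids):
    # e_i = [w_i; p_i; r_i]: concatenate the three embeddings (Eq. 1)
    return torch.cat([word_emb(word_ids), pos_emb(pos_ids), rel_emb(rel_ids)], dim=-1)

# Stacked bi-LSTM producing hidden states h = (h_0, ..., h_n)
encoder = nn.LSTM(input_size=180, hidden_size=100, num_layers=2,
                  bidirectional=True, batch_first=True)

e = embed(torch.randint(0, 23168, (1, 7)),   # one sentence of 7 tokens
          torch.randint(0, 50, (1, 7)),
          torch.randint(0, 45, (1, 7)))
h, _ = encoder(e)                            # h: (1, 7, 200) encoder states
```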
In the decoder layer, the concatenation of a 3-bit one-hot vector determined by the previously predicted label y_{t−1}, the previous final hidden state d_{t−1} (explained later), and the input embedding of x_t is encoded into the decoder hidden state →s_t by stacked forward LSTMs.
In contrast to the original softmax attention method, we can deterministically focus on one encoder hidden state h_t (Yao and Zweig, 2015) to predict y_t in the sentence compression task (Tran et al., 2016). In the output layer, the label probability is calculated as follows:

P(y_t | y_{<t}, x) = δ_{y_t}^⊤ softmax(W_o d_t),    (5)

d_t = tanh(W_d [→s_t; h_t]),    (6)

where W_o is the weight matrix of the softmax layer, W_d is a weight matrix, and δ_{y_t} is a binary vector in which the y_t-th element is set to 1 and the other elements to 0.
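Concretely, the output layer can be sketched as follows; the tanh combination in Eq. (6) is our reconstruction of the garbled equation, not a verbatim transcription:

```python
import torch
import torch.nn as nn

# Sketch of the output layer under monotonic hard attention: at step t the
# model deterministically attends to encoder state h_t, combines it with the
# decoder state s_t (Eq. 6, reconstructed), and a softmax over the three
# labels ("keep", "delete", "EOS") gives P(y_t | y_<t, x) (Eq. 5).
hidden = 100
W_d = nn.Linear(2 * hidden, hidden, bias=False)  # combines [s_t; h_t] into d_t
W_o = nn.Linear(hidden, 3, bias=False)           # softmax weight matrix W_o

def label_probability(s_t, h_t, y_t):
    d_t = torch.tanh(W_d(torch.cat([s_t, h_t], dim=-1)))  # Eq. (6)
    probs = torch.softmax(W_o(d_t), dim=-1)               # Eq. (5)
    return probs[y_t]  # delta_{y_t} picks out the y_t-th element

p = label_probability(torch.randn(hidden), torch.randn(hidden), y_t=0)
```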

Higher-order Syntactic Attention Network
The key component of HiSAN is its attention module. Unlike the baseline Seq2Seq, HiSAN employs a packed d-length dependency chain as a distribution in the attention module. Section 3.1 explains the packed d-length dependency chain, Section 3.2 describes the network structure of our attention module, and Section 3.3 explains the training method of HiSAN.

Packed d-length Dependency Chain
The probability of a packed d-length dependency chain is obtained from a dependency graph, i.e., an edge-factored dependency score matrix (Hashimoto and Tsuruoka, 2017; Zhang et al., 2017). First, we explain the dependency graph. Figure 4 (a) shows an example. HiSAN represents a dependency graph as an attention distribution generated by the attention module, and the probability of each dependency edge is obtained from this distribution. Figure 4 (b) shows an example of the packed d-length dependency chain. With our recursive attention module, the probability of a packed d-length dependency chain is computed as the sum of the probabilities of all paths obtained by recursively tracking from a word to its d-th ancestor, where the probability of each path is the product of the probabilities of the tracked edges. The chain probability thus represents several d-length dependency chains compactly, which alleviates the influence of incorrect parse results. This is the advantage of using dependency graphs.
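To make the packing concrete, the following NumPy sketch (with a random toy parent matrix, not a trained model) verifies that summing path products over all length-d ancestor paths is exactly a d-th matrix power, which is what lets a single matrix pack many chains:

```python
import numpy as np

# P[t, j]: probability that x_j is the parent of x_t (each row a distribution).
rng = np.random.default_rng(0)
n = 4
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

d = 3
chain = np.linalg.matrix_power(P, d)  # chain[t, j]: x_j is the d-th ancestor of x_t

# Explicit sum over all intermediate paths for one (t, j) pair agrees:
t, j = 2, 0
total = sum(P[t, k1] * P[k1, k2] * P[k2, j]
            for k1 in range(n) for k2 in range(n))
assert np.isclose(chain[t, j], total)
```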

Network Architecture
Figure 5 shows the prediction process of HiSAN. In this figure, HiSAN predicts output label y_7 from the input sentence. The prediction process of HiSAN is as follows.

1. The Parent Attention module calculates P_parent(x_j | x_t, x), the probability of x_j being the parent of x_t, using h_j and h_t. This probability is calculated for all pairs (x_j, x_t). The arcs in Figure 5 show the most probable dependency parent of each child token.
[Figure 5: Prediction process of our higher-order syntactic attention network.]

2. The Recursive Attention module calculates α_{d,t,j}, the probability of x_j being the d-th order parent of x_t (d denotes the chain length), by recursively using P_parent(x_j | x_t, x). α_{d,t,j} is also treated as an attention distribution and is used to calculate γ_{d,t}, the weighted sum of h for each length d. For example, the 3-length dependency chain of word x_7 with the highest probability is x_6-x_5-x_2. The encoder hidden states h_6, h_5 and h_2, which correspond to this dependency chain, are weighted by the calculated parent probabilities α_{1,7,6}, α_{2,7,5} and α_{3,7,2}, respectively, and then fed to the selective attention module.
3. The Selective Attention module calculates a weight β_{d,t} for each γ_{d,t} from its length d ∈ d, where d denotes a group of chain lengths. β_{d,t} is calculated from the encoder and decoder hidden states. The weighted vectors β_{d,t} · γ_{d,t} are summed into Ω_t, the output of the selective attention module.
4. Finally, the calculated Ω_t is concatenated with d_t and input to the output layer.

Details of each module are explained in the following subsections.

Parent Attention Module
Zhang et al. (2017) formalized dependency parsing as the problem of independently selecting the parent of each word in a sentence. They produce a distribution over the possible parents of each child word by using an attention layer on the bi-LSTM hidden states. In a dependency tree, a parent can have more than one child. Under this constraint, dependency parsing is represented as follows. Given a sentence S = (x_0, x_1, ..., x_n), where x_0 denotes the root node, the parent of each token x_t ∈ S \ {x_0} is selected from S \ {x_t}. The probability of token x_j being the parent of token x_t in sentence x is calculated as:

P_parent(x_j | x_t, x) = exp(g(h_j, h_t)) / Σ_k exp(g(h_k, h_t)),

g(h_j, h_t) = v_a^⊤ tanh(U_a h_j + W_a h_t),

where v_a, U_a and W_a are the weight matrices of g. Unlike the attention-based dependency parser, P_parent(x_j | x_t, x) is learned jointly with the output label probability P(y | x) in the training phase. Training details are given in Section 3.3.
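A NumPy sketch of this scoring function, using the reconstructed form of g above with random placeholder weights:

```python
import numpy as np

# g(h_j, h_t) = v_a^T tanh(U_a h_j + W_a h_t); softmax over candidate parents j.
rng = np.random.default_rng(0)
n, dim = 5, 8
h = rng.standard_normal((n, dim))    # encoder hidden states
v_a = rng.standard_normal(dim)
U_a = rng.standard_normal((dim, dim))
W_a = rng.standard_normal((dim, dim))

scores = np.array([[v_a @ np.tanh(U_a @ h[j] + W_a @ h[t]) for j in range(n)]
                   for t in range(n)])
P_parent = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
# P_parent[t, j] = P_parent(x_j | x_t, x); the constraints of Eq. (10) are
# applied afterwards (see the recursive attention module).
```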

Recursive Attention Module
The recursive attention module recursively calculates α_{d,t,j}, the probability of x_j being the d-th order parent of x_t, as follows:

α_{d,t,j} = Σ_k α_{d−1,t,k} · α_{1,k,j}   (d > 1),
α_{1,t,j} = P_parent(x_j | x_t, x).    (9)

Furthermore, in a dependency parse tree the root should not have any parent, and a token should not depend on itself. In order to satisfy these rules, we impose the following constraints on α_{1,t,j}:

α_{1,0,0} = 1,
α_{1,0,j} = 0   (j ≠ 0),
α_{1,t,t} = 0   (t ≠ 0).    (10)

The 1st and 2nd lines of Eq. (10) represent the case where the parent of the root is the root itself; these constraints imply that the root does not have any parent. The 3rd line of Eq. (10) prevents a token from depending on itself. Because the 1st line of Eq. (9) has the same form as matrix multiplication, Eq. (9) can be computed efficiently on both CPU and GPU.
By recursively reusing a single attention distribution, we no longer need to prepare additional attention distributions for each order when computing the probabilities of higher-order parents. Furthermore, since multiple attention distributions need not be learned, no hyper-parameters are needed to adjust the weight of each distribution during training. Finally, this approach avoids the problem of higher-order dependency relations being sparse in the training dataset.
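A minimal NumPy sketch of the recursion follows; the row renormalization after imposing the Eq. (10) constraints is our assumption, since the text does not spell that step out:

```python
import numpy as np

def first_order(P_parent):
    # alpha_{1,t,j} with the constraints of Eq. (10).
    A1 = P_parent.copy()
    A1[0, :] = 0.0
    A1[0, 0] = 1.0                      # parent of root is root (lines 1-2)
    np.fill_diagonal(A1[1:, 1:], 0.0)   # no token depends on itself (line 3)
    A1[1:] /= A1[1:].sum(axis=1, keepdims=True)  # assumed renormalization
    return A1

def higher_order(A1, d):
    # Eq. (9): alpha_d = alpha_{d-1} @ alpha_1 -- plain matrix products,
    # which is why the recursion runs efficiently on CPU and GPU.
    A = A1
    for _ in range(d - 1):
        A = A @ A1
    return A
```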
The calculated α_{d,t,j} is used to weight the bi-LSTM hidden states h as follows:

γ_{d,t} = Σ_j α_{d,t,j} · h_j.

Note that γ_{d,t} is passed to the selective attention module, as explained in the next section.

Selective Attention Module
To select suitable dependency orders for the input sentence, the selective attention module weights the hidden states γ_{d,t} and sums them into Ω_t using the weighting parameter β_{d,t}, according to the current context:

β_{d,t} = exp(c_t^⊤ W_c γ_{d,t}) / Σ_{d′ ∈ d ∪ {0}} exp(c_t^⊤ W_c γ_{d′,t}),

Ω_t = Σ_{d ∈ d ∪ {0}} β_{d,t} · γ_{d,t},

where W_c is the weight matrix of the softmax layer, d is a group of chain lengths, c_t is a vector representing the current context, γ_{0,t} is a zero vector, and β_{0,t} indicates the weight used when the method does not use the dependency features. The context vector is calculated as c_t = [→s_t; h_t]. The calculated Ω_t is concatenated and input to the output layer: in detail, d_t in Eq. (5) is replaced by the concatenated vector d̃_t = [d_t; Ω_t], and d̃_t is also fed to the input of the decoder LSTM at t + 1.
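The following NumPy sketch illustrates one selective-attention step for a single position t, under our reconstruction of the scoring form c_t^⊤ W_c γ_{d,t}; all vectors are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
lengths = [1, 2, 4]                      # d = {1, 2, 4}
gamma = {0: np.zeros(dim)}               # gamma_{0,t}: zero vector (no syntax)
gamma.update({d: rng.standard_normal(dim) for d in lengths})

c_t = rng.standard_normal(dim)           # current-context vector
W_c = rng.standard_normal((dim, dim))

scores = np.array([c_t @ W_c @ gamma[d] for d in [0] + lengths])
beta = np.exp(scores) / np.exp(scores).sum()          # softmax over lengths
omega_t = sum(b * gamma[d] for b, d in zip(beta, [0] + lengths))
```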

Objective Function
To alleviate the influence of parse errors, we jointly update the 1st-order attention distribution α_{1,t,j} and the label probability P(y | x) (Kamigaito et al., 2017). The 1st-order attention distribution is learned from dependency parse trees. Let a_{t,j} = 1 denote an edge between parent word w_j and child word w_t in a dependency tree (a_{t,j} = 0 denotes that w_j is not the parent of w_t). The objective function of our method is then defined as:

L = − Σ_{(x,y)} ( log P(y | x) + λ Σ_t Σ_j a_{t,j} log α_{1,t,j} ),    (14)

where λ is a hyper-parameter that controls the importance of the output labels and the parse trees in the training dataset.
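A PyTorch sketch of this joint objective (tensor layouts and the stabilizing epsilon are our choices, not the paper's):

```python
import torch

def joint_loss(label_logprobs, gold_labels, alpha1, gold_arcs, lam=1.0):
    # label_logprobs: (T, 3) log P(y_t | ...); gold_labels: (T,) LongTensor.
    nll_labels = -label_logprobs[torch.arange(len(gold_labels)), gold_labels].sum()
    # alpha1: (T, T) first-order attention; gold_arcs: (T, T) 0/1 edges a_{t,j}.
    nll_arcs = -(gold_arcs * torch.log(alpha1 + 1e-8)).sum()
    return nll_labels + lam * nll_arcs   # Eq. (14), summed over one sentence
```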

Dataset
This evaluation used the Google sentence compression dataset (Filippova and Altun, 2013) [5]. This dataset contains compression labels, part-of-speech (POS) tags, dependency parents, and dependency relation labels for each sentence.
We used the first and last 1,000 sentences of comp-data.eval.json as our test and development datasets, respectively. Note that our test dataset is compatible with that used in previous studies (Filippova et al., 2015; Tran et al., 2016; Klerke et al., 2016; Wang et al., 2017).
In this paper, we trained the following baselines and HiSAN on all sentences of sent-comp.train*.json (200,000 sentences in total) [6][7][8].
In our experiments, we replaced rare words that appear fewer than 10 times in our training dataset with a special token ⟨UNK⟩. After this filtering, the input vocabulary size was 23,168.
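This filtering can be sketched in a few lines of Python (function names are illustrative):

```python
from collections import Counter

def build_vocab(training_sentences, min_count=10):
    # Keep only words that appear at least min_count times in the training data.
    counts = Counter(w for sent in training_sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def apply_unk(sentence, vocab):
    return [w if w in vocab else "<UNK>" for w in sentence]
```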

Baseline Methods
[5] https://github.com/google-research-datasets/sentence-compression
[6] Note that Filippova et al. (2015) used 2,000,000 sentences for training their method, but these datasets are not publicly available.
[7] We also demonstrate an experimental evaluation on a small training set (8,000 sentences in total) that was used in previous research. The results for this setting are listed in our supplemental material.
[8] Note that the large training dataset lacks periods at the end of compressed sentences. To unify the form of compressed sentences in the small and large settings, we added periods to the ends of compressed sentences in the large training dataset.

For a fair comparison with HiSAN, we used the input features described in Eq. (1) for the following baseline methods:
Tagger: A method that regards sentence compression as a tagging task based on bi-LSTM (Klerke et al., 2016;Wang et al., 2017).
Tagger+ILP: An extension of Tagger that integrates ILP (Integer Linear Programming)-based dependency tree trimming (Wang et al., 2017). We set their positive parameter λ to 0.2.
Bi-LSTM: A method that regards sentence compression as a sequence-to-sequence translation task, as proposed by Filippova et al. (2015). For a fair comparison, we replaced their one-directional LSTM encoder with the more expressive bi-LSTM. The initial state of the decoder is set to the sum of the final states of the forward and backward LSTMs.
Bi-LSTM-Dep: An extension of Bi-LSTM that exploits features obtained from a dependency tree (named LSTM-PAR-PRES in Filippova et al. (2015)). Following their work, we fed the word embedding and the predicted label of a dependency parent word to the current decoder input of Bi-LSTM.
Base: Our baseline Seq2Seq method described in Section 2.
Attn: An extension of Base with the softmax-based attention method (Luong et al., 2015). We replaced h_t in Eq. (6) with the weighted sum calculated by the commonly used concat attention (Luong et al., 2015).

HiSAN-Dep: A variant of HiSAN that uses a pipeline approach. We fix α_{1,t,j} to 1.0 if x_j is the parent of x_t in the input dependency parse tree, and to 0.0 otherwise. In this baseline, d = {1} is used.
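The fixed attention of HiSAN-Dep can be sketched with a hypothetical helper that turns the parent indices from the input parse tree into a one-hot α_1 matrix:

```python
import numpy as np

def fixed_alpha(parents):
    # parents[t]: index of x_t's parent in the parse tree (parents[0] = 0
    # for the root, matching the constraints of Eq. (10)).
    n = len(parents)
    A1 = np.zeros((n, n))
    for t, p in enumerate(parents):
        A1[t, p] = 1.0   # alpha_{1,t,j} = 1.0 iff x_j is the parent of x_t
    return A1
```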

Training Details
Following previous work (Wang et al., 2017), the dimensions of the word embeddings, LSTM layers, and attention layer were set to 100. For the Tagger-style methods, the depth of the LSTM layers was set to 3, and for the Seq2Seq-style methods, to 2; in this setting, all methods have six LSTM layers in total. The dimensions of the POS and dependency-relation label embeddings were set to 40. All parameters were initialized by Glorot and Bengio (2010)'s method. For all methods, we applied dropout (Srivastava et al., 2014) to the inputs of the LSTM layers, with all dropout rates set to 0.3. During training, the learning rate was tuned with Adam (Kingma and Ba, 2014), with the initial learning rate set to 0.001. The maximum number of training epochs was set to 30. The hyper-parameter λ was set to 1.0 in the supervised attention setting. All gradients were averaged within each mini-batch. The maximum mini-batch size was set to 16, and the order of mini-batches was shuffled at the end of each training epoch. The gradient clipping threshold was set to 5.0. We selected trained models by early stopping, maximizing the per-sentence accuracy (i.e., how many compressions could be fully reproduced) on the development dataset.
To obtain compressed sentences, we used greedy decoding rather than beam decoding, as the latter yielded no gain on the development dataset. All methods were implemented in C++ with DyNet (Neubig et al., 2017).

Automatic Evaluation
In the automatic evaluation, we used the token-level F1-measure (F1) as well as the recall of ROUGE-1, ROUGE-2 and ROUGE-L (Lin and Och, 2004) as evaluation measures.
We used ∆C = (system compression ratio − gold compression ratio) to evaluate how close the compression ratio of the system outputs was to that of the gold compressed sentences. The average compression ratio of the gold compressions for the input sentences was 39.8. We used the micro-average for the F1-measure and compression ratio, and the macro-average for the ROUGE scores.
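For concreteness, a small Python sketch of the micro-averaged token-level F1 and the ∆C measure; our reading of "micro-average" (pooling counts over all sentences before computing the measure) is an assumption:

```python
def micro_f1(system_keeps, gold_keeps):
    # system_keeps / gold_keeps: per-sentence sets of kept token indices.
    tp = sum(len(s & g) for s, g in zip(system_keeps, gold_keeps))
    prec = tp / sum(len(s) for s in system_keeps)
    rec = tp / sum(len(g) for g in gold_keeps)
    return 2 * prec * rec / (prec + rec)

def delta_c(system_lens, gold_lens, input_lens):
    # Compression ratio: kept tokens as a percentage of input tokens.
    sys_cr = 100.0 * sum(system_lens) / sum(input_lens)
    gold_cr = 100.0 * sum(gold_lens) / sum(input_lens)
    return sys_cr - gold_cr
```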
To verify the benefits of our method on long sentences, we additionally report scores on sentences longer than the average sentence length (28 words) in the test set. The average compression ratio of the gold compressions for these longer input sentences was 31.4.
All results are reported as the average scores of five trials. In each trial, different random choices were used to generate the initial values of the embeddings and the order of mini-batch processing.

[Table 1: Results of automatic evaluation. ALL and LONG represent, respectively, the results on all sentences and on long sentences (longer than the average length of 28 words) in the test dataset. d represents the groups of d-length dependency chains. * indicates the model that achieved the best score among the same methods with different d on the development dataset. Bold values indicate the best scores.]
As shown in Table 1, HiSAN achieved the best scores in F1, ROUGE, and ∆C in all settings. The F1 scores of HiSAN (ALL) were higher than the current state-of-the-art score of 82.0 reported by Filippova et al. (2015). The improvements in F1 and ROUGE scores over the baseline methods in the LONG setting are larger than those in the ALL setting. From these results, we can conclude that d-length dependency chains are effective for sentence compression, especially for longer-than-average sentences. HiSAN (d = {1}) outperformed HiSAN-Dep in F1 scores in both the ALL and LONG settings. This result shows the effectiveness of jointly learning the dependency parse tree and the output labels.

Human Evaluation
In the human evaluation, we compared the baselines with our method that achieved the highest F1 score in the automatic evaluations. We used the first 100 sentences longer than the average sentence length (28 words) in the test set. Similar to Filippova et al. (2015), each compressed sentence was rated by five raters, who were asked to select a rating on a five-point Likert scale, ranging from one to five, for readability (Read) and informativeness (Info). We report the average of the scores from the five raters. To investigate the differences between the methods, we also compared the baseline methods and HiSAN on the sentences for which the methods yielded different compressed sentences.

[Table 2: Results of human evaluations. All denotes results for all sentences in the test set, and Diff denotes results for the sentences for which the methods yielded different compressed sentences. Parentheses ( ) denote sentence size. CR denotes the compression ratio. The average gold compression ratios for the input sentences in All and Diff were 32.1 and 31.5, respectively. Other notations are the same as in Table 1.]

Table 2 shows the results. HiSAN (d = {1, 2, 4}) achieved better results than the baselines in terms of both readability and informativeness. These results agree with those obtained from the automatic evaluations. From the results on the sentences whose compressions differed between Base and HiSAN (d = {1, 2, 4}), we can clearly observe the improvement attained by HiSAN (d = {1, 2, 4}) in informativeness.

[Table 3: Example compressed sentences (first example).]

Input
Pakistan signed a resolution on Monday to import 1,300 MW of electricity from Kyrgyz Republic and Tajikistan to overcome power shortage in summer season, said an official press release .

Gold
Pakistan signed a resolution to import 1,300 MW of electricity from Kyrgyz Republic and Tajikistan .

Tagger
Pakistan signed a resolution to import 1,300 MW of electricity Tajikistan to overcome shortage .

Tagger+ILP

Pakistan signed resolution to import MW said .

Base
Pakistan signed a resolution to import 1,300 MW of electricity .

HiSAN-Dep (d = {1})

Pakistan signed a resolution to import 1,300 MW of electricity .

Analysis
For both examples, the compressed sentences output by Base are grammatically correct. However, their informativeness is inferior to that attained by HiSAN (d = {1, 2, 4}). The compressed sentence output by HiSAN-Dep in the second example lacks both readability and informativeness. We believe that this compression failure is caused by incorrect parse results, because HiSAN-Dep employs features obtained from the dependency tree in a pipeline procedure.
As reported in recent papers (Klerke et al., 2016; Wang et al., 2017), the F1 scores of Tagger match or exceed those of the Seq2Seq-based methods. However, the compressed sentence output by Tagger for the first example in Table 3 is ungrammatical. We believe that this is mainly because Tagger cannot consider the predicted labels of previous words. Tagger+ILP outputs grammatically incorrect compressed sentences in both examples. This result indicates that an ILP constraint based only on parent-child relationships between words is insufficient to generate fluent sentences.
Compared with these baselines, HiSAN (d = {1, 2, 4}) output compressed sentences that were fluent and more informative. This observation, which confirmed our expectations, is supported by both the automatic and human evaluation results.

[Table 4: Automatic evaluation results (F1, ROUGE and ∆C) for sentences with deep dependency trees.]

We confirmed that the compression performance of HiSAN indeed improves when sentences have deep dependency trees. Table 4 shows the automatic evaluation results for sentences with deep dependency trees. We can observe that HiSAN with higher-order dependency chains achieves better compression performance on such sentences. Figure 6 shows a compressed sentence and its dependency graph as determined by HiSAN (d = {1, 2, 4}). Almost all arcs with large probabilistic weights are contained in the parsed dependency trees. Interestingly, some arcs not contained in the parsed dependency trees connect words that are linked by dependency chains in the parsed dependency tree (colored red). Considering that the training dataset does not contain such dependency relationships, we surmise that these arcs are learned to support sentence compression. This result meets our expectation that dependency chain information is necessary for compressing sentences accurately.

Related Work
Several neural-network-based methods for sentence compression use syntactic features. Filippova et al. (2015) employ features obtained from automatic parse trees in an LSTM-based encoder-decoder in a pipeline manner. Wang et al. (2017) trim dependency trees based on the scores predicted by an LSTM-based tagger. Although these methods can consider dependency relationships between words, their pipeline approach and reliance on 1st-order dependency relationships limit their ability to compress longer-than-average sentences.
Several recent machine translation studies also utilize syntactic features in Seq2Seq models. Eriguchi et al. (2017) and Aharoni and Goldberg (2017) incorporate syntactic features of the target language in the decoder part of Seq2Seq. Both methods outperformed Seq2Seq without syntactic features in terms of translation quality. However, neither method can provide an entire parse tree until the decoding phase is finished; thus, these methods cannot track all possible parents of each word during decoding. Similar to HiSAN, Hashimoto and Tsuruoka (2017) use dependency features as attention distributions; unlike HiSAN, however, they use pre-trained dependency relations and do not take the chains of dependencies into account. Bastings et al. (2017) consider higher-order dependency relationships in Seq2Seq by incorporating a graph convolution technique (Kipf and Welling, 2016) into the encoder. However, the dependency information for the graph convolution is still given in a pipeline manner.
Unlike the above methods, HiSAN can capture higher-order dependency features using d-length dependency chains without relying on pipeline processing.

Conclusion
In this paper, we incorporated higher-order dependency features into Seq2Seq to compress sentences of all lengths.
Experiments on the Google sentence compression test data showed that our higher-order syntactic attention network (HiSAN) achieved better performance than the baseline methods, with F1, ROUGE-1, 2 and L scores of 83.2, 82.9, 75.8 and 82.7, respectively. Of particular importance, on longer-than-average sentences HiSAN outperformed the baseline methods in terms of F1 and ROUGE-1, 2 and L scores. Furthermore, HiSAN also outperformed the previous methods in both readability and informativeness in human evaluations.
From the evaluation results, we conclude that HiSAN is an effective tool for the sentence compression task.