Unsupervised Rewriter for Multi-Sentence Compression

Multi-sentence compression (MSC) aims to generate a grammatical but reduced compression from multiple input sentences while retaining their key information. The previously dominant approach to MSC is the extraction-based word graph approach. A few variants further leveraged lexical substitution to yield more abstractive compressions. However, two limitations exist. First, the word graph approach, which simply concatenates fragments from multiple sentences, may yield non-fluent or ungrammatical compressions. Second, lexical substitution is often inappropriate without considering context information. To tackle these issues, we present a neural rewriter for multi-sentence compression that does not require any parallel corpus. Empirical studies show that our approach achieves comparable results under automatic evaluation and improves the grammaticality of compressions under human evaluation. A parallel corpus with more than 140,000 (sentence group, compression) pairs is also constructed as a by-product for future research.


Introduction
Multi-sentence compression (MSC) aims to generate a single shorter and grammatical sentence that preserves important information from a group of related sentences. Over the past decade, multi-sentence compression has attracted considerable attention owing to its potential applications, such as compressing content to be displayed on screens of limited size (e.g., mobile devices) and benefiting other natural language processing tasks, such as multi-document summarization (Banerjee et al., 2015), opinion summarization, and text simplification. Most existing works rely on the word graph approach introduced by Filippova (2010), which offers a simple solution that copies fragments from different input sentences and concatenates them to form the final compression. A series of subsequent works (Boudin and Morin, 2013; Banerjee et al., 2015; Luong et al., 2015; ShafieiBavani et al., 2016; Pontes et al., 2018; Nayeem et al., 2018) attempted to improve the word graph approach with a variety of strategies, such as keyphrase re-ranking. However, such extraction-based approaches may yield non-fluent or ungrammatical compressions. A previous study (Nayeem and Chali, 2017) has shown that word graph approaches produce ungrammatical sentences more than 30% of the time (evaluated by a chart parser), which is partly because these extraction-based approaches do not reword. In fact, human annotators tend to compress a sentence through several rewriting operations, such as substitution and rewording (Cohn and Lapata, 2008). Although some works attempt lexical substitution, it is often inappropriate without considering context information.
To tackle the above-mentioned problems, we present herein an unsupervised rewriter that improves the grammaticality of compressions while introducing an appropriate amount of novel words. Inspired by unsupervised machine translation (Sennrich et al., 2015; Fevry and Phang, 2018), we adapt the back-translation technique to our setting. Unlike in machine translation, in the compression task the multiple input sentences and the single output compression usually lack semantic equivalence, which complicates the application of back-translation. Thus, we propose a rewriting scheme that first exploits the word graph approach to produce a coarse-grained compression (B), in which we then substitute words with their shorter synonyms to yield a paraphrased sentence (C). A neural rewriter is subsequently trained on the semantically equivalent (B, C) pairs to improve grammaticality and encourage more novel words in the compression. Our contributions are two-fold: (i) we present a neural rewriter for multi-sentence compression that requires no parallel data and significantly improves grammaticality and the novel word rate while maintaining information coverage (informativeness) according to automatic evaluation; and (ii) we introduce a large-scale multi-sentence compression corpus along with a manually created test set for future research. We release the source code and data here1.

Dataset Construction
The largest existing English corpus for multi-sentence compression is the Cornell corpus (McKeown et al., 2010), which has only 300 instances. We introduce herein a large-scale dataset compiled from the English Gigaword2. After preprocessing (e.g., filtering out strange punctuation), 1.37 million news articles remained, from which we grouped related sentences. The full procedure for the dataset construction is available here3.

Group Related Sentences
The prerequisite for multi-sentence compression is that all input sentences should relate to the same topic or event. As observed by McKeown et al. (2010), if the sentences are too similar, one of the input sentences could simply be treated as the compression; in contrast, if the sentences are too dissimilar (no interaction), they may describe different events or topics. Both cases should be avoided because sentence compression would not be necessary. We use bi-gram similarity, which exhibited the highest accuracy (90%)4. We empirically set the lower threshold of the bigram similarity to 0.2 to avoid very dissimilar sentences and the upper threshold to 0.7 to avoid near-identical sentences. As presented in Table 1, 140,572 sentence groups were finally yielded out of 1.37 million news articles. We refer to this as the Giga-MSC dataset.
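The thresholded grouping step can be sketched as follows. The paper does not spell out the exact bigram similarity formula, so this sketch assumes a Jaccard-style overlap of bigram sets; `bigram_similarity` and `keep_group` are illustrative names, not the paper's implementation.

```python
from itertools import combinations

def bigrams(tokens):
    """Set of adjacent token pairs in a sentence."""
    return {(a, b) for a, b in zip(tokens, tokens[1:])}

def bigram_similarity(s1, s2):
    """Jaccard-style overlap of the two sentences' bigram sets
    (one plausible instantiation of the paper's bigram similarity)."""
    b1, b2 = bigrams(s1.lower().split()), bigrams(s2.lower().split())
    if not b1 or not b2:
        return 0.0
    return len(b1 & b2) / len(b1 | b2)

def keep_group(sentences, lower=0.2, upper=0.7):
    """Keep a sentence group only if every pair falls strictly inside
    the (lower, upper) band: related, but not near-identical."""
    sims = [bigram_similarity(a, b) for a, b in combinations(sentences, 2)]
    return all(lower < s < upper for s in sims)
```

Under this sketch, near-identical pairs (similarity above 0.7) and unrelated pairs (below 0.2) are both filtered out, matching the two failure cases described above.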

Giga-MSC Dataset Annotation
We randomly selected 150 sentence groups for human annotation; the annotations were used as reference compressions in the automatic evaluation. Two annotators5 were asked to generate a single reduced compression that satisfies two conditions: (1) it conveys the important content of all the input sentences, and (2) it is grammatically correct. We were interested in how human annotators would perform this task without vocabulary constraints; hence, unlike several previous works (Boudin and Morin, 2013; Luong et al., 2015), we did not instruct them to introduce as little new vocabulary as possible in their compressions.
The inter-annotator agreement, measured by Fleiss' Kappa (Artstein and Poesio, 2008), was 0.43, indicating that moderate agreement was reached.
Methodology

Figure 1 illustrates our rewriting approach, which consists of three steps.

Step 1 (B→C)
C is yielded by substituting words and phrases in B with synonyms. We first identified all the multiword expressions in a sentence and looked up their synonyms in WordNet 3.06. Keeping in mind that our goal is to shorten the sentence as much as possible, we specifically substituted multiword expressions, such as police officer and united states of america, with their shorter synonyms policeman and u.s.. Because the coverage of synonyms in the WordNet dictionary is relatively limited, we also exploited PPDB 2.07 to replace nouns, verbs, and adjectives with their shorter counterparts. For example, the verb demonstrating is converted into proved. Using the Giga-MSC dataset we created, 140,000 (A, B, C) tuples were yielded. Lexical substitution might make C non-fluent, but it significantly increases the number of novel words. Therefore, the next steps focus on creating pseudo parallel data to boost the fluency of C while attempting to maintain the rate of novel words.

[Figure 1: Graphic illustration of the rewriter model. A refers to the multiple input sentences. B denotes a single compressed sentence produced by the word graph approach. C is the paraphrased sentence. C′ is a large-scale, in-domain monolingual corpus, while B′ refers to the compression predicted by a pre-trained backward model given C′ as input. B + B′ and C + C′ are the mixed datasets.]
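The shortening substitution of Step 1 can be sketched with a greedy longest-match-first pass. The tiny `SHORTER_SYNONYMS` table below is a hypothetical stand-in for WordNet/PPDB lookups, and a replacement is applied only when it is strictly shorter than the matched expression.

```python
# Hypothetical miniature synonym table standing in for WordNet/PPDB;
# keys may be multiword expressions.
SHORTER_SYNONYMS = {
    "police officer": "policeman",
    "united states of america": "u.s.",
    "demonstrating": "proved",
}

def substitute_shorter(sentence, table=SHORTER_SYNONYMS):
    """Greedy left-to-right replacement of (multiword) expressions
    with strictly shorter synonyms (Step 1, B -> C)."""
    tokens = sentence.lower().split()
    out, i = [], 0
    max_len = max(len(k.split()) for k in table)
    while i < len(tokens):
        for n in range(max_len, 0, -1):  # try longest match first
            phrase = " ".join(tokens[i:i + n])
            repl = table.get(phrase)
            if repl is not None and len(repl) < len(phrase):
                out.append(repl)
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```

Matching longest expressions first ensures that a multiword expression like united states of america is replaced as a unit rather than word by word.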

Step 2 (C→B)
Because the yielded B and C are semantically equivalent, we train a backward model (C→B) using the 140,000 (C, B) pairs. The backward model consists of a three-layer bi-directional LSTM encoder and a uni-directional decoder with an attention mechanism. After the backward model was trained, one million grammatical in-domain sentences C′ were given as input to generate one million B′. The average length of C′ was similar to that of C (30.2 tokens). We also found that C′ maintained a novel word rate of approximately 8.9 with respect to B′.

Step 3 (B+B′→C+C′)
We merge the training data (coarse-grained compressions B and non-fluent paraphrased compressions C) with the pseudo parallel data (pseudo sentences B′ and grammatical sentences C′) to jointly learn a forward model consisting of a three-layer LSTM encoder and decoder. The vocabulary and word embeddings are shared between the backward and forward models. Because the grammatical C′ accounts for the majority of the training targets, we expect it to improve the fluency of C.
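The three-step data pipeline can be sketched end-to-end with stub components. Here `word_graph`, `paraphrase`, and `backward_model` are hypothetical placeholders for the paper's word graph module, Step 1 substitution, and trained backward LSTM, respectively; only the data flow is real.

```python
def build_training_data(groups, word_graph, paraphrase,
                        monolingual, backward_model):
    """Assemble the mixed (B + B', C + C') parallel data used to
    train the forward rewriter."""
    # Step 1: the word graph yields B; synonym substitution yields C.
    B = [word_graph(g) for g in groups]
    C = [paraphrase(b) for b in B]
    # Step 2: a backward model (trained on (C, B) pairs) turns the
    # in-domain monolingual corpus C' into pseudo sources B'.
    B_prime = [backward_model(c) for c in monolingual]
    # Step 3: the forward model trains on the concatenation; the
    # clean C' dominates the targets, pushing the decoder toward
    # fluency while the (B, C) pairs preserve novel-word behavior.
    sources = B + B_prime
    targets = C + list(monolingual)
    return list(zip(sources, targets))
```

With one million monolingual sentences against 140,000 (B, C) pairs, roughly 88% of the forward model's targets are grammatical C′, which is the intuition behind the fluency gain.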

Datasets
We used two datasets to evaluate model performance. The first is the Giga-MSC dataset detailed in Section 2, in which the 150 annotated sentence groups were used as the ground truth for testing. The second is the Cornell dataset (McKeown et al., 2010).

Baseline Approaches
We considered (#1) the word graph approach (Filippova, 2010) and an advanced version, (#2) the keyphrase-based word graph model (Boudin and Morin, 2013) augmented with keyphrase identification (Wan and Xiao, 2008), as our word graph baselines. Additionally, (#3) the hard paraphrasing (Hard-Para) approach directly substitutes words and phrases with their shorter synonyms using WordNet and PPDB 2.0 (size M, with 463,433 paraphrasing pairs), and (#4) a Seq2seq model was trained on the (B, C) pairs. We considered both of these as comparison approaches as well. Training details are presented in Appendix 1. We release the source code here8.

Out-of-Vocabulary (OOV) Word Handling
Both datasets are from the news domain; hence, many organization and person names are out of vocabulary. We tackled this problem using the approach of Fevry and Phang (2018). Given an input sequence, we first identified all OOV tokens and numbered them in order, storing the map from the numbered OOV tokens (e.g., OOV1 and OOV2) to the original words. A corresponding word embedding was also assigned to each numbered OOV token. We then applied the same numbering scheme to the target. At inference, we replaced any output OOV tokens with their corresponding words using the stored map, which allows the model to produce words that are not in its vocabulary.
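The numbering scheme described above can be sketched as a pair of small functions (illustrative names, assuming repeated OOV words reuse the same placeholder):

```python
def number_oov(tokens, vocab):
    """Replace each out-of-vocabulary token with a numbered
    placeholder (OOV1, OOV2, ...) and remember the mapping."""
    mapping, out = {}, []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)
        else:
            if tok not in mapping:
                mapping[tok] = f"OOV{len(mapping) + 1}"
            out.append(mapping[tok])
    # invert: placeholder -> original word, for use at inference
    return out, {v: k for k, v in mapping.items()}

def restore_oov(tokens, inverse_map):
    """Map numbered placeholders in the model output back to words."""
    return [inverse_map.get(tok, tok) for tok in tokens]
```

Because the same numbering is applied to source and target, a placeholder the decoder emits can be deterministically mapped back to the original surface form.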

Results and Analysis
The METEOR metric (n-gram overlap with synonyms) was used for automatic evaluation. The novel n-gram rate9 (i.e., NN-1, NN-2, NN-3, and NN-4) was also computed to investigate how many novel words the models introduce. Table 2 and Table 3 present the results, and below are our observations: (i) the keyphrase word graph approach (#2) is a strong baseline according to the METEOR metric. In comparison, the proposed rewriter (#5) yields a comparable result on the METEOR metric for the Giga-MSC dataset but a lower result for the Cornell dataset. We speculate that this may be due to differences in the ground-truth compressions: 8.6% of the unigrams in the ground-truth compressions of the Giga-MSC dataset are novel, whereas only 5.2% are novel in those of the Cornell dataset. (ii) Hard-Para (#3), Seq2seq (#4), and our rewriter (#5) significantly increase the number of novel n-grams, and the proposed rewriter (#5) offers the best trade-off between information coverage (measured by METEOR) and the introduction of novel n-grams across all methods. (iii) Comparing Seq2seq (#4) with our rewriter (#5), we found that adding pseudo data decreases the novel word rate and increases the METEOR score on both datasets.

9 Novel n-gram rate = 1 − |S ∩ C| / |C|, where S is the set of n-grams from all input sentences and C is the set of n-grams in the compression.

Human Evaluation. Because the METEOR metric cannot measure the grammaticality of compressions, we asked two human raters10 to assess 50 compressed sentences from the Giga-MSC test set in terms of informativeness and grammaticality. We used a 0-2 point scale (2 pts: excellent; 1 pt: good; 0 pts: poor), similar to previous work (see Appendix 2 for the 0-2 scale point evaluation details).
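The novel n-gram rate defined in footnote 9 is straightforward to compute; a minimal sketch (assuming whitespace tokenization):

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) over a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_rate(inputs, compression, n=1):
    """NN-n = 1 - |S ∩ C| / |C|: the fraction of the compression's
    n-grams that appear in none of the input sentences."""
    S = set().union(*(ngrams(s.split(), n) for s in inputs))
    C = ngrams(compression.split(), n)
    if not C:
        return 0.0
    return 1 - len(S & C) / len(C)
```

A rate of 0 means the compression is purely extractive; higher values indicate more rewording relative to the inputs.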
Table 4 shows the average ratings for informativeness and readability. From these, we found that our rewriter (RWT) significantly improved the grammaticality of compressions in comparison with the keyphrase word graph approach, implying that the pseudo data may contribute to the language modeling of the decoder, thereby improving grammaticality.

[Table 5:
Sentence 1: Alleged Russian mobster Alimzhan Tokhtakhounov, accused of conspiring to fix skating events at the 2002 Winter Olympics in salt lake city, has returned to Moscow, the Kommersant daily reported wednesday.
Sentence 2: US prosecutors accused Tokhtakhounov of conspiring to fix the artistic skating events at the salt lake city games with the assistance of the French and Russian judges.
KWG: US prosecutors accused Tokhtakhounov, accused of conspiring to fix the artistic skating events at the salt lake city, has returned to Moscow, the Kommersant daily reported wednesday.
RWT: Tokhtakhounov, accused of conspiracy to fix the artistic skating events at the salt lake town, has returned to Moscow, the Kommersant daily reported.]

Context Awareness Evaluation. Because several novel words were introduced by Hard-Para (#3), Seq2seq (#4), and our rewriter (#5), we were interested in determining whether the compressions generated by these models are context-aware.
We herein considered an out-of-the-box context-aware encoder, BERT (Devlin et al., 2018). The evaluation proceeded as follows: for a sentence with N words, S = [w_1, w_2, ..., w_N], we masked each word in turn and calculated the average negative log-likelihood: CXT(S) = (1/N) Σ_{i=1}^{N} −log p(w_i | c), where c = [w_1, ..., w_{i−1}, w_{i+1}, ..., w_N]. We used a publicly available implementation11. A lower CXT(S) suggests better context awareness. As presented in Table 6, the proposed rewriter achieves the lowest value on both datasets, indicating better context awareness in its generated compressions.
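The metric itself reduces to a short loop once a masked-token probability is available. In this sketch, `masked_prob` is a hypothetical stand-in for BERT's masked-LM probability of the true token at position i given the rest of the sentence:

```python
import math

def context_score(tokens, masked_prob):
    """CXT(S): average negative log-likelihood of each token given
    the rest of the sentence. `masked_prob(tokens, i)` should
    return p(w_i | w_1..w_{i-1}, w_{i+1}..w_N); with BERT, this is
    the masked-LM probability of the true token at position i."""
    N = len(tokens)
    nll = sum(-math.log(masked_prob(tokens, i)) for i in range(N))
    return nll / N
```

Lower scores mean each word is, on average, better predicted from its surrounding context.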
Case Study. To illustrate the pros and cons of the proposed rewriter, we conducted a case study, listed in Table 5, where two sentences were given as input and two compression outputs were produced by KWG and RWT. We observed that RWT corrected the ungrammatical parts (i.e., the underlined words) generated by KWG. However, the paraphrasing was not always accurate: phrases such as salt lake city are fixed collocations, and paraphrasing them may degrade the informativeness of the compression.

11 https://github.com/xu-song/bert-as-language-model

Conclusion
In this work, we propose a coarse-to-fine rewriter for multi-sentence compression with a specific focus on improving the quality of compression.
The experimental results show that the proposed method produces more grammatical sentences while introducing novel words into the compression. Furthermore, we presented an approach for evaluating context awareness, which may shed light on the automatic evaluation of sentence quality by means of pre-trained models. In the future, we will consider extending the current approach to single-document or multi-document summarization.