Sentence Compression for Arbitrary Languages via Multilingual Pivoting

In this paper we advocate the use of bilingual corpora, which are abundantly available, for training sentence compression models. Our approach borrows much of its machinery from neural machine translation and leverages bilingual pivoting: compressions are obtained by translating a source string into a foreign language and then back-translating it into the source while controlling the translation length. Our model can be trained for any language for which a bilingual corpus is available and performs arbitrary rewrites without access to compression-specific data. We release MOSS, a new parallel Multilingual Compression dataset for English, German, and French which can be used to evaluate compression models across languages and genres.


Introduction
Sentence compression aims to produce a summary of a single sentence that retains the most important information while preserving its fluency. The task has attracted much attention due to its potential for applications such as text summarization (Jing, 2000; Madnani et al., 2007; Woodsend and Lapata, 2010; Berg-Kirkpatrick et al., 2011), subtitle generation (Vandeghinste and Pan, 2004; Luotolahti and Ginter, 2015), and the display of text on small screens (Corston-Oliver, 2001).
The bulk of research on sentence compression has focused on a simplification of the task involving exclusively word deletion (Knight and Marcu, 2002; Riezler et al., 2003; Turner and Charniak, 2005; McDonald, 2006; Clarke and Lapata, 2008; Cohn and Lapata, 2009), whereas a few approaches view sentence compression as a more general text rewriting problem (Galley and McKeown, 2007; Woodsend and Lapata, 2010; Cohn and Lapata, 2013). Irrespective of how the compression task is formulated, most previous work relies on syntactic information such as parse trees to help decide what to delete from a sentence or which rules to learn in order to rewrite a sentence using fewer words. More recently, there has been much interest in applying neural network models to natural language generation tasks, including sentence compression (Rush et al., 2015; Filippova et al., 2015; Kikuchi et al., 2016). Filippova et al. (2015) focus on deletion-based sentence compression which they model as a sequence labeling problem using a recurrent neural network with long short-term memory units (LSTM; Hochreiter and Schmidhuber 1997). Rush et al. (2015) capture the full gamut of rewrite operations drawing insights from encoder-decoder models recently proposed for machine translation (Bahdanau et al., 2015).
Neural network-based approaches are data-driven, relying on the ability of recurrent architectures to learn continuous features without recourse to preprocessing tools or syntactic information (e.g., part-of-speech tags, parse trees). In order to achieve good performance, they require large amounts of training data, in the region of millions of long-short sentence pairs. Existing compression datasets are several orders of magnitude smaller. For example, the Ziff-Davis corpus (Knight and Marcu, 2002) contains 1,067 sentences and originated from a collection of news articles on computer products. Clarke and Lapata (2008) create two manual corpora sampled from written (1,433 sentences) and spoken sources (1,370 sentences). Cohn and Lapata (2013) elicit manual compressions for 625 sentences taken from newspaper articles. More recently, Toutanova et al. (2016) crowdsource a larger corpus which contains manual compressions for single and multiple sentences (about 26,000 pairs of source and compressed texts).
Since large scale compression datasets do not occur naturally, they must be somehow approximated, e.g., by pairing headlines with the first sentence of a news article (Filippova and Altun, 2013; Rush et al., 2015). As a result, the training corpus construction process must be repeated and reconfigured for new languages and domains (e.g., many headline-first sentence pairs are spurious and need to be filtered using language- and domain-specific heuristics). And although it may be easy to automatically obtain large scale training data in the news domain, it is not clear how such data can be sourced for many other genres with different writing conventions.
Our work addresses the paucity of data for sentence compression models. We argue that multilingual corpora are a rich source for learning a variety of rewrite rules across languages and that existing neural machine translation (NMT) models (Sutskever et al., 2014; Bahdanau et al., 2015) can be easily adapted to the compression task through bilingual pivoting (Mallinson et al., 2017) coupled with methods which decode the output sequence to a desired length (e.g., subject to language and genre requirements). We obtain compressions by translating a source string into a foreign language and then back-translating it into the source while controlling the translation length (Kikuchi et al., 2016). Our model can be trained for any language as long as a bilingual corpus is available, and can perform arbitrary rewrites while taking advantage of multiple pivots if these exist. We also demonstrate that models trained on multilingual data perform well out-of-domain.
Although our approach does not employ compression corpora for training, for evaluation purposes we create MOSS, a new Multilingual Compression dataset for English, French, and German. MOSS is a parallel corpus containing documents from the European parliament proceedings, TED talks, news commentaries, and the EU bookshop. Each document is written in English, French, and German, and compressed by native speakers of the respective language who process one document at a time. We obtain five compressions per document, leading to 2,000 long-short sentence pairs per language. Like previous related resources (Clarke and Lapata, 2008; Cohn and Lapata, 2013; de Loupy et al., 2010) our corpus is curated manually; however, it differs from Toutanova et al. (2016) in that it contains compressions for individual sentences, not documents.
There has been relatively little interest in compressing languages other than English. A few models have been proposed for Japanese (Hori and Furui, 2004;Hirao et al., 2009;Harashima and Kurohashi, 2012), including a neural network model (Hasegawa et al., 2017) which repurposes Filippova and Altun's (2013) data construction method for Japanese. There is a compression corpus available for French (de Loupy et al., 2010), however, we are not aware of any modeling work on this language. Overall, there are no standardized datasets in languages other than English, either for training or testing.
Our contributions in this work are three-fold: a novel application of bilingual pivoting to sentence compression; empirical results showing that our model scales across languages and text genres without additional supervision beyond what is available in the bilingual parallel data; and the release of a multilingual, multi-reference compression corpus which can be used to gain insight into the compression task and facilitate further research in compression modeling.

Pivot-based Neural Compression
In our pivot-based sentence compression model an input sequence is first translated into a foreign language, and then back into the source language. Unlike previous paraphrasing pivoting models (Mallinson et al., 2017), we parameterize our translation models with a length feature, which allows us to produce compressed output. We define two models, performing compression either in one step or in two steps, the latter affording more flexibility in model output.

NMT Background
In the neural encoder-decoder framework for MT (Bahdanau et al., 2015; Sutskever et al., 2014), an encoder takes in a source sequence X = (x_1, ..., x_{T_x}) of length T_x and the decoder generates a target sequence (y_1, ..., y_{T_y}) of length T_y. Let h_i be the hidden state of the source symbol at position i, obtained by concatenating the forward and backward encoder RNN hidden states. We deviate from previous work (Bahdanau et al., 2015; Sutskever et al., 2014) in that we initialize the decoder with the average of the source hidden states:

s_0 = tanh( W_init (1/T_x) Σ_{i=1}^{T_x} h_i )    (1)

where W_init is a learnt parameter. Our decoder is a conditional recurrent neural network, specifically a gated recurrent unit (GRU; Cho et al., 2014) with attention, which we denote as cGRU_att. cGRU_att takes as input the previous hidden state s_{j-1}, the source annotations C = {h_1, ..., h_{T_x}}, and the previously decoded symbol y_{j-1} in order to update its hidden state s_j, which is used to decode symbol y_j at position j:

s_j = cGRU_att(s_{j-1}, y_{j-1}, C)

cGRU_att consists of three components. The first combines the previously decoded symbol y_{j-1} and the previous hidden state s_{j-1} to generate an intermediate representation s'_j. The attention mechanism, ATT, takes as input the entire context set C along with the intermediate hidden state s'_j in order to compute the context vector c_j:

c_j = ATT(C, s'_j) = Σ_{i=1}^{T_x} α_{ij} h_i

where α_{ij} is the normalized alignment weight between the source symbol at position i and the target symbol at position j, computed with a feedforward neural network f. Finally, we generate s_j, the hidden state of cGRU_att, from the intermediate representation s'_j and the context vector c_j. Given s_j, y_{j-1}, and c_j, the output probability p(y_j | s_j, y_{j-1}, c_j) is computed using a feedforward neural network with a softmax activation. We define the probability of the sequence y as:

p(y) = Π_{j=1}^{T_y} p(y_j | s_j, y_{j-1}, c_j)
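As an illustration, the attention step above (alignment scores from a feedforward scorer, softmax normalization, and a weighted sum of the source annotations) can be sketched with numpy. The dimensions, the additive scorer, and all parameter values below are hypothetical stand-ins, not the trained parameters of the actual system:

```python
import numpy as np

def attention_context(C, s_prime, W_a, U_a, v_a):
    """Compute the context vector c_j = sum_i alpha_ij * h_i.

    C        : (T_x, d)  source annotations h_1 .. h_Tx
    s_prime  : (d,)      intermediate decoder state s'_j
    W_a, U_a : (d, d)    projections inside the feedforward scorer f
    v_a      : (d,)      scoring vector
    """
    # Unnormalized alignment scores e_ij = v_a . tanh(W_a h_i + U_a s'_j)
    e = np.tanh(C @ W_a.T + s_prime @ U_a.T) @ v_a          # shape (T_x,)
    # Softmax over source positions gives the weights alpha_ij
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Context vector: weighted average of the source annotations
    return alpha @ C, alpha

# Toy example with random parameters (d = 4, T_x = 3)
rng = np.random.default_rng(0)
d, T_x = 4, 3
C = rng.normal(size=(T_x, d))
c, alpha = attention_context(C, rng.normal(size=d),
                             rng.normal(size=(d, d)),
                             rng.normal(size=(d, d)),
                             rng.normal(size=d))
assert np.isclose(alpha.sum(), 1.0) and c.shape == (d,)
```

The weights alpha sum to one over source positions, so c_j is always a convex combination of the annotations h_i.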

Length Control
To be able to produce compressed sentences, we parameterize our model with a length vector which allows us to control the output length. Our approach is similar to the LenInit model of Kikuchi et al. (2016), except that we use a GRU instead of an LSTM. The initial hidden state of the decoder consists of the average of the encoder's hidden states, as before, together with a length vector LV, a learnt parameter, which is scaled by the desired target length T'_y. We therefore rewrite Equation (1) as follows:

s_0 = tanh( W_init [ (1/T_x) Σ_{i=1}^{T_x} h_i ; LV · T'_y ] )    (8)

so that the model is now conditioned on the desired length:

p(y | x, T'_y) = Π_{j=1}^{T_y} p(y_j | s_j, y_{j-1}, c_j)

During training, the target length is set to the true target length, T'_y = T_y. At test time, however, the target length generally varies according to the domain, genre, and language at hand; we determine it experimentally based on a small validation set.
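A minimal sketch of the length-aware initialization, assuming (as the description above suggests) that the mean encoder state and the scaled length vector are concatenated before projection; the dimensions and random parameters are illustrative only:

```python
import numpy as np

def init_decoder_state(H, LV, target_len, W_init):
    """Initialize the decoder from the mean source annotation and a
    length vector LV scaled by the desired output length T'_y.

    H      : (T_x, d)          encoder hidden states
    LV     : (d_len,)          learnt length vector
    W_init : (d_dec, d+d_len)  projection matrix
    """
    mean_h = H.mean(axis=0)                 # average of encoder states
    length_feat = LV * target_len           # scale LV by desired length
    return np.tanh(W_init @ np.concatenate([mean_h, length_feat]))

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 8))
LV = rng.normal(size=4)
W_init = 0.05 * rng.normal(size=(8, 12))    # small weights, avoid saturation
s0_short = init_decoder_state(H, LV, target_len=5, W_init=W_init)
s0_long = init_decoder_state(H, LV, target_len=25, W_init=W_init)
# Different target lengths yield different initial decoder states
assert s0_short.shape == (8,) and not np.allclose(s0_short, s0_long)
```

Because the length feature enters the initial state, the same encoder output can be decoded toward different target lengths simply by changing `target_len`.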

Pivoting
Pivoting is often used in machine translation to overcome the shortage of parallel data, i.e., when there is no direct translation path from the source language to the target, by taking advantage of paths through an intermediate language. The idea dates back at least to Kay (1997), who observed that ambiguities in translating from one language into another may be resolved if a translation into some third language is available, and has met with success in phrase-based SMT (Wu and Wang, 2007; Utiyama and Isahara, 2007) and more recently in neural MT systems (Firat et al., 2016). We use pivoting to provide a path from a source English sentence, via an intermediate foreign language, to English in a compressed form. We extend Mallinson et al.'s (2017) approach to multi-pivoting, where a sentence x is translated into its K-best foreign pivots, F_x = {f_1, ..., f_K}. The probability of generating compression y = y_1 ... y_{T_y} is decomposed as:

p(y | x) = Σ_{f ∈ F_x} p(f | x) p(y | f)

which we approximate as the token-wise weighted average over the pivots:

p(y_j | y_<j, x) ≈ Σ_{k=1}^{K} p(f_k | x) p(y_j | y_<j, f_k)

where y_<j = y_1, ..., y_{j-1}. To ensure a probability distribution, we normalize the K-best list F_x such that the translation probabilities sum to one. We use beam search to decode tokens while conditioning on multiple pivot sentences; the hypotheses with the best decoding scores are considered candidate compressions.

To ensure the model produces compressed output, we extend the pivoting approach in two ways. In single-step compression, one of the translation models is parameterized with length information:

p(y | x, T'_y) = Σ_{k=1}^{K} p(f_k | x) p(y | f_k, T'_y)

In dual-step compression, we parameterize both translation models with length information:

p(y | x, T'_f, T'_y) = Σ_{k=1}^{K} p(f_k | x, T'_f) p(y | f_k, T'_y)

We find that dual-step compression performs better when the system is expected to drastically compress the source sentence (e.g., in a headline generation task). Imposing a high compression ratio from the start tends to produce unintelligible text: the model attempts to reduce the length of the source at all costs, even at the expense of being semantically faithful to the input. Performing two moderate compressions in succession reduces both length and content conservatively and, as a result, produces more meaningful text.

Figure 1: Histograms of output lengths at three compression rates (CR) compared to a vanilla encoder-decoder system which does not manipulate output length. German is used as pivot for English, and English as pivot for French and German.
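The token-wise weighted average over pivots described above can be sketched directly; the per-pivot next-token distributions below are toy values, not outputs of the actual translation models:

```python
import numpy as np

def combine_pivot_distributions(pivot_probs, token_dists):
    """Token-wise mixture over K pivots:
    p(y_j | y_<j, x) = sum_k p(f_k | x) * p(y_j | y_<j, f_k).

    pivot_probs : (K,)   translation scores of the K-best pivots
                         (renormalized to sum to one, as in the text)
    token_dists : (K, V) next-token distribution under each pivot
    """
    w = np.asarray(pivot_probs, dtype=float)
    w = w / w.sum()                         # normalize the K-best list
    return w @ np.asarray(token_dists)      # (V,) mixture distribution

# Toy vocabulary of 3 tokens, K = 2 pivots
mix = combine_pivot_distributions(
    [0.6, 0.2],                             # unnormalized K-best scores
    [[0.7, 0.2, 0.1],                       # p(y_j | ., f_1)
     [0.1, 0.8, 0.1]])                      # p(y_j | ., f_2)
assert np.isclose(mix.sum(), 1.0)           # still a valid distribution
```

With the scores normalized to [0.75, 0.25], the mixture is [0.55, 0.35, 0.10]; beam search would then expand hypotheses using this combined distribution at every step.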
In Figure 1 we illustrate how the pivot-based model sketched above can successfully control the length of the generated compressions. We show the output of a single-step compression model on three languages initialized with varying compression rates (see Section 4 for details on how the models were trained and tested). The compression rate (CR) is used to determine the length parameter of Equation (8), i.e., the desired target length is set to T'_y = CR · T_x. The figure shows how the output length varies compared to a vanilla encoder-decoder system which uses pivoting to backtranslate the source without manipulating output length.

Table 1: Example source sentences and compressions (English, French, German):
Source (en): On the very day that the earthquake struck, the European Council asked the High Representative and the Commission to mobilise all appropriate assistance.
Source (de): Am gleichen Tag, an dem das Erdbeben ausbrach, ersuchte der Europäische Rat die Hohe Vertreterin und die Kommission um die Mobilisierung aller angemessenen Hilfe.
Compression (en): Assistance was mobilized on the very day of the earthquake.
Source (en): We're at a tipping point in human history, a species poised between gaining the stars and losing the planet we call home.
Compression (en): We're at tipping point in human history, poised between gaining the stars and losing the Earth.
Compression (de): Wir sind vor einem historischen Wendepunkt: zwischen dem Griff nach Sternen und Verlust unseres Planeten.
Source (en): Surveys undertaken by the World Bank in developing countries show that when poor people are asked to name the three most important concerns they face good health is always mentioned.
Compression (en): World Bank surveys in developing countries show poor people always name good health as an important concern.
Compression (de): Umfragen in Entwicklungsländern zeigen, dass bei Armen das wichtigste Anliegen Gesundheit ist.


The MOSS Corpus
Crowdworkers were asked to compress sentences while preserving the most important information, ensuring that the output remained grammatical and meaning-preserving. Annotators were encouraged to use any rewriting operations that seemed appropriate, e.g., to delete words, add new words, substitute them, or reorder them. Annotation proceeded on a document-by-document basis, line by line. Crowdworkers compressed the first twenty lines of each document and we elicited five compressions per document. Example compressions are shown in Table 1.

Table 2 presents various statistics on our corpus. As can be seen, Europarl contains the longest sentences across languages (see column SL), TED contains the shortest, while the other two corpora are somewhere in between. We also observe that crowdworkers compress the least on TED (see column CR), which is not surprising given the brevity of the utterances. Overall, French speakers seem more conservative when shortening sentences compared to English and German speakers. In general, compression rates are genre dependent; they range from 0.58 (for English Europarl) to 0.84 (for German TED). We also examined the degree to which crowdworkers paraphrase the source sentence using Translation Edit Rate (TER; Snover et al., 2006), a measure commonly used to automatically evaluate the quality of machine translation output. We used TER to compute the (average) number of edits required to change a long sentence into its shorter counterpart, and report the number of edits by type, i.e., the number of insertions, substitutions, deletions, and shifts needed (on average) to convert long to short sentences. We observe that crowdworkers perform a fair amount of rewriting across corpora and languages. The most frequent rewrite operations are deletions, followed by substitutions, shifts, and insertions.
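Full TER counts insertions, substitutions, deletions, and shifts; as a rough illustration of the edit-type tally discussed above, word-level edits (without shift detection) can be counted with Python's difflib. This is not the TER implementation used in the paper, only a sketch of the idea:

```python
from difflib import SequenceMatcher

def edit_counts(source, compression):
    """Rough word-level edit tally between a long sentence and its
    compression: insertions, deletions, substitutions (no shifts,
    unlike full TER)."""
    src, cmp_ = source.split(), compression.split()
    counts = {"ins": 0, "del": 0, "sub": 0}
    for op, i1, i2, j1, j2 in SequenceMatcher(a=src, b=cmp_).get_opcodes():
        if op == "insert":
            counts["ins"] += j2 - j1
        elif op == "delete":
            counts["del"] += i2 - i1
        elif op == "replace":
            # count aligned pairs as substitutions, the rest as ins/del
            n_sub = min(i2 - i1, j2 - j1)
            counts["sub"] += n_sub
            counts["del"] += (i2 - i1) - n_sub
            counts["ins"] += (j2 - j1) - n_sub
    return counts

counts = edit_counts(
    "surveys in developing countries show that good health is always mentioned",
    "surveys show good health is an important concern")
assert counts["del"] > 0   # compression is dominated by deletions
```

On this toy pair the tally is 4 deletions, 2 substitutions, and 1 insertion, mirroring the observation that deletions are the most frequent operation.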

Experimental Setup
Neural Machine Translation Training Nematus was used as the machine translation system for all our experiments. We generally used the default settings and training procedures specified within Nematus. All networks have a hidden layer size of 1,000 and an embedding layer size of 512. In addition, layer normalization (Ba et al., 2016) was used. During training, we used ADAM (Kingma and Ba, 2014) with a minibatch size of 80, and the training set was reshuffled between epochs. We also employed early stopping. We used up to four encoder-decoder NMT models in our experiments (BLEU scores shown in parentheses): English→French (27.03), French→English (29.14), English→German (28.3), and German→English (31.19). German training/test data was taken from the WMT16 shared task and French data from the WMT14 shared task. The training data consisted of 4.2 million sentence pairs for en-de and 39 million for en-fr. We also used back-translated monolingual training data from the news domain (Sennrich et al., 2016a) when training the German systems.
The data was pre-processed using standard scripts found in MOSES (Koehn et al., 2007). Rare words were split into sub-word units, using byte pair encoding (BPE; Sennrich et al. 2016b). The BPE operations are shared between language directions.
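The BPE procedure of Sennrich et al. (2016b) iteratively merges the most frequent adjacent symbol pair in the vocabulary. A minimal sketch of merge learning, using the toy vocabulary from that paper (the naive string replace is for illustration only and assumes single-character symbols):

```python
from collections import Counter

def learn_bpe(vocab, num_merges):
    """Learn byte-pair-encoding merges from a word-frequency vocabulary.
    vocab maps space-separated symbol sequences (ending in </w>) to counts."""
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent pair
        merges.append(best)
        # Merge the pair everywhere (naive replace, fine for this toy)
        merged, joined = " ".join(best), "".join(best)
        vocab = {w.replace(merged, joined): f for w, f in vocab.items()}
    return merges, vocab

vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, vocab = learn_bpe(vocab, num_merges=3)
assert merges[0] == ("e", "s")   # "e s" occurs 6 + 3 = 9 times
```

At apply time the learnt merges are replayed, in order, on each new word; sharing one merge table between language directions (as done here) keeps subword segmentations consistent across the pivot.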
We experimented with various model variants using one or multiple pivots. The compression rate (see Equation (8)) was tuned experimentally on the validation set which consists of one document from each domain (20 source sentences; 100 compression-pairs). Compression rates varied from 0.55 to 0.85 and were broadly comparable to those shown in Table 2. 4 BLEU scores were calculated using mteval-v13a.pl.
Comparison Systems We compared our model against ABS, a sequence-to-sequence attention-based model developed by Rush et al. (2015). This model was trained on a monolingual dataset extracted from the Annotated English Gigaword corpus (Napoles et al., 2011), which consists of approximately 4 million pairs of the first sentence of each source document and its headline. We also trained LenInit (Kikuchi et al., 2016) on the same corpus; it is conceptually similar to ABS but additionally controls the output length using a length embedding vector (as described in Section 2.2). Unfortunately, we could not train these models for French or German, since no monolingual sentence compression datasets of a similar scale exist for these languages. An obvious workaround is to translate Gigaword into French and German and then train compression models on the translated data. As the quality of the translation is relatively poor, we also translated German or French into English, compressed it with ABS and LenInit trained on the Gigaword corpus, and then translated the compressions back to French or German. Finally, we include a prefix (Pfix) baseline which does not perform any rewriting but simply truncates the source sentence so that it matches the compression ratio of the validation set.
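The prefix baseline amounts to a simple truncation. A word-level sketch (the exact truncation unit used for Pfix is an assumption here):

```python
import math

def prefix_baseline(sentence, compression_rate):
    """Truncate a sentence to its first ceil(CR * length) words,
    mirroring a prefix baseline that performs no rewriting."""
    words = sentence.split()
    keep = max(1, math.ceil(compression_rate * len(words)))
    return " ".join(words[:keep])

out = prefix_baseline("the quick brown fox jumps over the lazy dog", 0.6)
assert out == "the quick brown fox jumps over"
```

Because the baseline keeps a contiguous prefix, it scores deceptively well on overlap-based metrics while never rewriting anything, a point returned to in the evaluation below.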

MOSS Evaluation
We assessed model performance using three automatic metrics which represent different aspects of the compression task and have been found to correlate well with human judgments (Toutanova et al., 2016; Clarke and Lapata, 2006). These include a recall metric based on skip bigrams, i.e., any pair of words in a sequence allowing for gaps of size four (RS-R); a recall metric based on bigrams of dependency tree triples (D2-R); and bigram ROUGE (R2-F1). We used the Stanford neural network parser (Chen and Manning, 2014) to obtain dependency triples.

Table 3(a) reports results on English with a model which controls the output length (L) and uses either a single pivot (SP; K = 1) or multiple pivots (MP; K = 10). We experimented with French (fr) or German (de) as pivot languages. All pivot-based models perform compression in a single step (see Section 2.3); dual-step compression obtained inferior results which we omit for the sake of brevity.

Table 3: pivot languages are English (en), French (fr), and German (de); ABS (Rush et al., 2015) and LenInit (Kikuchi et al., 2016) are sequence-to-sequence models trained on Gigaword; Gold is inter-annotator agreement.

Table 4: System output for the example source sentences in Table 1:
SP (en): We are at a turning point in human history and losing the planet we call home.
SP (de): Zwischen dem Griff der Sterne und dem Verlust unseres Planeten stehen wir vor.
ABS (en): Poor people ask to name the three most important concerns.
ABS (fr): Les enquêtes de la Banque mondiale révèlent que la santé fait toujours partie de la liste.
SP (en): Polls conducted by the World Bank show that when poor people are asked to mention the three main concerns.
SP (de): Wenn man die Armen nach den drei Hauptanliegen fragt, werden sie gefordert.

As can be seen, models which use a single pivot are better than those using multiple ones (German is a better pivot than French; see SP_de vs SP_fr). More pivots might introduce noise at the expense of translation quality.
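The skip-bigram recall metric (RS-R) described above can be sketched as follows; the exact tokenization and gap convention of the original metric are assumptions here:

```python
from itertools import combinations

def skip_bigrams(tokens, max_gap=4):
    """Ordered word pairs (i < j) with at most max_gap words between them."""
    return {(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2)
            if j - i - 1 <= max_gap}

def rs_recall(reference, candidate, max_gap=4):
    """Fraction of the reference's skip bigrams found in the candidate."""
    ref = skip_bigrams(reference.split(), max_gap)
    cand = skip_bigrams(candidate.split(), max_gap)
    return len(ref & cand) / len(ref) if ref else 0.0

score = rs_recall("the cat sat on the mat", "the cat sat")
assert 0.0 < score < 1.0   # partial overlap with the reference
```

Being recall-based, the metric rewards compressions that retain the reference's word pairs in order, which is why a long untruncated prefix can score deceptively well.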
Overall, pivot-based models outperform ABS and LenInit. This is perhaps to be expected since these models are tested on out-of-domain data with different vocabulary and writing conventions; MOSS does not contain any newspaper articles. Unfortunately, it is not possible to train ABS and LenInit on in-domain data, as compression data only exists for headline-first sentence pairs. As an upper bound, we also report how well humans agree with each other, treating one (randomly selected) reference as system output and computing its agreement with the rest (row Gold in Table 3). All models lag significantly behind human performance on this task.
Tables 3(b) and 3(c) report results on French and German, respectively. For these languages, we obtained best results with English as pivot, using a single-step compression model. ABS and LenInit perform poorly when trained directly on translations of Gigaword into French and German; their performance improves considerably when they are trained on Gigaword and used to compress English translations of French or German input (ABS_en, LenInit_en). Again, we observe that our models (SP_L,en, MP_L,en) outperform the comparison systems across all metrics and that using a single pivot yields better compressions. Example compressions are given in Table 4, where we show output produced by ABS and SP for each language (see the supplementary material for more examples). Finally, notice that automatic scores for the prefix baseline are misleadingly high across languages, since it simply repeats the source sentence up to a fixed length without performing any rewriting.

We also elicited human judgments through the Crowdflower platform. We asked crowdworkers to rate the grammaticality of the target compressions and whether they preserved the most important information from the source. In both cases, they used a five-point rating scale where a higher number indicates better performance. We randomly selected 25 sentences from each corpus in the test portion of MOSS, i.e., 100 long-short sentence pairs per language. We compared compressions generated by our model (SP_L) with the ABS models for the three languages, the prefix baseline, and (randomly selected) gold-standard reference (Ref) compressions from MOSS. All systems used the length parameter to allow comparisons at approximately the same compression rates. We collected five ratings per compression. Our results are summarized in Table 5, where we show mean ratings for grammaticality (Gram), importance (Imp), and their combination (column Avg).
Across languages our model (SP_L) significantly (p < 0.05) outperforms the comparison systems (Pfix, ABS) on both grammaticality and importance (significance tests were performed using a Student's t-test). All systems are significantly worse (p < 0.05) than the human reference compressions.
Finally, in Table 6 we analyze the output of our best model (SP_L) using the same statistics we applied to the human compressions (see Table 2). As can be seen, the model generally compresses more aggressively and applies more edits.


DUC-2004 Evaluation
We also evaluated our model on DUC-2004. Following Rush et al. (2015), the task consists of compressing the first sentence of each document and presenting this as the summary. To make the evaluation unbiased to length, the output of all systems is cut off after 75 characters and no bonus is given for shorter summaries. Our results are shown in Table 7. To compare with existing methods, we report ROUGE (Lin, 2004) unigram and bigram overlap and the longest common subsequence (ROUGE-L). We employed a dual-step compression model (see Section 2) as preliminary experiments showed that it was superior to single-step variants. We compared single and multiple pivot models against ABS and ABS+ (Rush et al., 2015), two encoder-decoder models trained on the English Gigaword; ABS+ applies minimum error rate (MERT) training as well as a copying mechanism. LenEmb and LenInit include a length parameter (Kikuchi et al., 2016), whereas RAS uses a specialized recurrent neural network architecture (Elman, 1990). We also report how well DUC-2004 abstractors agree with each other (row Gold in Table 7). Example compressions are given in Table 8, where we show output produced by SP_L,de and a human reference (see the supplementary material for further examples).

Table 7: ROUGE-1 / ROUGE-2 / ROUGE-L on DUC-2004: ABS (Rush et al., 2015) 26.55 / 7.06 / 22.05; ABS+ (Rush et al., 2015) 28.18 / 8.49 / 23.81; RAS 28.97 / 8.26 / 24.06; LenInit (Kikuchi et al., 2016) 25.87 / 8.27 / 23.24; LenEmb (Kikuchi et al., 2016) 26.73 / 8.40 / 23.88.

Table 8: Example output on DUC-2004:
Source: King Norodom Sihanouk has declined requests to chair a summit of Cambodia's top political leaders, saying the meeting would not bring any progress in deadlocked negotiations to form a government.
SP_L,de: King Norodom Sihanouk has refused to chair Cambodia summit.
Gold: Sihanouk refuses to chair Cambodian political summit at home or abroad.
Source: Cambodia's ruling party responded Tuesday to criticisms of its leader in the U.S. Congress with a lengthy defense of strongman Hun Sen's human rights record.
SP_L,de: Cambodia's ruling party responded Tuesday to criticism of its leader in the US.
Gold: Cambodian party defends leader Hun Sen against criticism of U.S. House.
Source: The Swiss government has ordered no investigation of possible bank accounts belonging to former Chilean dictator Augusto Pinochet, a spokesman said Wednesday.
SP_L,de: Swiss government ordered no inquiry into possible bank accounts of former Chilean dictator Augusto.
Gold: Switzerland joins charges against Pinochet but avoids bank probe.
Using automatic metrics we see that our model generally performs worse compared to these systems and that German is the best pivot for English. Although the objective of this paper is not to obtain state-of-the-art scores on this evaluation set, it is interesting to see that our model is able to compress out-of-domain. We do not have access to headline-first sentence pairs, while all comparison systems do. We also elicited human judgments on the compressions of 100 lead sentences whose documents were randomly selected from the DUC-2004 test set. We compared the prefix baseline, our model (SP L,de ), ABS+ (Rush et al., 2015), LenEmb (Kikuchi et al., 2016), Topiary (Zajic et al., 2004), and a randomly selected reference. Topiary came top in almost all measures in the DUC-2004 evaluation; it first compresses the lead sentence using linguistically motivated heuristics and then enhances it with topic keywords. Crowdworkers rated grammaticality and importance, using a five-point scale; we collected five ratings per compression.
As shown in Table 9, ABS+ has the lead, with our system following suit. In terms of grammaticality, ABS+ and SP_L,de are not significantly different from the gold standard or from each other (Pfix, Topiary, and LenEmb are significantly worse than Gold; p < 0.05). In terms of importance, pairwise differences between systems and the gold standard are not significant. Overall, we observe that SP_L,de performs comparably to ABS+ even though it was not trained on any compression-specific data. Inspection of system output reveals that our model performs more paraphrasing than the comparison systems (a conclusion also confirmed by the statistics in Table 6).

Conclusions
In this paper we have shown that multilingual corpora can be used to bootstrap compression models across languages and text genres. Our approach adapts existing neural machine translation machinery to the compression task coupled with methods which decode the output to a desired length. An interesting direction for future work would be to train our model using reinforcement learning (Ranzato et al., 2016;Zhang and Lapata, 2017) in order to control the compression output more directly. Moreover, although we do not use any direct supervision in our experiments, it would be interesting to incorporate it as a means of domain adaptation (Cheng et al., 2016).