A Deep Reinforced Model for Zero-Shot Cross-Lingual Summarization with Bilingual Semantic Similarity Rewards

Cross-lingual text summarization aims at generating a document summary in one language given input in another language. It is a practically important but under-explored task, primarily due to the dearth of available data. Existing methods resort to machine translation to synthesize training data, but such pipeline approaches suffer from error propagation. In this work, we propose an end-to-end cross-lingual text summarization model. The model uses reinforcement learning to directly optimize a bilingual semantic similarity metric between the summaries generated in a target language and gold summaries in a source language. We also introduce techniques to pre-train the model leveraging monolingual summarization and machine translation objectives. Experimental results in both English–Chinese and English–German cross-lingual summarization settings demonstrate the effectiveness of our methods. In addition, we find that reinforcement learning models with bilingual semantic similarity as rewards generate more fluent sentences than strong baselines.


Introduction
Cross-lingual text summarization (XLS) is the task of compressing a long article in one language into a summary in a different language. Due to the dearth of training corpora, standard sequence-to-sequence approaches to summarization cannot be applied to this task. Traditional approaches to XLS thus follow a pipeline, for example, summarizing the article in the source language and then translating the summary into the target language, or vice versa (Wan et al., 2010; Wan, 2011). Both of these approaches require separately trained summarization and translation models, and suffer from error propagation (Zhu et al., 2019). Prior studies have attempted to train XLS models in an end-to-end fashion, through knowledge distillation from pre-trained machine translation (MT) or monolingual summarization models (Ayana et al., 2018; Duan et al., 2019), but these approaches have only been shown to work for short outputs. Alternatively, Zhu et al. (2019) proposed to automatically translate source-language summaries in the training set, thereby generating pseudo-reference summaries in the target language. With this parallel dataset of source documents and target summaries, an end-to-end model is trained to simultaneously summarize and translate using a multi-task objective. Although the XLS model is trained end-to-end, it is trained on MT-generated reference translations and is still prone to compounding of translation and summarization errors.
In this work, we propose to train an end-to-end XLS model to directly generate target language summaries given the source articles by matching the semantics of the predictions with the semantics of the source language summaries. To achieve this, we use reinforcement learning (RL) with a bilingual semantic similarity metric as a reward (Wieting et al., 2019b). This metric is computed between the machine-generated summary in the target language and the gold summary in the source language. Additionally, to better initialize our XLS model for RL, we propose a new multi-task pretraining objective based on machine translation and monolingual summarization to encode common information available from the two tasks. To enable the model to still differentiate between the two tasks, we add task specific tags to the input (Wu et al., 2016).
We evaluate our proposed method on English-Chinese and English-German XLS test sets. These test corpora are constructed by first using an MT system to translate source summaries to the target language, and then having human annotators post-edit the translations. Experimental results demonstrate that just using our proposed pre-training method, without fine-tuning with RL, improves on the best-performing baseline by up to 0.8 ROUGE-L points. Applying reinforcement learning yields further improvements in performance by up to 0.5 ROUGE-L points. Through extensive analyses and human evaluation, we show that when the bilingual semantic similarity reward is used, our model generates summaries that are more accurate, longer, more fluent, and more relevant than summaries generated by baselines.

Model
In this section, we describe the details of the task and our proposed approach.

Problem Description
We first formalize our task setup. We are given N articles and their summaries in the source language, $\{(x^{(i)}_{src}, y^{(i)}_{src})\}_{i=1}^{N}$, as a training set. Our goal is to train a summarization model $f(\cdot; \theta)$ which takes an article in the source language $x_{src}$ as input and generates its summary in a pre-specified target language, $\hat{y}_{tgt} = f(x_{src}; \theta)$. Here, $\theta$ are the learnable parameters of $f$. During training, no gold summary $y^{(i)}_{tgt}$ is available. Our model consists of one encoder, denoted as $E$, which takes $x_{src}$ as input and generates its vector representation $h$. $h$ is fed as input to two decoders. The first decoder $D_1$ predicts the summary in the target language ($\hat{y}_{tgt}$) one token at a time. The second decoder $D_2$ predicts the translation of the input text ($v_{tgt}$). While both $D_1$ and $D_2$ are used during training, only $D_1$ is used for XLS at test time. Intuitively, we want the model to select parts of the input article which might be important for the summary and also translate them into the target language. To bias our model to encode this behavior, we propose the following algorithm for pre-training:
• Use a machine translation (MT) model to generate pseudo-reference summaries ($\tilde{y}_{tgt}$) by translating $y_{src}$ to the target language. Then, translate $\tilde{y}_{tgt}$ back to the source language using a target-to-source MT model and discard the examples with high reconstruction errors, which are measured with ROUGE (Lin, 2004) scores. The details of this step can be found in Zhu et al. (2019).
• Pre-train the model parameters $\theta$ using a multi-task objective based on MT and monolingual summarization objectives with some simple yet effective techniques, as described in §2.2.
• Further fine-tune the model using reinforcement learning with a bilingual semantic similarity metric (Wieting et al., 2019b) as the reward, as described in §2.3.
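The round-trip filtering in the first step can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of Zhu et al. (2019): `unigram_f1` is a simplified stand-in for ROUGE, the `translate` and `back_translate` callables are hypothetical placeholders for the two MT models, and the 0.45 threshold is an arbitrary value chosen for the example.

```python
from collections import Counter
from typing import Callable, List, Tuple


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 on whitespace tokens (stand-in for ROUGE)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def round_trip_filter(
    summaries: List[str],
    translate: Callable[[str], str],       # hypothetical source-to-target MT
    back_translate: Callable[[str], str],  # hypothetical target-to-source MT
    threshold: float = 0.45,               # assumed cut-off, not from the paper
) -> List[Tuple[str, str]]:
    """Keep (y_src, pseudo-reference y_tgt) pairs whose round trip stays close to y_src."""
    kept = []
    for y_src in summaries:
        y_tgt = translate(y_src)
        reconstructed = back_translate(y_tgt)
        if unigram_f1(reconstructed, y_src) >= threshold:
            kept.append((y_src, y_tgt))
    return kept
```

Pairs that survive the filter would then serve as the pseudo-parallel data for the supervised pre-training stage.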

Supervised Pre-Training Stage
Here, we describe the second step of our algorithm (Figure 2). The pre-training loss we use is a weighted combination of three objectives. Similarly to Zhu et al. (2019), we use an XLS pre-training objective and an MT pre-training objective as described below with some simple but effective improvements. We also introduce an additional objective based on distilling knowledge from a monolingual summarization model.
XLS Pre-training Objective ($L_{xls}$) This objective computes the cross-entropy loss of the predictions from $D_1$, considering the machine-generated summaries in the target language, $\tilde{y}^{(i)}_{tgt}$, as references, given $x^{(i)}_{src}$ as inputs. Per sample, this loss can be formally written as:

$$L_{xls} = -\sum_{t=1}^{M} \log P\left(\tilde{y}^{(i)}_{tgt,t} \mid \tilde{y}^{(i)}_{tgt,<t}, x^{(i)}_{src}\right)$$

where $M$ is the number of tokens in the summary $i$.

Joint Training with Machine Translation
Zhu et al. (2019) argue that machine translation can be considered a special case of XLS with a compression ratio of 1:1. In line with Zhu et al. (2019), we train $E$ and $D_2$ as the encoder and decoder of a translation model using an MT parallel corpus $\{(u^{(i)}_{src}, v^{(i)}_{tgt})\}$ of source sentences and their translations. The goal of this step is to give the encoder an inductive bias towards encoding information specific to translation. Similar to $L_{xls}$, the machine translation objective per training sample, $L_{mt}$, is:

$$L_{mt} = -\sum_{t=1}^{K} \log P\left(v^{(i)}_{tgt,t} \mid v^{(i)}_{tgt,<t}, u^{(i)}_{src}\right)$$

where $K$ is the number of tokens in $v^{(i)}_{tgt}$. The $L_{xls}$ and $L_{mt}$ objectives are inspired by Zhu et al. (2019). We propose the following two enhancements to the model to better leverage the two objectives:
1. We share the parameters of the bottom layers of the two decoders, $D_1$ and $D_2$, so that they share common high-level representations, while the parameters of the top layers, which are more specialized to decoding, are trained separately.
2. We append an artificial task tag SUM (during XLS training) and TRANS (during MT training) at the beginning of the input document to make the model aware of which kind of input it is dealing with.
We show in §4.1 that such simple modifications result in noticeable performance improvements.
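The tagging technique can be illustrated with a small sketch. The tag strings SUM and TRANS come from the text; the task names `"xls"` and `"mt"`, the function name, and the assumption that inputs are already tokenized into lists are hypothetical choices made for this example.

```python
from typing import List

SUM_TAG = "SUM"      # tag used during XLS training (from the text)
TRANS_TAG = "TRANS"  # tag used during MT training (from the text)


def add_task_tag(tokens: List[str], task: str) -> List[str]:
    """Prepend the task tag to a tokenized input so the encoder can tell tasks apart."""
    if task == "xls":   # hypothetical task name
        return [SUM_TAG] + tokens
    if task == "mt":    # hypothetical task name
        return [TRANS_TAG] + tokens
    raise ValueError(f"unknown task: {task}")
```

Because the tag is simply part of the input sequence, no architectural change is needed to make the shared encoder task-aware.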
Knowledge Distillation from Monolingual Summarization To bias the encoder to identify sentences that are most relevant to the summary, we first use an extractive monolingual summarization method to predict the probability $q_i$ of each sentence or keyword in the input article being relevant to the summary. We then distill knowledge from this model into the encoder $E$ by making it predict these probabilities. Concretely, we append an additional output layer to the encoder of our model, which predicts the probability $p_i$ of including the i-th sentence or word in the summary. The objective is to minimize the difference between $p_i$ and $q_i$. We use a per-sample loss $L_{dis}$ (equation 1) for the model encoder, computed over the $L$ sentences or keywords in each article. Our final pre-training objective during the supervised pre-training stage is:

$$L_{sup} = L_{xls} + L_{mt} + \lambda L_{dis} \quad (2)$$

where $\lambda$ is a hyper-parameter set to 10 in our experiments. Training with $L_{mt}$ requires an MT parallel corpus, whereas the other two objectives use the cross-lingual summarization dataset. The pre-training algorithm alternates between the two parts of the objective, using mini-batches from the two datasets as follows until convergence:
1. Sample a minibatch from the MT corpus of source sentences and their translations, $\{(u^{(i)}_{src}, v^{(i)}_{tgt})\}$, and train the parameters of $E$ and $D_2$ with $L_{mt}$.
2. Sample a minibatch from the XLS corpus, $\{(x^{(i)}_{src}, \tilde{y}^{(i)}_{tgt})\}$, and train the parameters of $E$ and $D_1$ with $L_{xls} + \lambda L_{dis}$.
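The alternating schedule above can be sketched as a simple loop. This is a schematic under several assumptions: `mt_step` and `xls_step` are hypothetical callables that would perform one parameter update and return the loss values, a fixed number of rounds stands in for "until convergence", and the batches are cycled rather than sampled at random.

```python
import itertools
from typing import Callable, List, Sequence, Tuple

LAMBDA = 10.0  # weight on the distillation loss, as stated in the text


def alternate_pretraining(
    mt_batches: Sequence,
    xls_batches: Sequence,
    mt_step: Callable[[object], float],                  # hypothetical: updates E and D2, returns L_mt
    xls_step: Callable[[object], Tuple[float, float]],   # hypothetical: updates E and D1, returns (L_xls, L_dis)
    num_rounds: int,                                     # assumed stand-in for a convergence check
) -> List[Tuple[float, float]]:
    """Alternate one MT update with one XLS update and log both losses per round."""
    mt_iter = itertools.cycle(mt_batches)
    xls_iter = itertools.cycle(xls_batches)
    log = []
    for _ in range(num_rounds):
        l_mt = mt_step(next(mt_iter))             # step 1: MT minibatch
        l_xls, l_dis = xls_step(next(xls_iter))   # step 2: XLS minibatch
        log.append((l_mt, l_xls + LAMBDA * l_dis))
    return log
```

In a real training setup the two step functions would wrap the optimizer updates for the respective parameter subsets.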

Reinforcement Learning Stage
For XLS, the target language reference summaries ($\tilde{y}_{tgt}$) used during pre-training are automatically generated with MT models and thus may contain errors. In this section, we describe how we further fine-tune the model using only human-generated source language summaries ($y_{src}$) with reinforcement learning (RL). Specifically, we first feed the article $x_{src}$ as input to the encoder $E$, and generate the target language summary $\hat{y}_{tgt}$ using $D_1$. We then compute a cross-lingual similarity metric between $\hat{y}_{tgt}$ and $y_{src}$ and use it as a reward to fine-tune $E$ and $D_1$.
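Following Paulus et al. (2018), we adopt two different strategies to generate $\hat{y}_{tgt}$ at each training iteration: (a) $\hat{y}^{s}_{tgt}$, obtained by sampling from the softmax layer at each decoding step, and (b) $\hat{y}^{g}_{tgt}$, obtained by greedy decoding. The RL objective per sample is given by:

$$L_{rl} = \left(r(\hat{y}^{g}_{tgt}) - r(\hat{y}^{s}_{tgt})\right) \sum_{t=1}^{M} \log P\left(\hat{y}^{s}_{tgt,t} \mid \hat{y}^{s}_{tgt,<t}, x_{src}\right) \quad (3)$$

where $r(\cdot)$ is the reward function and $M$ is the number of tokens in $\hat{y}^{s}_{tgt}$. To fine-tune the model, we use the following hybrid training objective: $\gamma L_{rl} + (1 - \gamma) L_{xls}$, where $\gamma$ is a scaling factor.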
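A small numeric sketch may help make the objective concrete. It assumes the self-critical formulation of Paulus et al. (2018), in which the greedy output acts as a baseline for the sampled output; the function names and the toy values in the usage note are hypothetical.

```python
from typing import List


def self_critical_loss(
    sample_log_probs: List[float],  # log P of each token in the sampled summary
    reward_sample: float,           # r(sampled summary)
    reward_greedy: float,           # r(greedy summary), used as the baseline
) -> float:
    """(r(greedy) - r(sample)) * sum_t log P(sampled token t).

    When the sample beats the greedy baseline the factor is negative, so
    minimizing the loss raises the log-probability of the sampled tokens.
    """
    return (reward_greedy - reward_sample) * sum(sample_log_probs)


def hybrid_loss(l_rl: float, l_xls: float, gamma: float) -> float:
    """Hybrid objective: gamma * L_rl + (1 - gamma) * L_xls."""
    return gamma * l_rl + (1.0 - gamma) * l_xls
```

For example, with token log-probabilities of -1.0 and -2.0, a sampled reward of 0.8 and a greedy reward of 0.5, the RL term evaluates to (0.5 - 0.8) × (-3.0) = 0.9.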
We train a cross-lingual similarity model (XSIM) following the best-performing model in Wieting et al. (2019b). This model is trained using an MT parallel corpus. Using XSIM, we obtain sentence representations for both $\hat{y}_{tgt}$ and $y_{src}$ and treat the cosine similarity between the two representations as the reward $r(\cdot)$.
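The reward computation itself is a plain cosine similarity between two vectors. The sketch below assumes the two sentence representations have already been produced by the similarity model; the function name and the zero-vector fallback are choices made for this example.

```python
import math
from typing import Sequence


def cosine_reward(emb_pred: Sequence[float], emb_ref: Sequence[float]) -> float:
    """Cosine similarity between the embeddings of the generated and gold summaries."""
    dot = sum(a * b for a, b in zip(emb_pred, emb_ref))
    norm = math.sqrt(sum(a * a for a in emb_pred)) * math.sqrt(sum(b * b for b in emb_ref))
    # Assumed convention: return 0 when either embedding is the zero vector.
    return dot / norm if norm > 0 else 0.0
```

Because the two embeddings live in a shared bilingual space, the reward can be computed without any target-language reference.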

Datasets
We evaluate our models on English-Chinese and English-German article-summary datasets. The English-Chinese dataset was created by Zhu et al. (2019) from the CNN/DailyMail monolingual summarization corpus (Hermann et al., 2015). The training, validation and test sets consist of about 364K, 3K and 3K samples, respectively. The English-German dataset is our contribution, constructed from the Gigaword dataset (Rush et al., 2015). We sample 2.48M training, 2K validation and 2K test samples from the dataset. Pseudo-parallel corpora for both language pairs are constructed by translating the summaries to the target language (and filtering after back-translation; see §2). This is done for the training, validation and test sets. These two pseudo-parallel training sets are used for pre-training with $L_{xls}$. Translated Chinese and German summaries of the test articles are then post-edited by human annotators to construct the test set for evaluating XLS. We refer the reader to Zhu et al. (2019) for more details. For the English-Chinese dataset, we use word-based segmentation for the source (articles in English) and character-based segmentation for the target (summaries in Chinese), as in Zhu et al. (2019). For the English-German dataset, byte-pair encoding (Sennrich et al., 2016) is used with 60K merge operations. For machine translation and training the XSIM model, we sub-sample 5M sentences from the WMT2017 Chinese-English and WMT2014 German-English training datasets (Bojar et al., 2014, 2017).

Implementation Details
We use the Transformer-BASE model (Vaswani et al., 2017) as the underlying architecture for our model ($E$, $D_1$, $D_2$, the extractive summarization model for distillation, and the baselines). We refer the reader to Vaswani et al. (2017) for hyperparameter details. In the input article, a special token SEP is added at the beginning of each sentence to mark sentence boundaries. For the CNN/DailyMail corpus, the monolingual extractive summarization model used in the distillation objective has the same architecture as the encoder $E$ and is trained on the CNN/DailyMail corpus constructed by Liu and Lapata (2019). To train the encoder with $L_{dis}$, we take the final hidden representation of each SEP token and apply a 2-layer feed-forward network with ReLU activation in the middle layer and sigmoid at the final layer to get $q_i$ for each sentence $i$ (see §2.2).
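The prediction head described above can be sketched in plain Python. The weights used in the usage note are hypothetical toy values, and `distillation_loss` uses a mean squared difference purely as an assumed stand-in for the distillation loss; the exact form used in the experiments may differ.

```python
import math
from typing import List


def sentence_probability(
    h: List[float],         # final hidden representation of one SEP token
    w1: List[List[float]],  # first-layer weights, one row per hidden unit
    b1: List[float],        # first-layer biases
    w2: List[float],        # second-layer weights
    b2: float,              # second-layer bias
) -> float:
    """2-layer feed-forward head: ReLU hidden layer followed by a sigmoid output."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, h)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * x for w, x in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def distillation_loss(p: List[float], q: List[float]) -> float:
    """Assumed stand-in: mean squared difference between student p_i and teacher q_i."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)
```

With an identity first layer, zero biases and unit second-layer weights, the input `[1.0, -1.0]` passes through ReLU as `[1.0, 0.0]` and yields a logit of 1.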
For the Gigaword dataset, because the inputs and outputs are typically short, we choose keywords rather than sentences as the prediction unit. Specifically, we first use TextRank (Mihalcea and Tarau, 2004) to extract all the keywords from the source document. Then, for each keyword $i$ that appears in the target summary, the gold label $q_i$ in equation 1 is set to 1, and $q_i$ is set to 0 for keywords that do not appear on the target side.
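The labeling rule for keywords can be sketched as follows. The sketch assumes that keyword extraction (for instance with TextRank) has already been done upstream and that a keyword "appears" in the summary when it matches a summary token exactly; both the function name and the matching rule are assumptions of this example.

```python
from typing import Dict, List


def keyword_labels(keywords: List[str], summary_tokens: List[str]) -> Dict[str, int]:
    """Assign q_i = 1 to keywords found in the summary and q_i = 0 otherwise."""
    summary_vocab = set(summary_tokens)  # exact token match (assumed criterion)
    return {kw: int(kw in summary_vocab) for kw in keywords}
```

For instance, with keywords `["tax", "city", "vote"]` and the summary tokens of "the city will vote", only "city" and "vote" receive a label of 1.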
We share the parameters of the bottom four layers of the decoder in the multi-task setting. We use the TRIGRAM model of Wieting et al. (2019b,a) to measure cross-lingual sentence semantic similarity. As pointed out in §2, after the pre-training stage, we only use $D_1$ for XLS. The final results are obtained using only $E$ and $D_1$. We use two metrics for evaluating the performance of models: ROUGE (1, 2 and L) (Lin, 2004) and XSIM (Wieting et al., 2019b).
Following Paulus et al. (2018), we set $\gamma$ in equation 3 to 0.998 for the Gigaword corpus and to 0.9984 for the CNN/DailyMail dataset.

Baselines
We compare our proposed method with the following baselines:
Pipeline Approaches We report results of summarize-then-translate (SUM-TRAN) and translate-then-summarize (TRAN-SUM) pipelines.
Table 1: Performance of different models. The highest scores are in bold and statistical significance compared with the best baseline is indicated with * (p < 0.05, computed using compare-mt (Neubig et al., 2019)). XSIM is computed between the target language system outputs and the source language reference summaries.

MLE-XLS
We pre-train $E$ and $D_1$ with only $L_{xls}$ without any fine-tuning.

MLE-XLS+MT
We pre-train $E$, $D_1$ and $D_2$ with $L_{xls} + L_{mt}$ without using $L_{dis}$. This is the best-performing model in Zhu et al. (2019). We show their reported results as well as results from our re-implementation.

MLE-XLS+MT+DIS
We pre-train the model using (2) without fine-tuning with RL. We also share the decoder layers and add task specific tags to the input as described in §2.2.
RL-ROUGE Using the ROUGE score as a reward function has been shown to improve summarization quality for monolingual summarization models (Paulus et al., 2018). In this baseline, we fine-tune the pre-trained model from the above baseline using ROUGE-L as the reward instead of the proposed XSIM. The ROUGE-L score is computed between the output of $D_1$ and the machine-generated summary $\tilde{y}_{tgt}$.

RL-ROUGE+XSIM
Here, we use the average of ROUGE score and XSIM score as a reward function to fine-tune the pre-trained model (MLE-XLS+MT+DIS).
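The two ROUGE-based rewards can be illustrated with a short sketch. `rouge_l_f1` is a simplified ROUGE-L that uses a plain F1 over the longest common subsequence, which is an assumption of this example rather than the exact scoring configuration used in the experiments; `combined_reward` is the arithmetic mean described for RL-ROUGE+XSIM.

```python
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: Sequence, reference: Sequence) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall (assumed variant)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def combined_reward(rouge_l: float, xsim: float) -> float:
    """Average of the ROUGE-L and XSIM scores, as in the RL-ROUGE+XSIM baseline."""
    return 0.5 * (rouge_l + xsim)
```

Note that the ROUGE term needs a target-language reference (here the machine-generated one), whereas the XSIM term only needs the source-language gold summary.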

Results
The main results of our experiments are summarized in Table 1. Pipeline approaches, as expected, show the weakest performance, lagging behind even the weakest end-to-end approach by more than 5 ROUGE-L points. TRAN-SUM performs even worse than SUM-TRAN, likely because the translation model is trained on sentences and not long articles: translating an article with many sentences first introduces far more errors than translating a short summary with fewer sentences would. Using just our pre-training method as described in §2.2 (MLE-XLS+MT+DIS), our proposed model outperforms the strongest baseline (MLE-XLS+MT) in both ROUGE-L (by 0.8) and XSIM (by 0.5).
Applying reinforcement learning to fine-tune the model with ROUGE (RL-ROUGE), XSIM (RL-XSIM), or their mean (RL-ROUGE+XSIM) as rewards results in further improvements. Our proposed method, RL-XSIM, performs the best overall, indicating the importance of using cross-lingual similarity as a reward function. RL-ROUGE uses a machine-generated reference to compute the rewards since target language summaries are unavailable, which might be a reason for its worse performance.
Table 2: Effect of using hard (EXTRACT) vs soft (DIS) extraction of summary sentences from the input article.

Analysis
In this section, we conduct experiments on the CNN/DailyMail dataset to establish the importance of every part of the proposed method and gain further insights into our model.

Soft Distillation vs. Hard Extraction
The results in Table 1 already show that adding the knowledge distillation objective $L_{dis}$ to the pre-training leads to an improvement in performance. The intuition behind using $L_{dis}$ is to bias the model to (softly) select sentences in the input article that might be important for the summary. Here, we replace this soft selection with a hard selection. That is, using the monolingual extractive summarization model (as described in §3.2), we extract the top 5 sentences from the input article and use them as the input to the encoder instead. We compare this method with $L_{dis}$ as shown in Table 2. With just MLE-XLS as the pre-training objective, EXTRACT shows an improvement in performance (albeit with lower overall numbers) but leads to a decrease in performance of MLE-XLS+MT. On the other hand, using the distillation objective helps in both cases.
Figure 3: Reinforcement learning can make the model better at generating long summaries. We use the compare-mt tool (Neubig et al., 2019) to get these statistics.
Table 3: Effect of sharing decoder layers and adding task-specific tags.

Effect of the Sharing and Tagging Techniques
In Table 3, we demonstrate that introducing simple enhancements like sharing the lower layers of the decoder (share) and adding task-specific tags (tags) during multi-task pre-training also helps improve performance while at the same time using fewer parameters and hence a smaller memory footprint.
Effect of Summary Lengths Next, we study how different baselines and our model perform with respect to generating summaries (in Chinese) of different lengths, in terms of number of characters. As shown in Figure 3, after fine-tuning the model with RL, our proposed model becomes better at generating longer summaries than the one with only pre-training (referred to as MLE-XLS+MT+DIS in the figure), with RL-XSIM performing the best in most cases. We posit that this improvement is due to RL-based fine-tuning reducing the problem of exposure bias introduced during teacher-forced pre-training, which especially helps longer generations.
Human Evaluation In addition to automatic evaluation, which can sometimes be misleading, we perform human evaluation of summaries generated by our models. We randomly sample 50 pairs of the model outputs from the test set and ask three human evaluators to compare the pre-trained supervised learning model and reinforcement learning models in terms of relevance and fluency. For each pair, the evaluators are asked to pick one out of: first model (MLE-XLS+MT+DIS; lose), second model (RL models; win), or say that they prefer both or neither (tie). The results are summarized in Table 4. We observe that the outputs of the model trained with ROUGE-L rewards are favored over the ones generated by the pre-trained-only model in terms of relevance but not fluency. This is likely because the RL-ROUGE model is trained using machine-generated summaries as references, which might lack fluency. Figure 4 displays one such example. On the other hand, cross-lingual semantic similarity as a reward results in generations which are favored in terms of both relevance and fluency.

Related Work
Most previous work on cross-lingual text summarization utilizes either the summarize-then-translate or translate-then-summarize pipeline (Wan et al., 2010; Wan, 2011; Yao et al., 2015; Ouyang et al., 2019). These methods suffer from error propagation and we have demonstrated their sub-optimal performance in our experiments. Recently, there has been some work on training models for this task in an end-to-end fashion (Ayana et al., 2018; Duan et al., 2019; Zhu et al., 2019), but these models are trained with cross-entropy using machine-generated summaries as references, which have already lost some information in the translation step.
Figure 4: Example outputs. The bilingual semantic similarity rewards can make the output more fluent than using ROUGE-L as rewards. "Sup" refers to the MLE-XLS+MT+DIS baseline.
Reference: A bill to raise the legal age to buy cigarettes was voted into law Wednesday by the City Council. New York is the largest US city to raise the purchase age above the federal limit of 18-years-old. The law is expected to go into effect early next year.
Sup: New York has become the largest purchase age in the United States. New York is not the first city to raise the legal drinking age.
RL-ROUGE: New York has become the largest purchase age in the United States, and the legal age has increased from 18 to 21. The City Council approved a law on Wednesday to increase the age of tobacco purchases from 18 to 21. New York is not the first city to raise the legal drinking age.
RL-XSIM: New York has become the largest city in the United States for buying cigarettes. The City Council approved a law on Wednesday to increase the age of tobacco purchases from 18 to 21. New York is not the first city to raise the legal drinking age.
Prior work in monolingual summarization has explored hybrid extractive and abstractive summarization objectives, which inspire our distillation objective (Gehrmann et al., 2018; Hsu et al., 2018; Chen and Bansal, 2018). This line of research mainly focuses on either compressing sentences extracted by a pre-trained model or biasing the prediction towards certain words.
Language generation models trained with cross-entropy using teacher-forcing suffer from exposure bias and a mismatch between training and evaluation objectives. To solve these issues, using reinforcement learning to fine-tune such models has been explored for monolingual summarization, where ROUGE rewards are typically used (Paulus et al., 2018; Pasunuru and Bansal, 2018). Other rewards such as BERT score have also been explored (Li et al., 2019). Computing such rewards requires access to the gold summaries, which are typically unavailable for cross-lingual summarization. This work is the first to explore using cross-lingual similarity as a reward to work around this issue.

Conclusion
In this work, we propose to use reinforcement learning with a bilingual semantic similarity metric as the reward for cross-lingual document summarization. We demonstrate the effectiveness of the proposed approach in a resource-deficient setting, where target language gold summaries are not available. We also propose simple strategies to better initialize the model for reinforcement learning by leveraging machine translation and monolingual summarization. In future work, we plan to explore methods for stabilizing reinforcement learning, as well as to extend our methods to other datasets and tasks, such as using the bilingual similarity metric as a reward to improve the quality of machine translation.