Jointly Learning to Align and Summarize for Neural Cross-Lingual Summarization

Cross-lingual summarization is the task of generating a summary in one language given a text in a different language. Previous works on cross-lingual summarization mainly focus on using pipeline methods or training an end-to-end model using the translated parallel data. However, it is a big challenge for the model to directly learn cross-lingual summarization as it requires learning to understand different languages and learning how to summarize at the same time. In this paper, we propose to ease the cross-lingual summarization training by jointly learning to align and summarize. We design relevant loss functions to train this framework and propose several methods to enhance the isomorphism and cross-lingual transfer between languages. Experimental results show that our model can outperform competitive models in most cases. In addition, we show that our model even has the ability to generate cross-lingual summaries without access to any cross-lingual corpus.


Introduction
Neural abstractive summarization has witnessed rapid growth in recent years. Variants of sequence-to-sequence models have been shown to obtain promising results on English (See et al., 2017) or Chinese summarization datasets. However, cross-lingual summarization, which aims at generating a summary in one language from input text in a different language, has rarely been studied because of the lack of parallel corpora.
Early research on cross-lingual abstractive summarization is mainly based on the summarization-translation or translation-summarization pipeline paradigm and adopts different strategies to incorporate bilingual features (Leuski et al., 2003; Orasan and Chiorean, 2008; Wan et al., 2010; Wan, 2011) into the pipeline model. Recently, Shen et al. (2018) proposed the first neural cross-lingual summarization system based on a large-scale corpus. They first translate the texts automatically from the source language into the target language and then use the teacher-student framework to train a cross-lingual summarization model. Duan et al. (2019) further improve this teacher-student framework by using genuine summaries paired with the translated pseudo source sentences to train the cross-lingual summarization model. Zhu et al. (2019) propose a multi-task learning framework to train a neural cross-lingual summarization model.
Cross-lingual summarization is a challenging task as it requires learning to understand different languages and learning how to summarize at the same time. It would be difficult for the model to directly learn cross-lingual summarization. In this paper, we explore this question: can we ease the training and enhance the cross-lingual summarization by establishing alignment of context representations between two languages?
Learning cross-lingual representations has proven to be a beneficial method for cross-lingual transfer in some downstream tasks (Klementiev et al., 2012; Artetxe et al., 2018; Ahmad et al., 2019). The underlying idea is to learn a shared embedding space for two languages to improve the model's ability for cross-lingual transfer. Recently, it has been shown that this method can also be applied to context representations (Aldarmaki and Diab, 2019; Schuster et al., 2019). In this paper, we show that learning cross-lingual representations is also beneficial for neural cross-lingual summarization models.
We propose a multi-task framework that jointly learns to summarize and align context-level representations. Concretely, we first integrate monolingual summarization models and cross-lingual summarization models into one unified model and then build two linear mappings to project the context representation from one language to the other. We then design several relevant loss functions to learn the mappers and facilitate cross-lingual summarization. In addition, we propose some methods to enhance the isomorphism and cross-lingual transfer between different languages. We also show that learning aligned representations enables our model to generate cross-lingual summaries even in a fully unsupervised way, where no parallel cross-lingual data is required.
We conduct experiments on several public cross-lingual summarization datasets. Experimental results show that our proposed model outperforms competitive models in most cases, and that our model also works in the unsupervised setting. To the best of our knowledge, we are the first to propose an unsupervised framework for learning neural cross-lingual summarization.
In summary, our primary contributions are as follows:
• We propose a framework that jointly learns to align and summarize for neural cross-lingual summarization and design relevant loss functions to train our system.
• We propose a procedure to train our cross-lingual summarization model in an unsupervised way.
• The experimental results show that our model outperforms competitive models in most cases, and our model has the ability to generate cross-lingual summaries even without any cross-lingual corpus.

Overview
We show the overall framework of our proposed model in Figure 1. Our model consists of two encoders, two decoders, two linear mappers, and two discriminators. Suppose we have an English source text $x = \{x_1, \ldots, x_m\}$ and a Chinese source text $y = \{y_1, \ldots, y_n\}$, which consist of $m$ and $n$ words, respectively. The English encoder $\phi_{E_X}$ (resp. Chinese encoder $\phi_{E_Y}$) transforms $x$ (resp. $y$) into its context representation $z_x$ (resp. $z_y$), and the decoder $\phi_{D_X}$ (resp. $\phi_{D_Y}$) reads the memory $z_x$ (resp. $z_y$) and generates the corresponding English summary $\tilde{x}$ (resp. Chinese summary $\tilde{y}$).
The mappers $M_X : Z_x \to Z_y$ and $M_Y : Z_y \to Z_x$ are used for transformations between $z_x$ and $z_y$, and the discriminators $D_X$ and $D_Y$ are used for discriminating between the encoded representations and the mapped representations.
Taking English-to-Chinese summarization as an example, our model generates cross-lingual summaries as follows. First, we use the English encoder to get the English context representations. Then we use the mapper to map the English representations into the Chinese space. Lastly, the Chinese decoder is used to generate Chinese summaries.
In Section 3, we describe the techniques we adopt to enhance the cross-lingual transferability of the model. In Section 4 and Section 5, we describe the unsupervised training objective and supervised training objective for cross-lingual summarization, respectively.

Normalizing the Representations
In our model, we adopt the Transformer (Vaswani et al., 2017) as our encoder and decoder, as in previous work (Duan et al., 2019; Zhu et al., 2019). The encoder and decoder are connected via cross-attention. The cross-attention is implemented as the following dot-product attention module:
$$\mathrm{Attention}(S, T) = \mathrm{softmax}\left(\frac{T S^{\top}}{\sqrt{d_k}}\right) S$$
where $S$ is the packed encoder-side contextual representation, $T$ is the packed decoder-side contextual representation, and $d_k$ is the model size.
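The dot-product cross-attention described above can be illustrated with a minimal single-head sketch in plain Python. It assumes no learned projections, masking, or multiple heads, and the function names `softmax` and `cross_attention` are illustrative rather than taken from the paper.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(S, T):
    """Single-head sketch of softmax(T S^T / sqrt(d_k)) S.

    S: encoder-side vectors (list of rows), T: decoder-side vectors.
    Learned projections and multi-head splitting are omitted (assumption).
    """
    d_k = len(S[0])
    outputs = []
    for t in T:
        # Scaled dot product of one decoder vector with every encoder vector.
        scores = [sum(a * b for a, b in zip(t, s)) / math.sqrt(d_k) for s in S]
        weights = softmax(scores)
        # Weighted average of the encoder vectors.
        outputs.append([sum(w * s[i] for w, s in zip(weights, S)) for i in range(d_k)])
    return outputs

S = [[1.0, 0.0], [0.0, 1.0]]   # two encoder positions
T = [[10.0, 0.0]]              # one decoder position, close to the first encoder vector
context = cross_attention(S, T)
```

Because the decoder vector points in the direction of the first encoder vector, almost all attention weight falls on that position. When the two sides live in differently distributed spaces, these dot products become less meaningful, which is the motivation given in the following paragraph.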
In the dot-product module, it would be beneficial if the contextual representations of the encoder and decoder have the same distributions. However, in the cross-lingual setting, the encoder and decoder deal with different languages and thus the distributions of the learned contextual representations may be inconsistent. This motivates us to explicitly learn alignment relationships between languages.
To make the contextual representations of the two languages easier to align, we introduce normalization into the Transformer model. Normalizing word representations has proved to be an effective technique for word alignment (Xing et al., 2015). After normalization, both sets of embeddings lie on a unit hypersphere, which makes them easier to align.
We achieve this by introducing the pre-normalization technique and replacing LayerNorm with ScaleNorm (Nguyen and Salazar, 2019):
$$o_{\ell+1} = o_\ell + F_\ell(\mathrm{ScaleNorm}(o_\ell))$$
where $F_\ell$ is the $\ell$-th layer and $o_\ell$ is its input. The formula for calculating ScaleNorm is:
$$\mathrm{ScaleNorm}(x; g) = g \, \frac{x}{\|x\|}$$
where $g$ is a hyper-parameter. An additional benefit of ScaleNorm is that, after normalization, the dot-product of two vectors $u^{\top} v$ is equivalent (up to the constant factor $g^2$) to their cosine similarity $\frac{u^{\top} v}{\|u\| \|v\|}$, which may benefit the attention module in the Transformer. We conduct experiments to verify this.
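The relationship between ScaleNorm and cosine similarity can be checked numerically. The sketch below is plain Python and assumes the common ScaleNorm definition (rescale a vector to length g); the helper names are illustrative, and the epsilon guard against zero vectors that practical implementations use is omitted.

```python
import math

def scale_norm(x, g=1.0):
    # Rescale x to have Euclidean length g (no epsilon guard; x must be non-zero).
    norm = math.sqrt(sum(v * v for v in x))
    return [g * v / norm for v in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [3.0, 4.0]
v = [1.0, 2.0]
g = math.sqrt(512)  # the value of g reported in the implementation details

normed_u = scale_norm(u, g)
dot_after_norm = dot(normed_u, scale_norm(v, g))
scaled_cosine = g * g * cosine(u, v)
```

Here `dot_after_norm` and `scaled_cosine` agree up to floating-point error, so after ScaleNorm the attention scores depend only on the angle between vectors and not on their original magnitudes.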

Enhancing the Isomorphism
A key assumption in aligning the representations of two languages is the isomorphism of the learned monolingual representations. Some researchers show that the isomorphism assumption weakens when two languages are etymologically distant (Søgaard et al., 2018; Patra et al., 2019). However, Ormazabal et al. (2019) show that this limitation is due to the independent training of two separate monolingual embeddings, and they suggest jointly learning cross-lingual representations on monolingual corpora. Inspired by Ormazabal et al. (2019), we take the following approaches to address the isomorphism problem.
First, we combine the English and Chinese summarization corpora and build a unified vocabulary. Second, we share encoders and decoders in our model. Sharing encoders and decoders also encourages the model to learn shared contextual representations across languages. For the shared decoder, we set the first token of the decoder input to indicate the target language the module is operating with. Third, we train several monolingual summarization steps before cross-lingual training, as shown in the first line of Alg. 1. These pre-training steps also allow the model to learn the easier monolingual summarization task first and then learn cross-lingual summarization, which may reduce the training difficulty.
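The first two steps (a unified vocabulary and a language-indicating first decoder token) can be sketched as follows. This is only an illustration under stated assumptions: the tag strings `<en>` and `<zh>`, the special tokens, and the function names are hypothetical, since the paper does not specify them.

```python
from collections import Counter

# Hypothetical language-tag tokens; the actual symbols are not given in the paper.
LANG_TAGS = {"en": "<en>", "zh": "<zh>"}

def build_unified_vocab(english_corpus, chinese_corpus, max_size=None):
    """Count tokens over both corpora and assign ids in one shared vocabulary."""
    counts = Counter()
    for sentence in list(english_corpus) + list(chinese_corpus):
        counts.update(sentence)
    specials = ["<pad>", "<unk>"] + sorted(LANG_TAGS.values())
    words = [word for word, _ in counts.most_common(max_size)]
    return {token: idx for idx, token in enumerate(specials + words)}

def decoder_input(target_lang, summary_tokens, vocab):
    """Prefix the summary with the tag that tells the shared decoder which language to emit."""
    tokens = [LANG_TAGS[target_lang]] + list(summary_tokens)
    unk = vocab["<unk>"]
    return [vocab.get(token, unk) for token in tokens]

english = [["police", "arrest", "suspect"]]
chinese = [["警方", "逮捕", "嫌疑人"]]
vocab = build_unified_vocab(english, chinese)
ids_zh = decoder_input("zh", ["警方", "逮捕"], vocab)
ids_en = decoder_input("en", ["police"], vocab)
```

With a shared decoder, the only signal that distinguishes Chinese from English generation in this sketch is the first id of the decoder input.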

Unsupervised Training Objective
We describe the objective of unsupervised cross-lingual summarization in this section. The whole training procedure can be found in Alg. 1.

Summarization Loss
Given an English text-summary pair $x$ and $x'$, we use the encoder $\phi_{E_X}$ and the decoder $\phi_{D_X}$ to generate the hypothetical English summary $\tilde{x}$ that maximizes the output summary probability given the source text: $\tilde{x} = \arg\max_{\bar{x}} P(\bar{x} \mid x)$. We adopt maximum log-likelihood training with cross-entropy loss between the hypothetical summary $\tilde{x}$ and the gold summary $x'$:
$$\mathcal{L}_{summ_X}(x, x') = -\sum_{t=1}^{T} \log P(x'_t \mid \tilde{x}_{<t}, x) \tag{3}$$
where $T$ is the length of $x'$. The Chinese summarization loss $\mathcal{L}_{summ_Y}$ is similarly defined for the Chinese encoder $\phi_{E_Y}$ and decoder $\phi_{D_Y}$.

Generative and Discriminative Loss
Given an English source text $x$ and a Chinese source text $y$, we use the encoders $\phi_{E_X}$ and $\phi_{E_Y}$ to obtain the contextual representations $z_x = \{z_{x_1}, \ldots, z_{x_m}\}$ and $z_y = \{z_{y_1}, \ldots, z_{y_n}\}$, respectively. For Zh-to-En summarization, we use the mapper $M_Y$ to map $z_y$ into the English context space: $z_{y \to x} = M_Y(z_y)$. We want the mapped distribution $z_{y \to x}$ and the real English distribution $z_x$ to be as similar as possible, so that the English decoder can handle cross-lingual summarization just like monolingual summarization.
To learn this mapping, we introduce two discriminators and adopt the adversarial training technique (Goodfellow et al., 2014). We optimize the mappers at the sentence level rather than the word level, inspired by Aldarmaki and Diab (2019), who found that learning the aggregate mapping can yield a better solution than word-level mapping.
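Concretely, we first average the contextual representations:
$$\bar{z}_x = \frac{1}{m} \sum_{i=1}^{m} z_{x_i}, \qquad \bar{z}_{y \to x} = \frac{1}{n} \sum_{i=1}^{n} (z_{y \to x})_i \tag{4}$$
Then we train the discriminator $D_X$ to discriminate between $\bar{z}_{y \to x}$ and $\bar{z}_x$ using the following discriminative loss:
$$\mathcal{L}_{dis_X}(\bar{z}_{y \to x}, \bar{z}_x) = -\log P_{D_X}(src = 0 \mid \bar{z}_{y \to x}) - \log P_{D_X}(src = 1 \mid \bar{z}_x) \tag{5}$$
where $P_{D_X}(src \mid \bar{z})$ is the probability predicted by $D_X$ that $\bar{z}$ comes from the real English representation ($src = 1$) or from the mapper $M_Y$ ($src = 0$).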
In our framework, the encoder $\phi_{E_X}$ and the mapper $M_Y$ together make up the generator. The generator tries to generate representations that confuse the discriminator, so its objective is to maximize the discriminative loss in Eq. 5. Alternatively, we train the generator to minimize the following generative loss:
$$\mathcal{L}_{gen_X}(\bar{z}_{y \to x}, \bar{z}_x) = -\log P_{D_X}(src = 1 \mid \bar{z}_{y \to x}) - \log P_{D_X}(src = 0 \mid \bar{z}_x) \tag{6}$$
Notice that since we use vector averaging and a linear transformation, it does not matter whether we apply the linear mapping before or after averaging the contextual representations, and the learned sentence-level mappers can be directly applied as word-level mappings.
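Two points from this subsection can be checked with a small plain-Python sketch: that a linear mapper commutes with averaging, and that the generative loss is the discriminative loss with the source labels flipped. The function names, the toy mapper, and the probabilities below are illustrative assumptions; a real discriminator would be a trained network producing these probabilities.

```python
import math

def mean_vec(vectors):
    # Average a list of equal-length vectors (the sentence-level representation).
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def linear_map(M, z):
    # Apply a linear mapper, given as a matrix of rows, to one vector.
    return [sum(w * v for w, v in zip(row, z)) for row in M]

def discriminative_loss(p_real_given_mapped, p_real_given_real):
    # Arguments are the discriminator's predicted probability of src = 1.
    return -math.log(1.0 - p_real_given_mapped) - math.log(p_real_given_real)

def generative_loss(p_real_given_mapped, p_real_given_real):
    # Same terms with the labels flipped: the generator wants to be mistaken for real.
    return -math.log(p_real_given_mapped) - math.log(1.0 - p_real_given_real)

M_Y = [[2.0, 0.0], [1.0, 1.0]]                  # toy linear mapper
z_y = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # toy word-level representations

map_then_avg = mean_vec([linear_map(M_Y, z) for z in z_y])
avg_then_map = linear_map(M_Y, mean_vec(z_y))

sharp_dis = discriminative_loss(0.1, 0.9)       # discriminator separates the sources well
sharp_gen = generative_loss(0.1, 0.9)
confused_dis = discriminative_loss(0.5, 0.5)    # discriminator cannot tell them apart
confused_gen = generative_loss(0.5, 0.5)
```

In this toy setting `map_then_avg` equals `avg_then_map`, the generator's loss is high when the discriminator is confident, and both losses coincide at 2 log 2 when the discriminator outputs 0.5 for every input.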

Cycle Reconstruction Loss
Theoretically, if we do not add additional constraints, there exist infinitely many mappings that can align the distributions of $\bar{z}_x$ and $z_y$, and thus the learned mappers may be invalid. In order to learn better mappings, we introduce the cycle reconstruction loss and the back-translation loss to enhance them.
Given $z_x$, we first use $M_X$ to map it to the Chinese space, and then use $M_Y$ to map it back:
$$\hat{z}_x = M_Y(M_X(z_x)) \tag{7}$$
We force $z_x$ and $\hat{z}_x$ to be consistent, constrained by the following cycle reconstruction loss:
$$\mathcal{L}_{cyc_X}(z_x, \hat{z}_x) = \| z_x - \hat{z}_x \| \tag{8}$$
The cycle reconstruction loss $\mathcal{L}_{cyc_Y}$ for $z_y$ and $\hat{z}_y$ is similarly defined.
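As a rough illustration of the cycle constraint, the sketch below maps a vector to the other space and back with two toy linear mappers and measures how far the round trip lands from the start. The Euclidean distance and the names `linear_map` and `cycle_loss` are assumptions made for this example; the paper's exact distance is not recoverable from the text above.

```python
import math

def linear_map(M, z):
    # Apply a linear mapper, given as a matrix of rows, to one vector.
    return [sum(w * v for w, v in zip(row, z)) for row in M]

def cycle_loss(z, M_X, M_Y):
    """Distance between z and its round trip M_Y(M_X(z)) (Euclidean, by assumption)."""
    z_hat = linear_map(M_Y, linear_map(M_X, z))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_hat)))

z_x = [1.0, 1.0]
M_X = [[2.0, 0.0], [0.0, 4.0]]              # toy English-to-Chinese mapper
M_Y_good = [[0.5, 0.0], [0.0, 0.25]]        # exact inverse of M_X
M_Y_bad = [[1.0, 0.0], [0.0, 1.0]]          # identity, does not undo M_X

loss_good = cycle_loss(z_x, M_X, M_Y_good)
loss_bad = cycle_loss(z_x, M_X, M_Y_bad)
```

The loss vanishes only when the two mappers undo each other, which rules out many of the degenerate mappings that adversarial training alone would accept.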

Back-Translation Loss
The cycle-reconstructed representation $\hat{z}_x$ in Eq. 8 can be regarded as augmented data to train the decoder, which is similar to back-translation in neural machine translation.
Concretely, we use the decoder $\phi_{D_X}$ to read $\hat{z}_x$ and generate the hypothetical summary $\tilde{x}$. The back-translation loss is defined as the cross-entropy loss between $\tilde{x}$ and the gold summary $x'$:
$$\mathcal{L}_{back_X}(\hat{z}_x, x') = -\sum_{t=1}^{T} \log P(x'_t \mid \tilde{x}_{<t}, \hat{z}_x) \tag{9}$$
The back-translation loss enhances not only the generation ability of the decoder but also the effectiveness of the mapper. The back-translation loss $\mathcal{L}_{back_Y}$ for $\hat{z}_y$ is similarly defined.
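
Total Loss
The total loss for optimizing the encoder, decoder, and mapper of the English side is a weighted sum of the above losses:
$$\mathcal{L}_X = \mathcal{L}_{summ_X} + \lambda_1 \mathcal{L}_{gen_X} + \lambda_2 \mathcal{L}_{cyc_X} + \lambda_3 \mathcal{L}_{back_X} \tag{10}$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting hyperparameters.
The total loss of the Chinese side is similarly defined, and the complete loss of our model is the sum of the English loss and the Chinese loss:
$$\mathcal{L} = \mathcal{L}_X + \mathcal{L}_Y \tag{11}$$
The total loss for optimizing the discriminators is:
$$\mathcal{L}_{dis} = \mathcal{L}_{dis_X} + \mathcal{L}_{dis_Y} \tag{12}$$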
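The weighted combination can be written as a small helper. This sketch assumes that the three weights apply, in order, to the generative, cycle reconstruction, and back-translation losses, and it uses the unsupervised weights reported in the implementation details (1, 5, 2) as defaults; the function name and the loss values are made up for illustration.

```python
def unsupervised_side_loss(l_summ, l_gen, l_cyc, l_back, lambdas=(1.0, 5.0, 2.0)):
    """Weighted sum for one language side.

    Assumption: lambdas weight the generative, cycle, and back-translation
    terms in that order; the summarization loss is unweighted.
    """
    lam1, lam2, lam3 = lambdas
    return l_summ + lam1 * l_gen + lam2 * l_cyc + lam3 * l_back

# Made-up loss values for one training step.
loss_x = unsupervised_side_loss(l_summ=2.0, l_gen=0.5, l_cyc=0.1, l_back=1.0)
loss_y = unsupervised_side_loss(l_summ=3.0, l_gen=1.0, l_cyc=0.2, l_back=0.5)
complete_loss = loss_x + loss_y  # English side plus Chinese side
```

The discriminator loss is kept out of this sum because the discriminators are updated separately from the encoders, decoders, and mappers.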

Supervised Training Objective
The supervised training objective contains the same summarization loss as the unsupervised training objective (Eq. 3). In addition, it has an X-summarization loss and a reconstruction loss.
Algorithm 1 Cross-lingual summarization
Input: English summarization data X and Chinese summarization data Y.
    [...]
    for k = 0 to dis_iters do
6:      Update $D_X$ and $D_Y$ on $\mathcal{L}_{dis}$ in Eq. 5.
7:  [...] on $\mathcal{L}_{rec}$ in Eq. 14.
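
X-Summarization Loss
Given a parallel English source text $x$ and Chinese summary $y'$, we use $\phi_{E_X}$, $M_X$, and $\phi_{D_Y}$ to generate the hypothetical Chinese summary $\tilde{y}$, then train them with cross-entropy loss:
$$\mathcal{L}_{xsum_X}(x, y') = -\sum_{t=1}^{T} \log P(y'_t \mid \tilde{y}_{<t}, x) \tag{13}$$
where $T$ here denotes the length of $y'$. The X-summarization loss for a Chinese text $y$ and English summary $x'$ is similarly defined.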

Reconstruction Loss
Since the cross-lingual summarization corpora are constructed by translating the texts to the other language, the English texts and the Chinese texts are parallel to each other. We can build a reconstruction loss to align the sentence representations of the parallel English and Chinese texts.
Specifically, supposing $x$ and $y$ are parallel source English and Chinese texts, we first use $\phi_{E_X}$ and $\phi_{E_Y}$ to obtain the contextual representations $z_x$ and $z_y$, respectively. Then we average the contextual representations to get their sentence representations and use the mappers to map them into the other language. Since the English and Chinese texts are translations of each other, the semantics of their sentence representations should be the same. Thus we design the following reconstruction loss:
$$\mathcal{L}_{rec_X}(\bar{z}_x, \bar{z}_y) = \| M_X(\bar{z}_x) - \bar{z}_y \| \tag{14}$$
and $\mathcal{L}_{rec_Y}$ is similarly defined. Notice that the generative and discriminative losses, the cycle reconstruction loss, and the back-translation loss are unnecessary here because we can directly use the aligned source texts with the objective in Eq. 14 to align the context representations.

Total Loss
The total loss for training the English side is:
$$\mathcal{L}_X = \mathcal{L}_{summ_X} + \lambda_1 \mathcal{L}_{xsum_X} + \lambda_2 \mathcal{L}_{rec_X} \tag{15}$$
where $\lambda_1$ and $\lambda_2$ are weighting hyperparameters. The total loss of the Chinese side is similarly defined.

Experiment Settings
We conduct experiments on English-to-Chinese (En-to-Zh) and Chinese-to-English (Zh-to-En) summarization. Following Duan et al. (2019), we translate the source texts to the other language to form the (pseudo) parallel corpus. Since they do not release their training data, we translate the source texts ourselves through the Google translation service. Note that the test sets provided by Zhu et al. (2019) are unprocessed, so we process the test samples they provided ourselves.

Dataset
Gigaword: The English Gigaword corpus (Napoles et al., 2012) contains 3.80M training pairs, 2K validation pairs, and 1,951 test pairs. We use the human-translated Chinese source sentences provided by Duan et al. (2019) for Zh-to-En tests.
DUC2004: The DUC2004 corpus only contains test sets. We use the model trained on the Gigaword corpus to generate summaries on the DUC2004 test sets. We use the 500 human-translated test samples provided by Duan et al. (2019) for Zh-to-En tests.
LCSTS: LCSTS (Hu et al., 2015) is a Chinese summarization corpus, which contains 2.40M training pairs, 10,666 validation pairs, and 725 test pairs. We use the 3K cross-lingual test samples provided by Zhu et al. (2019) for Zh-to-En tests.
CNN/DM: CNN/DM (Hermann et al., 2015) contains 287.2K training pairs, 13.3K validation pairs, and 11.5K test pairs. We use the 3K cross-lingual test samples provided by Zhu et al. (2019) for En-to-Zh cross-lingual tests.

Evaluation Metrics
We use ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (LCS) F1 scores as the evaluation metrics, which are the most commonly used evaluation metrics in the summarization task.
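For readers unfamiliar with these metrics, the following plain-Python sketch computes simplified ROUGE-N and ROUGE-L F1 scores for one candidate against one reference. It is not the official ROUGE toolkit: stemming, stopword handling, multiple references, and sentence-level splitting for ROUGE-L are all omitted, so its numbers are not directly comparable to reported scores.

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(overlap, n_candidate, n_reference):
    if overlap == 0 or n_candidate == 0 or n_reference == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)

def rouge_n(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    # Longest common subsequence length with a rolling one-row DP table.
    row = [0] * (len(b) + 1)
    for x in a:
        diagonal = 0
        for j, y in enumerate(b, 1):
            above = row[j]
            row[j] = diagonal + 1 if x == y else max(row[j], row[j - 1])
            diagonal = above
    return row[-1]

def rouge_l(candidate, reference):
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
r1 = rouge_n(candidate, reference, 1)
r2 = rouge_n(candidate, reference, 2)
rl = rouge_l(candidate, reference)
```

In this example five of six unigrams and three of five bigrams match, and the longest common subsequence has length five, so ROUGE-1 and ROUGE-L are both 5/6 and ROUGE-2 is 0.6.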

Competitive Models
For unsupervised cross-lingual summarization, we set the following baselines:
• Unified: It jointly trains English and Chinese monolingual summarization in a unified model and uses the first token of the decoder to control whether it generates Chinese or English summaries.
• Unified+CLWE: It builds a unified model and adopts pre-trained unsupervised cross-lingual word embeddings. The cross-lingual word embeddings are obtained by projecting embeddings from the source language to the target language. We use Vecmap (https://github.com/artetxem/vecmap) to learn the cross-lingual word embeddings.
For supervised cross-lingual summarization, we compare our model with Shen et al. (2018), Duan et al. (2019), and Zhu et al. (2019). We also consider the following baselines for comparison:
• Pipe-TS: The Pipe-TS baseline first uses a Transformer-based translation model to translate the source text to the other language, then uses a monolingual summarization model to generate summaries. To make this baseline stronger, we replace the translation model with the Google translation system and name it Pipe-TS*.
• Pipe-ST: The Pipe-ST baseline first uses a monolingual summarization model to generate the summaries, then uses a translation model to translate the summaries to the other language. We also replace the translation model with the Google translation system and name it Pipe-ST*.
• Pseudo: The Pseudo baseline directly trains a cross-lingual summarization model using the pseudo parallel cross-lingual summarization data.
• XLM Pretraining: This method is proposed by Lample and Conneau (2019), who pretrain the encoder and decoder on large-scale multilingual text using causal language modeling (CLM), masked language modeling (MLM), and translation language modeling (TLM) tasks.

Implementation Details
For the Transformer architecture, we use the same configuration as Vaswani et al. (2017), where the number of layers, model hidden size, feed-forward hidden size, and number of heads are 6, 512, 1024, and 8, respectively. We set $g = \sqrt{d_{model}} = \sqrt{512}$ in ScaleNorm. The mapper is a linear layer with a hidden size of 512, and the discriminator consists of two linear layers with a hidden size of 2048.
We use NLTK to process English texts and jieba to process Chinese texts. The vocabulary sizes for English and Chinese words are 50,000 and 80,000, respectively. We set $\lambda_1 = 1$, $\lambda_2 = 5$, $\lambda_3 = 2$ in unsupervised training and $\lambda_1 = 0.5$, $\lambda_2 = 5$ in supervised training according to the performance on the validation set. We set dis_iters = 5 in Alg. 1. We use the Adam optimizer (Kingma and Ba, 2014) with $\beta = (0.9, 0.98)$ for optimization. We set the learning rate to 3e-4 and adopt learning rate warm-up (Goyal et al., 2017) for the first 2,000 steps; the initial warm-up learning rate is set to 1e-7. We adopt dropout and set the dropout rate to 0.2.
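The warm-up schedule can be sketched as a function of the training step. The text only gives the start value (1e-7), the target value (3e-4), and the warm-up length (2,000 steps); the linear ramp and the constant rate after warm-up in this sketch are assumptions, and the function name is illustrative.

```python
def learning_rate(step, peak_lr=3e-4, init_lr=1e-7, warmup_steps=2000):
    """Learning rate at a given step.

    Assumptions: the ramp from init_lr to peak_lr is linear, and the rate
    stays at peak_lr after warm-up (any later decay is not specified).
    """
    if step >= warmup_steps:
        return peak_lr
    return init_lr + (peak_lr - init_lr) * step / warmup_steps

# Sample the schedule at a few steps.
schedule = [learning_rate(step) for step in (0, 500, 1000, 2000, 10000)]
```

Starting from a very small rate avoids large, noisy updates while the optimizer's moment estimates are still unreliable.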

Unsupervised Cross-Lingual Summarization
The experimental results of unsupervised cross-lingual summarization are shown in Table 2. Our model outperforms all baselines by a large margin. When simply training a unified model on all languages, the model's cross-lingual transferability is still poor, especially on the Gigaword dataset. Incorporating cross-lingual word embeddings into the unified model improves the performance, but the improvement is limited. We think this is because the cross-lingual word embeddings learned by Vecmap cannot leverage contextual information. Due to space limitations, we present case studies in the Appendix. After checking the summaries generated by the two baseline models, we find that they can generate readable texts, but the generated texts are far from the theme of the source text. This indicates that there is a large gap between the encoder and decoder of these baselines, such that the decoder cannot understand the output of the encoder. We also find that the summaries generated by our model are clearly more relevant, demonstrating that aligned representations between languages are helpful.
However, there is still a gap between our unsupervised results (Table 2) and supervised results (Table 1), indicating that our model has room for improvement.

Supervised Cross-Lingual Summarization
The experimental results of supervised cross-lingual summarization are shown in Table 1. Because no corpus is available for training a Chinese long-document summarization model, we do not experiment with the Pipe-TS model on the CNN/DM dataset. Comparing our results with the pipeline-based and Pseudo baselines, we find that our model outperforms all these baselines in all cases. Our model achieves an improvement of 0∼3 ROUGE points over the Pseudo model trained directly on the translated parallel cross-lingual corpus, and 1.5∼4 ROUGE-1 points over the pipeline models. We also observe that models using the Google translation system all perform better than models using the Transformer-based translation system. This may be because the Transformer-based translation system introduces some "UNK" tokens, and the Transformer-based translation system we trained ourselves does not perform as well as the Google translation system. In addition, Pipe-ST models perform better than Pipe-TS models, which is consistent with the conclusions of previous work. This is because (1) the translation process may discard some informative clauses, (2) the domain of the translation corpus is different from the domain of the summarization corpus, which introduces a domain discrepancy problem into the translation process, and (3) the translated texts are often "translationese". The Pseudo model performs better than Pipe-TS models but performs similarly to Pipe-ST models. Comparing our results with others, we find that our model outperforms Shen et al. (2018) and Duan et al. (2019) on both the Gigaword and DUC2004 test sets, and it outperforms Zhu et al. (2019) on the LCSTS dataset. However, our ROUGE scores are lower than those of Zhu et al. (2019) on the CNN/DM dataset, especially the ROUGE-2 score. Our model also performs worse than pretrained models.

Human Evaluation
We also perform a human evaluation. Since we cannot obtain the summaries generated by other models, we only compare with our baselines in the human evaluation. We randomly sample 50 examples from the Gigaword (Zh-to-En) test set and 20 examples from the CNN/DM (En-to-Zh) test set. We ask five volunteers to evaluate the quality of the generated summaries from the following three aspects: (1) Informativeness: how much do the generated summaries cover the key content of the source text? (2) Conciseness: how concise are the generated summaries? (3) Fluency: how fluent are the generated summaries? The scores range from 1 to 5, with 5 being the best. We average the scores and show the results in Table 3 and Table 4.
Our model exceeds all baselines in informativeness and conciseness scores, but gets a slightly lower fluency score than Pipe-ST*. We think this is because the Google translation system has the ability to identify grammatical errors and generate fluent sentences.

Ablation Tests
To study the importance of different components of our model, we also test some variants of our model. For supervised training, we consider the following variants: (1) without the (monolingual) summarization loss, (2) without mappers, (3) replacing ScaleNorm with LayerNorm, (4) without pre-trained monolingual steps, and (5) without sharing the encoder and decoder. For unsupervised training, we additionally consider variants without the cycle reconstruction loss or the back-translation loss. The results of the ablation tests for supervised and unsupervised cross-lingual summarization are shown in Table 5 and Table 6, respectively.
The role of the mappers is not obvious in the case of supervised training. We speculate that this may be due to the joint training of monolingual and cross-lingual summarization, since directly constraining the context representations before mapping can also yield shared (aligned) representations. However, mappers are crucial for unsupervised cross-lingual summarization. For supervised cross-lingual summarization, all components except the mappers contribute to the improvement in performance, and the performance decreases after removing any of them. For unsupervised cross-lingual summarization, all components contribute to the improvement in performance, and the mappers and the shared encoder/decoder are the key components.

Related Work
Early research on cross-lingual abstractive summarization is mainly based on monolingual summarization methods and adopts different strategies to incorporate bilingual information into the pipeline model (Leuski et al., 2003; Orasan and Chiorean, 2008; Wan et al., 2010; Wan, 2011; Yao et al., 2015).
Recently, some neural systems have been proposed for cross-lingual summarization (Shen et al., 2018; Duan et al., 2019; Zhu et al., 2019). The first neural cross-lingual summarization system was proposed by Shen et al. (2018), who first translate the source texts from the source language to the target language to form pseudo training samples. A teacher-student framework is adopted to achieve end-to-end cross-lingual summarization. Duan et al. (2019) adopt a similar framework to train the cross-lingual summarization model, but they translate the summaries rather than the source texts to strengthen the teacher network. Zhu et al. (2019) propose a multi-task learning framework that jointly trains cross-lingual summarization and monolingual summarization (or machine translation). They also released an English-Chinese cross-lingual summarization corpus built with the aid of online translation services.

Learning Cross-Lingual Representations
Learning cross-lingual representations is a beneficial method for cross-lingual transfer. Conneau et al. (2017) use adversarial networks to learn mappings between languages without supervision. They show that their method works very well for word translation, even for some distant language pairs like English-Chinese. Lample et al. (2018) learn word mappings between languages to build an initial unsupervised machine translation model, and then perform iterative back-translation to fine-tune the model. Aldarmaki and Diab (2019) propose to directly map the averaged embeddings of aligned sentences in a parallel corpus, and achieve better performance than word-level mapping in some cases.

Conclusions
In this paper, we propose a framework that jointly learns to align and summarize for neural cross-lingual summarization. We design training objectives for supervised and unsupervised cross-lingual summarization, respectively. We also propose methods to enhance the isomorphism and cross-lingual transfer between languages. Experimental results show that our model outperforms supervised baselines in most cases and outperforms unsupervised baselines in all cases.
We use the PCA algorithm (Wold et al., 1987) to visualize the pre- and post-aligned context representations of our model in Figure 2. The left picture shows the original distribution of the two languages, and the right picture shows the distribution after we map the Chinese representations to English. Figure 2 reveals that the representations of the two languages are originally separated but become aligned after our proposed procedure, which demonstrates that our proposed alignment procedure is effective.

B Case Studies
We show four cases of Chinese-to-English summarization in Table 7. Since most of the summaries generated by the other unsupervised baselines are meaningless (e.g., far from the theme of the source text, or consisting entirely of "UNK" tokens), we do not show their results here.