Adversarial Domain Adaptation Using Artificial Titles for Abstractive Title Generation

A common issue in training a deep learning abstractive summarization model is the lack of a large set of training summaries. This paper examines techniques for adapting from a labeled source domain to an unlabeled target domain in the context of an encoder-decoder model for text generation. In addition to adversarial domain adaptation (ADA), we introduce the use of artificial titles and sequential training to capture the grammatical style of the unlabeled target domain. Evaluation on adapting to and from news articles and Stack Exchange posts indicates that these techniques can boost performance for both unsupervised adaptation and fine-tuning with limited target data.


Introduction
Many types of textual content, such as conversations and posts on chat, do not have a title or summary. While multi-sentence extractive summarization can give a sense of the content of an article, a title or highlight is more concise. Such short summaries can be generated using abstractive summarization with an RNN encoder-decoder model.
A common issue when training models for abstractive summarization of conversations and posts is the lack of a large set of text with summaries. Obtaining good quality labeled data can be difficult and expensive, especially if author-generated summaries are desired. One option is to train on data from another domain with author-generated titles, but because of differences between domains, the performance may be less than adequate. These differences include different vocabularies, different grammatical styles, and different ways of expressing similar concepts. Vocabulary expansion may be used to address the different vocabularies in source and target domains, and adversarial domain adaptation (ADA) may be used to merge the embedded feature representations across domains. However, ADA does not adapt the decoder in an encoder-decoder generation model.
In this paper, we investigate the utility of these techniques in unsupervised domain adaptation for title generation. We also examine the use of a limited amount of labeled training data from the target domain, when high performance may be required but training data is not easily available. Our contributions include (1) proposing the use of artificial titles for unlabeled target documents to train a decoder to learn the grammatical style of titles in the new domain, (2) proposing to train the decoder in a sequence of steps that encourages the source and target embedding spaces to remain aligned during adaptation, and (3) showing that our model improves performance over ADA and an expanded vocabulary alone and, further, that a limited amount of labeled target data can achieve performance close to training on all labeled target data.

Related Work
Our model draws from work on abstractive summarization and unsupervised domain adaptation. Recently, a number of neural encoder-decoder models have been proposed for abstractive summarization, e.g., (Rush et al., 2015; Chen et al., 2016a; Chopra et al., 2016; Li et al., 2017; Narayan et al., 2018; Hsu et al., 2018), with one of the better performing models being (See et al., 2017), which serves as our base model. Supervised domain adaptation methods have been proposed for generative models. (Hua and Wang, 2017) found that pre-training an abstractive summarizer with extractive summaries does not always improve performance, but (Chen et al., 2015) noted that fine-tuning a model trained on source domain data with limited target domain data does improve performance.
A variety of techniques have been proposed for unsupervised domain adaptation of deep learning systems for classification, e.g., (Hsu et al., 2017; Tzeng et al., 2017; Ganin et al., 2016; Chen et al., 2016b; Ghifary et al., 2016). However, all used the aligned encoder representation for classification but not generation.
We adapt the domain-adversarial method for feature alignment in an encoder proposed by (Ganin et al., 2016). However, for text generation, a domain-independent representation from the encoder, as used in domain adaptation for classification, is not adequate. We also require the decoder to be adapted to varying domains to generate output appropriate for the target domain, an issue that we investigate in the context of title generation.
Jointly training a translation model with mixed labeled data from two domains can improve performance over training on one domain only (Pryzant et al., 2017). In contrast, our domain adaptation method trains sequentially on data, first with the unlabeled target domain data.

Domain-Adapted Title Generation
Our goal is to improve performance when labeled data from one domain, the source, is used to train a model which is then applied to another domain with no or only limited labeled data, the target.

Adversarial Domain Adaptation (ADA)
The embedded representation generated by the encoder, which represents the "concepts" in the input text, may differ across domains. To address this, we adapt the method proposed by (Ganin et al., 2016), which uses a domain classifier to force the concept representations to align across domains.
We use an encoder-decoder RNN model with domain adaptation (Figure 1) for title generation. Labeled source data is fed to the encoder and the decoder learns to generate summary titles. At the same time, the source data and unlabeled target domain data are encoded by a bidirectional LSTM as their concept representations, and the domain classifier tries to learn to differentiate between the representations of the two domains.
The domain classifier has two dense, 100-unit hidden layers followed by a softmax. The concept representation vector is computed as the bidirectional LSTM encoder's final forward and backward hidden states concatenated into a single state. During training, the gradient from the domain classifier, $\frac{\partial L_d}{\partial \theta_d}$, is "reversed" to be negative before being propagated back through the encoder as $-\frac{\partial L_d}{\partial \theta_c}$, encouraging the embedded representations to align by adjusting the feature distributions to maximize the loss of the domain classifier.
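The gradient reversal described above can be illustrated with a minimal, framework-free sketch. The class name `GradientReversal` and the list-of-floats representation are hypothetical and chosen only for illustration; scaling the reversed gradient by λ follows the general approach of (Ganin et al., 2016) and is an assumption about how λ enters the encoder update here.

```python
class GradientReversal:
    """Identity in the forward pass; negates (and scales) the gradient in the backward pass.

    Hypothetical illustration only: a real model would use its framework's
    automatic differentiation instead of explicit forward/backward methods.
    """

    def __init__(self, lam=1.0):
        # lam is the weight applied to the reversed gradient (assumption).
        self.lam = lam

    def forward(self, features):
        # The concept representation reaches the domain classifier unchanged.
        return list(features)

    def backward(self, grad_from_classifier):
        # Flip the sign so the encoder is pushed to increase the classifier loss.
        return [-self.lam * g for g in grad_from_classifier]


grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
reversed_grad = grl.backward([1.0, -2.0, 4.0])
```

Because the forward pass is the identity, the domain classifier itself is still trained to separate the domains; only the signal reaching the encoder is inverted.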
In contrast to the two classification losses used by (Ganin et al., 2016) for training the model, we use the generated sequence loss together with the adversarial domain classifier loss. Following (See et al., 2017), the decoder (sequence) loss is the negative log likelihood of the target word $w_t^*$ at position $t$. The domain classifier loss, $L_d$, is the cross-entropy loss between the predicted and true domain label probabilities, and $\lambda$ is a parameter relating the two losses. We followed the schedule from (Ganin et al., 2016) for adjusting $\lambda$ for the encoder: $\lambda$ was increased from 0.0 to 1.0 by increasing $p$ from 0.0 to 1.0 over 5000 iterations, at which point we observed that the domain adaptation classifier loss was reaching an asymptote. $\lambda$ was then held equal to 1.0 and training continued until validation performance for title generation reached an asymptote (when training on artificial titles or source data) or overtraining occurred (when training on limited target data). When updating the domain classifier, $\lambda$ was set equal to one.
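As a rough sketch of the schedule, the function below ramps λ with the sigmoid-shaped rule λ = 2 / (1 + exp(-γp)) - 1 from (Ganin et al., 2016) and then holds it at 1.0. The value γ = 10 is the one suggested in that work and is an assumption here, as are the function and parameter names.

```python
import math


def ada_lambda(iteration, ramp_iterations=5000, gamma=10.0):
    """Return the adversarial weight for the encoder at a given training iteration.

    p grows linearly from 0 to 1 over `ramp_iterations`; gamma=10.0 is an
    assumed value taken from Ganin et al. (2016), not from this paper.
    """
    p = min(max(iteration / ramp_iterations, 0.0), 1.0)
    if p >= 1.0:
        # After the ramp, lambda is held at exactly 1.0.
        return 1.0
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0


# Weight used for the encoder at a few points in training.
schedule = [ada_lambda(i) for i in (0, 1250, 2500, 5000, 8000)]
```

With these settings λ rises quickly: it is already above 0.98 halfway through the ramp, so most of the 5000 iterations are spent close to full adversarial weight.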

Artificial Titles
The style of the unlabeled target may be different from the source, e.g., Stack Exchange is more casual and includes more slang than news articles.
To capture the style of the unlabeled target, "artificial" titles were synthesized. Since titles tend to be short and encoder-decoder models learn to model sentence length, target texts of 4 to 10 words in length were selected. A common summary baseline is the first few sentences of a news article, e.g., (Zajic et al., 2004); some social media sites, including Trip Advisor, Facebook and Reddit, display the first words of long posts. For example, this paragraph might be shown as "The style of the unlabeled target may ...". The first text meeting the length requirement was selected 90% of the time and the second text meeting the requirement was selected otherwise. For Stack Exchange, the text was a sentence from a post, and for news, where titles are often phrases, the text was a clause. When training on the first text only, the loss dropped below 0.001 in less than 3k iterations, indicating the model had learned to copy from the first sentence. Use of the second text discourages this so that both the encoder and decoder are trained on text from the target domain (enabling use of an expanded, joint vocabulary trained on both source and target) to learn its style and vocabulary. However, the artificial titles will generally be different from the real titles, which may lead to lower summarization performance.
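The selection rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the naive punctuation-based splitter stands in for whatever sentence or clause segmentation was actually used, the fallback to the first candidate when only one text qualifies is an assumption, and all names are hypothetical.

```python
import random
import re

MIN_WORDS, MAX_WORDS = 4, 10   # length window for an artificial title
FIRST_PROB = 0.9               # probability of choosing the first qualifying text


def split_units(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [u.strip() for u in re.split(r"(?<=[.!?])\s+", text) if u.strip()]


def artificial_title(text, rng=random):
    """Pick an artificial title for one unlabeled target document, or None."""
    candidates = [
        u for u in split_units(text)
        if MIN_WORDS <= len(u.split()) <= MAX_WORDS
    ]
    if not candidates:
        return None
    # Assumption: with a single qualifying text, always use it.
    if len(candidates) == 1 or rng.random() < FIRST_PROB:
        return candidates[0]
    return candidates[1]


post = ("Hi all. I cannot get my router to connect. "
        "It worked fine until yesterday morning. Thanks.")
title = artificial_title(post, random.Random(0))
```

Passing the random source as `rng` keeps the 90/10 choice reproducible, which is convenient when regenerating the artificial training set.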

Sequential Training
Our adaptation method, ASADA, is shown in Figure 2: a) A model with a joint vocabulary is first pre-trained on artificial titles for the unlabeled target domain (Section 3.2). b) The embedding space of the pre-trained model is then adapted to the source domain using ADA (Section 3.1) to continue training on the target domain with the source domain as the auxiliary adaptation data. c) With a joint embedding space defined, the model is trained on the source domain, which has title-text pairs, and the unlabeled target domain is used as the auxiliary adaptation data to keep the model embedding aligned with the target data.
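The three stages can be summarized as a small schedule. The sketch below only encodes which data each stage uses, as described above; the `Stage` fields, `ASADA_STAGES`, and the `train_stage` callback are hypothetical names, and the actual training loop is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    decoder_data: str    # domain whose (text, title) pairs train the encoder-decoder
    titles: str          # "artificial" or "real"
    auxiliary_data: str  # domain used only for adversarial alignment; "" means no ADA


ASADA_STAGES = [
    Stage("a) pre-train on target", "target", "artificial", ""),
    Stage("b) align embedding to source", "target", "artificial", "source"),
    Stage("c) learn to summarize", "source", "real", "target"),
]


def run_schedule(stages, train_stage):
    """Run the stages in order, handing the model from one stage to the next.

    `train_stage(model, stage)` is a caller-supplied (hypothetical) function that
    trains until its stopping criterion and returns the updated model.
    """
    model = None
    for stage in stages:
        model = train_stage(model, stage)
    return model
```

The one-step alternative evaluated later corresponds to skipping stage b), so that the decoder's training domain and the adversarial alignment change at the same time.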

Dataset
We used data from two domains: the public CNN/Dailymail (News) dataset used by (See et al., 2017) and posts from 20 Stack Exchange (StackEx) channels, with a bias towards those that are business related (see Appendix A for details). To reduce training time, each article was truncated to 200 words. For fine-tuning, we limited the data to examples with titles of 10 words or fewer, because some longer titles were sentences rather than titles (see Table 1). The News datasets were formatted as in (See et al., 2017). The StackEx dataset was randomly divided into train (90%), validation (5%) and test (5%).
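The preprocessing described above can be sketched in a few lines. Whitespace tokenization, the fixed seed, and all function names are assumptions made for illustration only.

```python
import random

MAX_ARTICLE_WORDS = 200
MAX_TITLE_WORDS = 10


def truncate(text, max_words=MAX_ARTICLE_WORDS):
    # Keep only the first `max_words` whitespace-separated words (assumed tokenization).
    return " ".join(text.split()[:max_words])


def prepare(pairs):
    """Truncate each text and drop examples whose title exceeds the length limit."""
    return [
        (truncate(text), title)
        for text, title in pairs
        if len(title.split()) <= MAX_TITLE_WORDS
    ]


def split_dataset(examples, seed=0):
    """Random 90% / 5% / 5% split into train, validation and test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 90 // 100
    n_val = len(shuffled) * 5 // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

Integer arithmetic is used for the split sizes so that the three parts always partition the data exactly, with any remainder going to the test set.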

Experiments
For all experiments, the Pointer-Generator model by (See et al., 2017) was used without coverage as our base model, since coverage is an additional training step that would add an additional variable to the comparisons. Although coverage improves performance by reducing repetitive words, we chose to examine the effects of different domain adaptation methods without it. To handle differences in vocabulary, the vocabularies of the labeled source and unlabeled target domains were combined. The union of the 50k most frequent terms from the training data of each domain produced a joint vocabulary of about 85k terms. When an individual vocabulary was used, the size was 50k words. When sequential training was used, a model was trained until the loss on a validation set reached an asymptote. Domain adaptation experiments from News to StackEx and from StackEx to News were conducted, first without target domain summary titles and then with a limited amount of target domain titles.
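The joint vocabulary construction can be sketched as the union of two top-k term sets. The function names and the toy token lists are hypothetical; the paper uses k = 50,000 per domain.

```python
from collections import Counter


def top_k_vocab(token_lists, k):
    """Return the set of the k most frequent terms across tokenized documents."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, _ in counts.most_common(k)}


def joint_vocab(source_docs, target_docs, k=50_000):
    # Union of each domain's top-k terms; shared terms are counted once,
    # so the result has between k and 2k entries.
    return top_k_vocab(source_docs, k) | top_k_vocab(target_docs, k)


# Toy data standing in for tokenized News (source) and StackEx (target) text.
source_docs = [["the", "the", "the", "market", "market", "fell"]]
target_docs = [["the", "the", "the", "router", "router", "lol"]]
vocab = joint_vocab(source_docs, target_docs, k=2)
```

Because frequent terms overlap across domains, the union is smaller than the sum of the two vocabularies, which is consistent with the reported size of about 85k rather than 100k terms.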

Unsupervised Target Domain Adaptation
For our investigations on domain adaptation when labeled target domain data is unavailable, models trained on source domain labels only and with a mix of source domain labels and artificial target labels are our baselines.
Effect of ADA and Vocabulary. The top section of Table 2 examines the effect of a joint vocabulary and of ADA when training on the source. The mixed results using a joint vocabulary reflect the better coverage of the added target words outside the source's top-50k vocabulary when the source is News vs. StackEx (see Appendix B). When a joint vocabulary (S+T) is used, ADA (c) improves performance over training only on the source S (b), as expected.
Effect of Artificial Titles and Sequential Training. The second section of Table 2 compares approaches using artificial titles:
(d) $T_{art}$: a model pre-trained on target domain articles/posts with artificial target domain titles.
(e) $T_{art}$, $S^{ADA}$: model (d), further trained on the source with ADA to the target without labels.
(f) $T_{art}$, $T_{art}^{ADA}$, $S^{ADA}$: ASADA. Model (d), followed by adapting the model, which has been trained on the target domain with non-optimal summaries, to source data, aligning the embedded representations of the two domains. Then the model is trained on source data with ADA to the unlabeled target to learn how to summarize while keeping the embedded representations aligned.
(g) ASADA using the lead-1 (first) sentence in place of $T_{art}$.
The better performance of (f) over (g) supports ASADA's use of artificial titles.
ASADA's two-step adaptation with artificial titles performed best out of all models. The mixed performance of training on $T_{art}$, (d) vs. (b), indicates that the artificial title quality is lower for StackEx. The only slightly better performance of (e) over (c) indicates that applying $S^{ADA}$ directly forgets much of $T_{art}$. The relative improvement of ASADA over training only on source was 25% (from News to StackEx) and 30% (from StackEx to News). This indicates that $T_{art}^{ADA}$ allows the model to remember the vocabulary and style from $T_{art}$ while learning how to summarize by $S^{ADA}$.
Table 3 illustrates differences between the one-step adaptation model (e), with id (E), and the two-step adaptation used in ASADA (F1 and F2). In both, the model is first trained on the target domain using $T_{art}$. In model (e), ADA then trains the encoder on source only and ignores $T_{art}$, gradually giving greater weight to the domain classifier, which uses the target data (see Sec. 3.1). At the same time, the labeled data domain is switched to the source domain, so that both the embedding and decoder domains are abruptly changed. In contrast, in ASADA the embedding is gradually adapted from the target domain to jointly embed the source and target (F1). Only then is the target domain changed (F2).
In the third section, the labeled source is mixed with target domain artificial titles and trained using (Pryzant et al., 2017)'s Discriminative Mixed (DM) and Adversarial Discriminative Mixed (ADM) machine translation models. ADM is similar to ADA in that both use an adversarial classifier; however, for ADM both domains have labeled data. ASADA's better performance indicates that first pre-training with artificial titles to learn vocabulary and style and then adapting to the source to learn to summarize is better than jointly mixing artificial and true titles.

Limited Target Domain Labels
We next examine adaptation performance when a limited amount of labeled data is available for the target domain. Our best model for each domain, ASADA, is refined by training on various percentages of the labeled target domain training data and referred to as '* DA' in Figure 3. For comparison, a baseline model was trained using labeled source domain data and then fine-tuned (Sun et al., 2016; Song et al., 2017) using labeled target domain data and is shown as '* FT'.
Note that (1) when labeled target domain data is very limited, say 3,000 labeled samples, '* DA' improves performance more than '* FT', and (2) as the amount of labeled target data increases, the performance with and without ADA increases, and with 30% of the target data (rightmost points) is close to or exceeds that obtained using 100% of the target data.

Visualization of Adaptation Models
Embedded points produced by models (d), (e) and (f) (see Section 5.1) are compared in the visualization in Figure 4. For the one-step adaptation model, (e), embedded points are shown partway through adaptation with ADA (i.e., p in Eqn. (4) is approximately 0.5) and after adaptation. The embedding partway through adaptation, labeled artif,srcADAmid, has moved away from the $T_{art}$ embedding (model (d), labeled artif). After adaptation, labeled artif,srcADA, the embedded points are only slightly closer to the $T_{art}$ embedded points. In contrast, the ASADA (f) embedding is closer to the $T_{art}$ embedding and more compact, as is $T_{art}$. This supports our hypothesis that ASADA retains more of what was learned from the initial target embedding than model (e)'s one-step adaptation, contributing to ASADA's better performance.

Summary
We investigated unsupervised domain adaptation methods for an encoder-decoder model. We proposed the use of artificial titles to adapt a decoder to the target domain vocabulary and style, and sequential adversarial domain adaptation to minimize rapid changes of the encoder embedding space. Our experiments show that, in the unsupervised setting, our proposed approach performed best among the adaptation techniques compared. With very limited target domain labels for fine-tuning, our model also performed better than fine-tuning a model trained on the source domain.
In the future, we would like to understand the usefulness of artificial titles for training the decoder relative to other factors that may impact performance, e.g., how similar the true titles or summaries are in the different domains.

B Cross-Domain Vocabulary Coverage
For the expanded, joint vocabulary of source and target, Figure 5 shows that the number of News target tokens not represented by Stack Exchange vocabulary terms is much larger than the number of Stack Exchange target tokens not represented by News vocabulary terms. When trained on source only, these unrepresented target domain tokens are neither trained nor handled by the pointer-generator mechanism. Adversarial Domain Adaptation enables training of the encoder on these target tokens. Artificial Titles enable the decoder to be trained on these tokens.
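One way to compute such a coverage statistic is sketched below. Whether Figure 5 counts token occurrences or distinct terms is not stated here; this sketch assumes occurrences, and the function names and toy data are hypothetical.

```python
def uncovered_token_count(target_tokens, source_vocab):
    """Count target token occurrences whose term is absent from the source vocabulary."""
    return sum(1 for tok in target_tokens if tok not in source_vocab)


def uncovered_rate(target_tokens, source_vocab):
    """Fraction of target token occurrences not covered by the source vocabulary."""
    tokens = list(target_tokens)
    if not tokens:
        return 0.0
    return uncovered_token_count(tokens, source_vocab) / len(tokens)


# Toy example: a source vocabulary and a tokenized target document.
news_vocab = {"the", "market", "fell"}
stackex_tokens = ["the", "router", "lol", "router"]
missing = uncovered_token_count(stackex_tokens, news_vocab)
rate = uncovered_rate(stackex_tokens, news_vocab)
```

Running the same computation in both directions (News tokens against the Stack Exchange vocabulary and the reverse) gives the asymmetry discussed above.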