A Pilot Study of Domain Adaptation Effect for Neural Abstractive Summarization

We study the problem of domain adaptation for neural abstractive summarization. We make initial efforts in investigating what information can be transferred to a new domain. Experimental results on news stories and opinion articles indicate that a neural summarization model benefits from pre-training on extractive summaries. We also find that combining in-domain and out-of-domain data yields better summaries when in-domain data is insufficient. Further analysis shows that the model is capable of selecting salient content even when trained on out-of-domain data, but requires in-domain data to capture the style of a target domain.


Introduction
Recent text summarization research has moved towards producing abstractive summaries, which better emulate the human summarization process and are more concise (Nenkova et al., 2011). Building on the success of sequence-to-sequence learning with encoder-decoder neural networks (Bahdanau et al., 2014), there has been growing interest in using this framework to generate abstractive summaries (Rush et al., 2015; Wang and Ling, 2016; Takase et al., 2016; Nallapati et al., 2016; See et al., 2017). By directly learning to detect summary-worthy content and to generate fluent sentences, the end-to-end learning framework avoids the feature engineering and template construction required in previous work (Ganesan et al., 2010; Wang and Cardie, 2013; Gerani et al., 2014; Pighin et al., 2014).
Nevertheless, training such systems requires large amounts of labeled data, which creates a major hurdle for new domains where training data is scarce and expensive to acquire. Consequently, we raise the following research questions:
• domain adaptation: can we leverage available out-of-domain abstracts or extractive summaries to help train a neural summarization system for a new domain?
• transferable component: what information is transferable, and what are the limitations?

Input (News): The Department of Defense has identified 441 American service members who have died since the start of the Iraq war. It confirmed the death of the following American yesterday: DAVIS, Raphael S., 24, specialist, Army National Guard; Tutwiler, Miss.; 223rd Engineer Battalion.
Abstract: Name of American newly confirmed dead in Iraq; 441 American service members have died since start of war.
Input (Opinion): WHEN the 1999 United States Ryder Cup team trailed the Europeans, 10-6, going into Sunday's 12 singles matches at the Country Club outside Boston, Ben Crenshaw, the United States captain, issued a declaration of confidence in his golfers. "I'm a big believer in faith," Crenshaw said firmly in his Texas twang. "I have a good feeling about this." The next day, Crenshaw's cavalry won the first seven singles matches. With a sudden 13-10 lead, the turnaround put unexpected pressure on the Europeans, . . .
Abstract: Dave Anderson Sports of The Times column discusses US team's poor performance against Europe in Ryder Cup.
Figure 1: A snippet of a sample news story and opinion article from The New York Times Annotated Corpus (Sandhaus, 2008).
In this paper, we attempt to shed some light on the above questions by investigating neural summarization on two types of documents with major differences: news stories and opinion articles from The New York Times Annotated Corpus (Sandhaus, 2008). Sample articles and human-written abstracts are shown in Figure 1. We select a reasonably simple task: generating short summaries for multi-paragraph documents.
Contributions. We first investigate the effect of parameter initialization via pre-training on extractive summaries. For this purpose, we collect a large-scale dataset of about one million article-extract pairs from The New York Times. Experimental results show that this step improves summarization performance as measured by ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002).
We then treat news stories as the source domain and opinion articles as the target domain, and make an initial attempt at understanding the feasibility of domain adaptation. Importantly, when tested on opinion article summarization, the model that leverages data from both source and target domains outperforms the model trained only on in-domain data when in-domain training data is scarce.
Furthermore, we interpret the learned model to understand what information is transferred to a new domain. In general, a model trained on out-of-domain data can learn to detect summary-worthy content, but may not match the generation style of the target domain. Concretely, we observe that the model trained on the news domain pays a similar amount of attention to summary-worthy content (i.e., words reused by human abstracts) when tested on news and on opinion articles. On the other hand, human writers tend to use new words that do not appear in the input when constructing opinion abstracts. End-to-end evaluation results imply that the model trained on out-of-domain data fails to capture this aspect.
The above observations suggest that the neural summarization model learns to 1) identify salient content, and 2) generate summaries in the style of the training data. The first ability might be transferable to a new domain, whereas the second is much less so.

The Neural Summarization Model
In this work, we choose the attentional sequence-to-sequence model with the pointer-generator mechanism (See et al., 2017) for study. Briefly, the model learns to generate a sequence of tokens $\{y_i\}$ based on the following conditional probability:
$$p(y_i = w \mid y_1, \ldots, y_{i-1}, x) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{j : x_j = w} a_{ij}$$
Here $x$ is the input sequence, $P_{vocab}(w)$ denotes the probability of generating a new word from the vocabulary, $a_{ij}$ is the attention weight on input token $x_j$ at decoding step $i$ (following the formulation of See et al., 2017), and $p_{gen}$ is a learned gate that chooses between generating and copying, depending on the hidden states and attention distribution. This model enhances the original attention model (Bahdanau et al., 2014) by incorporating a pointer network (Vinyals et al., 2015), which allows the decoder to copy accurate information from the input. Due to space limitations, we refer readers to the original paper (See et al., 2017) for model details.
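The mixture between generating and copying can be illustrated with a small sketch. This is not the authors' implementation: the function name `final_distribution`, the toy tokens, and all probabilities below are hypothetical, and the copy term follows the pointer-generator formulation of See et al. (2017), in which attention mass on every input position holding the same word is summed.

```python
from collections import defaultdict


def final_distribution(p_gen, p_vocab, attention, source_tokens):
    """Mix a vocabulary distribution with a copy distribution for one decoding step.

    p_gen         -- probability of generating from the vocabulary (0..1)
    p_vocab       -- dict mapping vocabulary word -> generation probability
    attention     -- attention weights over source positions (sums to 1)
    source_tokens -- input tokens, aligned with `attention`
    """
    if len(attention) != len(source_tokens):
        raise ValueError("attention and source_tokens must have the same length")
    mixed = defaultdict(float)
    # Generation term: p_gen * P_vocab(w)
    for word, prob in p_vocab.items():
        mixed[word] += p_gen * prob
    # Copy term: (1 - p_gen) * sum of attention over positions where the word occurs
    for token, weight in zip(source_tokens, attention):
        mixed[token] += (1.0 - p_gen) * weight
    return dict(mixed)


# Toy example (all values are made up for illustration).
source = ["garland", "sang", "garland"]
attention = [0.5, 0.2, 0.3]
p_vocab = {"sang": 0.6, "the": 0.4}
dist = final_distribution(0.25, p_vocab, attention, source)
print(dist)
```

Note that a word such as "garland" that is absent from the toy vocabulary still receives probability mass through the copy term, which is what lets the decoder reproduce out-of-vocabulary input words.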
For experiments, we employ a bidirectional recurrent neural network (RNN) as the encoder and a unidirectional RNN as the decoder, both implemented with Long Short-Term Memory (LSTM) units with 256 hidden units. Input and output data are lowercased as described in See et al. (2017).

Datasets and Experimental Setup
Primary Data. Our primary data source is The New York Times Annotated Corpus (Sandhaus, 2008) (henceforth NYT-annotated). Compared with other commonly used datasets for abstractive summarization, NYT-annotated has more variation in its abstracts, such as paraphrase and generalization. It also comes with other human labels that we can use to characterize the type of each article. The whole dataset consists of 1.8 million articles, of which 650,000 are annotated with human-constructed abstracts. Articles longer than 15 tokens and abstracts longer than 10 tokens are extracted for use in our study (as in Figure 1).
The resulting dataset is further separated into two types based on the articles' taxonomy tags 1: NEWS stories and OPINION articles. We believe these two types of documents are different enough in terms of topics, summary style, and lexical-level language use that they can be treated as different domains for our study. We collected 100,824

[Figure panel: Subjectivity Distribution on Abstracts (News vs. Opinion)]

We also make use of the section tag, such as Business, Sports, or Arts, to calculate the topic distribution for these two domains. About 57% of NEWS documents are about Sports, whereas more than 78% of OPINION documents are about Arts. We also observe different levels of subjectivity based on the percentage of strongly subjective words taken from the MPQA lexicon (Wilson et al., 2005). On average, 4.1% of the tokens in OPINION articles are strongly subjective, compared to 2.9% for NEWS stories. This shows that the topics and word usage are essentially different between these two domains.
Characterizing Two Domains. Here we characterize the difference between NEWS and OPINION by analyzing the distribution of word types in abstracts and how often humans reuse words from the input text to construct the summaries. Overall, 81.3% of the words in NEWS abstracts are reused from the input, compared with 75.8% for OPINION. The distribution of words by part of speech is displayed on the left of Figure 2, which shows that there are relatively more nouns in OPINION. In the same figure, we display the percentage of words in abstracts that are reused from the input, which suggests that humans tend to reuse more nouns and verbs for NEWS abstracts. Furthermore, the distributions of named-entity words and subjective words in abstracts are depicted in Figure 3.
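As a rough illustration of the reuse statistic, the sketch below computes the fraction of abstract tokens that also occur in the input. It assumes exact string matching on pre-tokenized, lowercased text; the matching criterion actually used for the reported percentages is not specified here, and the function name `reuse_rate` and the toy data are hypothetical.

```python
def reuse_rate(article_tokens, abstract_tokens):
    """Fraction of abstract tokens that also appear somewhere in the article."""
    if not abstract_tokens:
        return 0.0
    source_vocab = set(article_tokens)  # exact-match lookup (an assumption)
    reused = sum(1 for token in abstract_tokens if token in source_vocab)
    return reused / len(abstract_tokens)


# Toy example: "team" and "match" are reused; "wins" and "easily" are new words.
article = ["the", "team", "won", "the", "match"]
abstract = ["team", "wins", "match", "easily"]
rate = reuse_rate(article, abstract)
print(f"{rate:.1%} of abstract tokens are reused from the input")
```

Averaging this quantity over all article-abstract pairs in a domain would give a corpus-level figure comparable in spirit to the percentages above; restricting the count to tokens with a given part-of-speech tag would give the per-category breakdown.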
Model Pre-training Dataset. We further collect lead paragraphs and article descriptions for 1,435,735 articles from The New York Times API 2. About 71% of these descriptions are the first sentences of the lead paragraphs, and thus can be considered extractive summaries. About one million lead paragraph and description pairs are retained for pre-training 3 (henceforth NYT-extract).
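One simple way to test whether a description is extractive in this sense is to check whether it matches the beginning of the lead paragraph. The sketch below is an assumption about how such a check could be done, not the procedure used to obtain the 71% figure; `normalize` and `is_lead_extract` are hypothetical helper names, and a prefix match is only an approximation of "is the first sentence".

```python
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivial differences do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_lead_extract(description, lead_paragraph):
    """True if the description is a (non-empty) prefix of the lead paragraph."""
    desc = normalize(description)
    return bool(desc) and normalize(lead_paragraph).startswith(desc)


lead = "The team won the final. Fans celebrated downtown."
print(is_lead_extract("The team  won the final.", lead))   # description copies the first sentence
print(is_lead_extract("Fans celebrated downtown.", lead))  # description is a later sentence
```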
Training Setup. We randomly divide NYT-annotated into training (75%), validation (15%), and test (10%) sets.
Evaluation Metrics. We use automatic evaluation with recall-oriented ROUGE (Lin, 2004) and precision-oriented BLEU (Papineni et al., 2002). We consider ROUGE-2, which measures bigram recall, and ROUGE-L, which takes into account the longest common subsequence. We also evaluate with BLEU, which measures n-gram precision up to bigrams.
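To make the recall/precision distinction concrete, the following sketch computes clipped bigram recall (the core of ROUGE-2) and clipped bigram precision (one component of BLEU) for a single candidate-reference pair. It is a simplified illustration, not the official ROUGE or BLEU tooling: it omits stemming, multiple references, BLEU's brevity penalty, and the combination across n-gram orders. All function names and the toy sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n):
    """Clipped n-gram match count plus the candidate and reference n-gram totals."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matches = sum((cand & ref).values())  # Counter intersection clips repeated n-grams
    return matches, sum(cand.values()), sum(ref.values())


def rouge_n_recall(candidate, reference, n=2):
    matches, _, ref_total = overlap(candidate, reference, n)
    return matches / ref_total if ref_total else 0.0


def ngram_precision(candidate, reference, n=2):
    matches, cand_total, _ = overlap(candidate, reference, n)
    return matches / cand_total if cand_total else 0.0


reference = "critic reviews new novel by author".split()
candidate = "critic reviews novel written by author today".split()
recall = rouge_n_recall(candidate, reference)
precision = ngram_precision(candidate, reference)
print(recall, precision)
```

Here two of the five reference bigrams are recovered, while only two of the six candidate bigrams are correct, so a longer candidate is penalized by the precision-oriented score but not by the recall-oriented one.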

Results
Effect of Pre-training with Extracts. We first evaluate whether pre-training can improve summarization performance for IN-DOMAIN setups, where we initialize model parameters by training on NYT-extract for about 20,000 iterations. Otherwise, parameters are randomly initialized. Results are displayed in Table 1. We also consider two baselines: BASELINE1 outputs the first sentence, and BASELINE2 selects the first 22 (news) or 15 (opinion) tokens, which gives lengths similar to those of the human summaries.
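The two baselines are simple enough to sketch directly. The code below is only an approximation under stated assumptions: sentence boundaries are found with a naive punctuation-based regular expression and tokens are whitespace-delimited, whereas the sentence splitter and tokenizer behind the reported numbers are not specified. The function names are hypothetical; the token budgets of 22 and 15 come from the text above.

```python
import re


def first_sentence(text):
    """BASELINE1-style output: everything up to the first sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)
    return parts[0]


def first_k_tokens(text, k):
    """BASELINE2-style output: the first k whitespace-delimited tokens."""
    return " ".join(text.split()[:k])


TOKEN_BUDGET = {"news": 22, "opinion": 15}  # values stated in the text

doc = "The orchestra opened its season on Monday. Critics praised the conductor."
print(first_sentence(doc))
print(first_k_tokens(doc, TOKEN_BUDGET["opinion"]))
```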
As can be seen, the pre-training step improves performance for NEWS, whereas performance on OPINION remains roughly the same. This might be because news abstracts reuse more words from the input and are therefore closer to extractive summaries than opinion abstracts are.
We further classify the words in gold-standard summaries by whether they are seen in abstracts during training and then by whether they are taken from the input text, and examine whether they are generated correctly. The full opinion training set is used for in-domain and mix-domain training. Table 2 shows that, among in-domain models, the model trained for news is better at generating tokens mentioned in the input than the model trained for opinion (33.7% vs. 22.0%). Nonetheless, the model trained for opinion is better at generating new words not in the input (8.2% vs. 2.6%). This is consistent with our observation that in the opinion domain human editors favor new words that differ from the input.

[Table columns: R-2, R-L, BLEU, Avg Len]

Further Analysis. Here we study what information is transferable across domains by investigating the attention weights assigned to the input text.
What can be transferred. We start with the input words that receive the highest attention weights when generating the summaries. Among these, we show the percentage falling into different word categories in Table 3. For named entities, the model trained on out-of-domain data pays more attention to PERSON and less attention to ORGANIZATION, while the in-domain trained model does the reverse. This is consistent with the fact that opinion abstracts contain more PERSON and fewer ORGANIZATION entities than news abstracts (see Figure 3). This suggests that the identification of summary-worthy named entities might be transferable from NEWS to OPINION. A similar effect is also observed for nouns and verbs, though it is less significant.
Attention change for domain adaptation. We also examine the percentage of attention paid to summary-worthy words. For every output token we pick the input token with the highest attention weight, and count the ones reused by human abstracts.

H: stephen holden reviews carnegie hall concert celebrating music of judy garland. singers include her daughter, lorna luft.
Out-of-Domain: article discusses possibility of carnegie hall in carnegie hall golf tournament.
Mix-Domain: stephen holden reviews performance by jazz singer celebration by rainbow and garland at carnegie, part of tribute hall.

H: janet maslin reviews john grisham book the king of torts .
Out-of-Domain: interview with john grisham of legal thriller is itself proof for john grisham 376 pages.
Mix-Domain: janet maslin reviews book the king of torts by john grisham .

H: anthony tommasini reviews 23d annual benefit concert of richard tucker music foundation , featuring members of metropolitan opera orchestra led by leonard slatkin .
Out-of-Domain: final choral society and richard tucker music foundation , on sunday night in [UNK] fisher hall , will even longer than substantive 22d gala last year .
Mix-Domain: anthony tommasini reviews 23d annual benefit concert of benefit of richard tucker music.
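The attention analysis described in this section (taking, for each output token, the input token with the highest attention weight and checking whether it is summary-worthy) can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that "summary-worthy" means the attended input token also appears in the human abstract, and the function name `attended_reuse_rate`, the attention matrix, and the tokens are all hypothetical.

```python
def attended_reuse_rate(attention, source_tokens, human_abstract_tokens):
    """Share of decoding steps whose most-attended input token occurs in the human abstract.

    attention[t][j] is the attention weight on source position j at output step t.
    """
    if not attention:
        return 0.0
    abstract_vocab = set(human_abstract_tokens)
    hits = 0
    for weights in attention:
        # Index of the source position with the largest attention weight at this step
        top = max(range(len(weights)), key=weights.__getitem__)
        if source_tokens[top] in abstract_vocab:
            hits += 1
    return hits / len(attention)


# Toy example with three decoding steps over a five-token input.
source = ["the", "concert", "was", "held", "monday"]
attention = [
    [0.1, 0.6, 0.1, 0.1, 0.1],  # peaks on "concert"
    [0.5, 0.1, 0.1, 0.2, 0.1],  # peaks on "the"
    [0.1, 0.1, 0.1, 0.1, 0.6],  # peaks on "monday"
]
human_abstract = ["review", "of", "monday", "concert"]
rate = attended_reuse_rate(attention, source, human_abstract)
print(rate)
```

Computing this rate for the same trained model on test articles from each domain gives one way to compare how much attention falls on summary-worthy words in NEWS versus OPINION inputs.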

Related Work
Domain adaptation has been studied for a wide range of natural language processing tasks (Blitzer et al., 2007; Florian et al., 2004; Daume III, 2007; Foster et al., 2010). However, little has been done to investigate it for summarization systems (Sandu et al., 2010; Wang and Cardie, 2013). To the best of our knowledge, we are the first to study the adaptation of neural summarization models to a new domain. Furthermore, recent work in neural summarization mainly focuses on specific extensions to improve system performance (Rush et al., 2015; Takase et al., 2016; Gu et al., 2016; Nallapati et al., 2016; Ranzato et al., 2015). It is unclear how to adapt existing neural summarization systems to a new domain when training data is limited or unavailable. This is the question we aim to address in this work.

Conclusion
We investigated domain adaptation for abstractive neural summarization. Experimental results showed that pre-training the model with extractive summaries helps. By analyzing the attention weight distribution over input tokens, we found that the model is capable of selecting salient information even when trained on out-of-domain data. This points to future directions in which domain adaptation techniques can be developed to allow a summarization system to learn content selection from out-of-domain data while acquiring language generation behavior from in-domain data.