Supplementary: Document Modeling with External Attention for Sentence Extraction

It is a challenging task to rely only on the main body of the document for extraction cues, as it requires document understanding. Documents in practice often have additional information, such as the title, image captions, videos, images and twitter handles, along with the main body of the document. These types of information are often available for newswire articles. Figure 1 shows an example of a newswire article taken from CNN (CNN.com). It shows the additional information such as the title (first block) and the images with their captions (third block) along with the main body of the document (second block). The last block shows a manually written summary of the document in terms of “highlights” to allow readers to quickly gather information on stories. As one can see in this example, gold highlights focus on sentences from the fourth paragraph, i.e., on key events such as the “PM’s resignation”, “bribery scandal and its investigation”, “suicide” and “leaving an important note”. Interestingly, the essence of the article is explicitly or implicitly mentioned in the title and the image captions of the document.


Introduction
Recurrent neural networks have become one of the most widely used models in natural language processing (NLP). A number of variants of RNNs such as Long Short-Term Memory networks (LSTM; Hochreiter and Schmidhuber, 1997) and Gated Recurrent Unit networks (GRU; Cho et al., 2014) have been designed to model text capturing long-term dependencies in problems such as language modeling. However, document modeling, a key to many natural language understanding tasks, is still an open challenge. Recently, some neural network architectures were proposed to capture large context for modeling text (Mikolov and Zweig, 2012; Ghosh et al., 2016; Ji et al., 2015; Wang and Cho, 2016). Lin et al. (2015) and Yang et al. (2016) proposed a hierarchical RNN network for document-level modeling as well as sentence-level modeling, at the cost of increased computational complexity. Tran et al. (2016) further proposed a contextual language model that considers information at the inter-document level.
It is challenging to rely only on the document for its understanding, and as such it is not surprising that these models struggle on problems such as document summarization (Cheng and Lapata, 2016; Chen et al., 2016; Nallapati et al., 2017; See et al., 2017; Tan and Wan, 2017) and machine reading comprehension (Trischler et al., 2016; Miller et al., 2016; Weissenborn et al., 2017; Hu et al., 2017). In this paper, we formalize the use of external information to further guide document modeling for end goals. We present a simple yet effective document modeling framework for sentence extraction that allows machine reading with "external attention." Our model includes a neural hierarchical document encoder (or a machine reader) and a hierarchical attention-based sentence extractor. Our hierarchical document encoder resembles the architectures proposed by Cheng and Lapata (2016) and Narayan et al. (2018) in that it derives the document meaning representation from its sentences and their constituent words. Our novel sentence extractor combines this document meaning representation with an attention mechanism (Bahdanau et al., 2015) over the external information to label sentences from the input document. Our model explicitly biases the extractor with external cues and implicitly biases the encoder through training.
We demonstrate the effectiveness of our model on two problems that can be naturally framed as sentence extraction with external information. These two problems, extractive document summarization and answer selection for machine reading comprehension, both require local and global contextual reasoning about a given document. Extractive document summarization systems aim at creating a summary by identifying (and subsequently concatenating) the most important sentences in a document, whereas answer selection systems select the candidate sentence in a document most likely to contain the answer to a query. For document summarization, we exploit the title and image captions which often appear with documents (specifically newswire articles) as external information. For answer selection, we use word overlap features, such as the inverse sentence frequency (ISF, Trischler et al., 2016) and the inverse document frequency (IDF) together with the query, all formulated as external cues.
Our main contributions are three-fold: First, our model ensures that sentence extraction is done in a larger (rich) context, i.e., the full document is read first before we start labeling its sentences for extraction, and each sentence labeling is done by implicitly estimating its local and global relevance to the document and by directly attending to some external information for importance cues.
Second, while external information has been shown to be useful for summarization systems using traditional hand-crafted features (Edmundson, 1969; Kupiec et al., 1995; Mani, 2001), our model is the first to exploit such information in deep learning-based summarization. We evaluate our models automatically (in terms of ROUGE scores) on the CNN news highlights dataset (Hermann et al., 2015). Experimental results show that our summarizer, informed with title and image captions, consistently outperforms summarizers that do not use this information. We also conduct a human evaluation to judge which type of summary participants prefer. Our results overwhelmingly show that human subjects find our summaries more informative and complete.
Lastly, with the machine reading capabilities of our model, we confirm that a full document needs to be "read" to produce high quality extracts allowing a rich contextual reasoning, in contrast to previous answer selection approaches that often measure a score between each sentence in the document and the question and return the sentence with the highest score in an isolated manner (Yin et al., 2016). Our model with ISF and IDF scores as external features achieves competitive results for answer selection. Our ensemble model, which combines scores from our model and word overlap scores using a logistic regression layer, achieves state-of-the-art results on the popular question answering datasets WikiQA and NewsQA (Trischler et al., 2016), and it obtains comparable results to the state of the art for SQuAD (Rajpurkar et al., 2016). We also evaluate our approach on the MSMarco dataset (Nguyen et al., 2016) and elaborate on the behavior of our machine reader in a scenario where the candidate answer sentences are contextually independent of each other.

Document Modeling For Sentence Extraction
Given a document $D$ consisting of a sequence of $n$ sentences $(s_1, s_2, \ldots, s_n)$, we aim at labeling each sentence $s_i$ in $D$ with a label $y_i \in \{0, 1\}$ where $y_i = 1$ indicates that $s_i$ is extraction-worthy and 0 otherwise. Our architecture resembles those previously proposed in the literature (Cheng and Lapata, 2016; Nallapati et al., 2017). The main components include a sentence encoder, a document encoder, and a novel sentence extractor (see Figure 1) that we describe in more detail below. The novel characteristics of our model are that each sentence is labeled by implicitly estimating its (local and global) relevance to the document and by directly attending to some external information for importance cues.
Sentence Encoder A core component of our model is a convolutional sentence encoder (Kim, 2014; Kim et al., 2016) which encodes sentences into continuous representations. We use temporal narrow convolution by applying a kernel filter $K$ of width $h$ to a window of $h$ words in sentence $s$ to produce a new feature. This filter is applied to each possible window of words in $s$ to produce a feature map $f \in \mathbb{R}^{k-h+1}$ where $k$ is the sentence length. We then apply max-pooling over time over the feature map $f$ and take the maximum value as the feature corresponding to this particular filter $K$. We use multiple kernels of various sizes and each kernel multiple times to construct the representation of a sentence. In Figure 1, kernels of size 2 (red) and 4 (blue) are applied three times each. The max-pooling over time operation yields two feature lists $f^{K_2}$ and $f^{K_4} \in \mathbb{R}^3$. The final sentence embeddings have six dimensions.
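The narrow convolution and max-pooling over time described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy word vectors and the kernel values are hypothetical, and the bias term and non-linearity that a real convolutional encoder would normally include are left out for brevity.

```python
from typing import List

Vector = List[float]


def narrow_conv_feature_map(words: List[Vector], kernel: List[Vector]) -> List[float]:
    """Slide a kernel of width h over k word vectors without padding.

    Returns a feature map with k - h + 1 entries, one per window of h words.
    """
    h, k = len(kernel), len(words)
    if k < h:
        raise ValueError("sentence is shorter than the kernel width")
    feature_map = []
    for start in range(k - h + 1):
        total = 0.0
        for offset in range(h):
            word, weights = words[start + offset], kernel[offset]
            total += sum(x * w for x, w in zip(word, weights))
        feature_map.append(total)
    return feature_map


def encode_sentence(words: List[Vector], kernels: List[List[Vector]]) -> List[float]:
    """Max-pool each kernel's feature map over time; one feature per kernel."""
    return [max(narrow_conv_feature_map(words, kernel)) for kernel in kernels]


# Toy sentence of k = 5 words with 2-dimensional embeddings (hypothetical values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.0]]

# Three kernels of width 2 and three of width 4, mirroring the Figure 1 setup.
kernels_w2 = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[-1.0, 0.0], [1.0, 0.0]],
]
kernels_w4 = [[[scale, -scale]] * 4 for scale in (1.0, 0.5, 0.25)]

embedding = encode_sentence(sentence, kernels_w2 + kernels_w4)
print(len(embedding))  # one feature per kernel
```

With three kernels of each width, the resulting embedding has six dimensions, matching the configuration described for Figure 1.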

Document Encoder
The document encoder composes a sequence of sentences to obtain a document representation. We use a recurrent neural network with LSTM cells to avoid the vanishing gradient problem when training long sequences (Hochreiter and Schmidhuber, 1997). Given a document $D$ consisting of a sequence of sentences $(s_1, s_2, \ldots, s_n)$, we follow common practice and feed the sentences in reverse order (Sutskever et al., 2014; Filippova et al., 2015).
Sentence Extractor Our sentence extractor sequentially labels each sentence in a document with 1 or 0 by implicitly estimating its relevance in the document and by directly attending to the external information for importance cues. It is implemented with another RNN with LSTM cells with an attention mechanism (Bahdanau et al., 2015) and a softmax layer. Our attention mechanism differs from the standard practice of attending to intermediate states of the input (encoder). Instead, our extractor attends to a sequence of $p$ pieces of external information $E : (e_1, e_2, \ldots, e_p)$ relevant for the task (e.g., $e_i$ is a title or an image caption for summarization) for cues. At time $t_i$, it reads sentence $s_i$ and makes a binary prediction, conditioned on the document representation (obtained from the document encoder), the previously labeled sentences and the external information. This way, our labeler is able to identify locally and globally important sentences within the document which correlate well with the external information. Given sentence $s_t$ at time step $t$, it returns a probability distribution over labels as:
$$p(y_t \mid s_t, D, E, \theta) = \mathrm{softmax}(g(h_t, h'_t)) \tag{1}$$
$$g(h_t, h'_t) = U_o (V_h h_t + W_{h'} h'_t) \tag{2}$$
$$h_t = \mathrm{LSTM}(s_t, h_{t-1}), \qquad h'_t = \sum_{i=1}^{p} \alpha_{(t,i)} e_i$$
where $g(\cdot)$ is a single-layer neural network with parameters $U_o$, $V_h$ and $W_{h'}$, $h_t$ is an intermediate RNN state at time step $t$, and $\alpha_{(t,i)}$ is the attention weight assigned to $e_i$ at time step $t$. The dynamic context vector $h'_t$ is essentially the weighted sum of the external information $(e_1, e_2, \ldots, e_p)$. Figure 1 summarizes our model.
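To make the labeling step concrete, the sketch below computes a dynamic context vector as a weighted sum of external vectors and feeds it, together with an RNN state, through a single linear layer and a softmax. Several points are assumptions made for illustration rather than details confirmed by the text: the attention weights are taken to be a softmax over dot products between the RNN state and each external vector, the LSTM itself is omitted (the state `h_t` is supplied directly), and all function names and parameter values are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def external_context(h_t: Vector, external: List[Vector]) -> Vector:
    """Weighted sum of external vectors.

    Assumption: weights are a softmax over dot products with the RNN state.
    """
    scores = [sum(a * b for a, b in zip(h_t, e)) for e in external]
    weights = softmax(scores)
    dim = len(external[0])
    return [sum(w * e[d] for w, e in zip(weights, external)) for d in range(dim)]


def label_distribution(h_t: Vector, external: List[Vector],
                       U_o: Matrix, V_h: Matrix, W_hp: Matrix) -> Vector:
    """Distribution over labels {0, 1} from the state and the external context."""
    context = external_context(h_t, external)
    hidden = [a + b for a, b in zip(matvec(V_h, h_t), matvec(W_hp, context))]
    return softmax(matvec(U_o, hidden))


# Hypothetical 2-dimensional state, two external vectors (e.g. title, caption).
h_t = [0.5, -0.2]
external = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = label_distribution(h_t, external, U_o=identity, V_h=identity, W_hp=identity)
print(probs)
```

In a trained model the three matrices would be learned jointly with the encoders; identity matrices are used here only so the example runs without any training.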

Sentence Extraction Applications
We validate our model on two sentence extraction problems: extractive document summarization and answer selection for machine reading comprehension. Both these tasks require local and global contextual reasoning about a given document. As such, they test the ability of our model to facilitate document modeling using external information.
Extractive Summarization An extractive summarizer aims to produce a summary $S$ by selecting $m$ sentences from $D$ (where $m < n$). In this setting, our sentence extractor sequentially predicts label $y_i \in \{0, 1\}$ (where 1 means that $s_i$ should be included in the summary) by assigning score $p(y_i \mid s_i, D, E, \theta)$ quantifying the relevance of $s_i$ to the summary. We assemble a summary $S$ by selecting the $m$ sentences with the top $p(y_i = 1 \mid s_i, D, E, \theta)$ scores.
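The summary assembly step can be sketched as a simple top-m selection over per-sentence scores. The function name is hypothetical, and presenting the selected sentences in their original document order is an assumption of this sketch; the text only states that the m highest-scoring sentences are selected.

```python
from typing import List


def select_summary(sentences: List[str], scores: List[float], m: int = 3) -> List[str]:
    """Pick the m sentences with the highest p(y_i = 1) scores.

    Assumption: the chosen sentences are emitted in document order.
    """
    if len(sentences) != len(scores):
        raise ValueError("need exactly one score per sentence")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:m])
    return [sentences[i] for i in chosen]


# Hypothetical sentences and extractor scores.
doc = ["s1", "s2", "s3", "s4"]
probs = [0.10, 0.90, 0.40, 0.80]
print(select_summary(doc, probs, m=2))
```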
We formulate external information E as the sequence of the title and the image captions associated with the document. We use the convolutional sentence encoder to get their sentence-level representations.
Answer Selection Given a question $q$ and a document $D$, the goal of the task is to select one candidate sentence $s_i \in D$ in which the answer exists. In this setting, our sentence extractor sequentially predicts label $y_i \in \{0, 1\}$ (where 1 means that $s_i$ contains the answer) and assigns score $p(y_i \mid s_i, D, E, \theta)$ quantifying the relevance of $s_i$ to the query. We return as answer the sentence $s_i$ with the highest $p(y_i = 1 \mid s_i, D, E, \theta)$ score.
We treat the question $q$ as external information and use the convolutional sentence encoder to get its sentence-level representation. This simplifies Eq. (1) and (2) as follows:
$$p(y_t \mid s_t, D, q, \theta) = \mathrm{softmax}(g(h_t, q)) \tag{3}$$
$$g(h_t, q) = U_o (V_h h_t + W_q q)$$
where $V_h$ and $W_q$ are network parameters. We exploit the simplicity of our model to further assimilate external features relevant for answer selection: the inverse sentence frequency (ISF; Trischler et al., 2016), the inverse document frequency (IDF) and a modified version of the ISF score which we call local ISF. Trischler et al. (2016) have shown that a simple ISF baseline (i.e., selecting the sentence with the highest ISF score as the answer) correlates well with the answers. The ISF score $\alpha_{s_i}$ for the sentence $s_i$ is computed as $\alpha_{s_i} = \sum_{w \in s_i \cap q} \mathrm{IDF}(w)$, where IDF is the inverse document frequency score of word $w$, defined as:
$$\mathrm{IDF}(w) = \log \frac{N}{N_w}$$
where $N$ is the total number of sentences in the training set and $N_w$ is the number of sentences in which $w$ appears. Note that $s_i \cap q$ refers to the set of words that appear both in $s_i$ and in $q$. Local ISF is calculated in the same manner as the ISF score, except that the total number of sentences ($N$) is set to the number of sentences in the article being analyzed.
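More formally, this modifies Eq. (3) as follows:
$$p(y_t \mid s_t, D, q, \theta) = \mathrm{softmax}(g(h_t, q, \alpha_t, \beta_t, \gamma_t)) \tag{4}$$
where $\alpha_t$, $\beta_t$ and $\gamma_t$ are the ISF, IDF and local ISF scores (real values) of sentence $s_t$, respectively.
The function $g$ is calculated as follows:
$$g(h_t, q, \alpha_t, \beta_t, \gamma_t) = U_o \big( V_h h_t + W_q q + W_{isf} (\alpha_t \cdot \mathbf{1}) + W_{idf} (\beta_t \cdot \mathbf{1}) + W_{lisf} (\gamma_t \cdot \mathbf{1}) \big)$$
where $W_{isf}$, $W_{idf}$ and $W_{lisf}$ are new parameters added to the network and $\mathbf{1}$ is a vector of 1s of size equal to the sentence embedding size. In Figure 1, these external feature vectors are represented as 6-dimensional gray vectors accompanied by dashed arrows.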
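The ISF and local ISF scores defined above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: tokenization is plain lowercased whitespace splitting, words that never occur in the counting corpus are skipped to avoid division by zero, and the function names and the toy article are hypothetical. Passing the sentences of the training set gives ISF; passing only the sentences of the current article gives local ISF.

```python
import math
from collections import Counter
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Assumption: lowercased whitespace tokens, treated as a set of words."""
    return set(text.lower().split())


def sentence_frequencies(sentences: List[str]) -> Counter:
    """N_w: the number of sentences in which each word appears."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


def isf_score(sentence: str, question: str, counts: Counter, n_sentences: int) -> float:
    """Sum of IDF(w) = log(N / N_w) over words shared by sentence and question."""
    shared = tokenize(sentence) & tokenize(question)
    # Words unseen in the counting corpus are skipped (assumption of this sketch).
    return sum(math.log(n_sentences / counts[w]) for w in shared if counts[w] > 0)


# Local ISF: counts and N come from the article being analyzed.
article = [
    "the pm resigned on monday",
    "the scandal involved bribery",
    "the pm left a note",
]
local_counts = sentence_frequencies(article)
question = "who resigned"
local_isf = [isf_score(s, question, local_counts, len(article)) for s in article]
best = max(range(len(article)), key=lambda i: local_isf[i])
print(article[best])
```

Selecting the sentence with the highest score, as in the last lines, corresponds to the simple ISF-style baseline mentioned in the text.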

Experiments and Results
This section presents our experimental setup and results assessing our model in both the extractive summarization and answer selection setups. In the rest of the paper, we refer to our model as XNET for its ability to exploit eXternal information to improve document representation.

Extractive Document Summarization
Summarization Dataset We evaluated our models on the CNN news highlights dataset (Hermann et al., 2015). We used the standard splits of Hermann et al. (2015) for training, validation, and testing (90,266/1,220/1,093 documents). We followed previous studies (Cheng and Lapata, 2016; Nallapati et al., 2016, 2017; See et al., 2017; Tan and Wan, 2017) in assuming that the "story highlights" associated with each article are gold-standard abstractive summaries. We trained our network on a named-entity-anonymized version of news articles. However, we generated deanonymized summaries and evaluated them against gold summaries to facilitate human evaluation and to make human evaluation comparable to automatic evaluation. To train our model, we need documents annotated with sentence extraction information, i.e., each sentence in a document is labeled with 1 (summary-worthy) or 0 (not summary-worthy). We followed Nallapati et al. (2017) and automatically extracted ground truth labels such that all positively labeled sentences from an article collectively give the highest ROUGE (Lin and Hovy, 2003) score with respect to the gold summary.
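One way to picture the automatic label extraction is a greedy search that keeps adding the sentence that most improves overlap with the gold summary. The text does not spell out the search procedure or the exact ROUGE variant, so both choices here are assumptions: the search is greedy, and clipped unigram recall stands in for ROUGE. Function names, the sentence cap and the toy data are hypothetical.

```python
from collections import Counter
from typing import List


def unigram_recall(selected: List[str], reference: str) -> float:
    """Clipped unigram recall against the reference (a simple stand-in for ROUGE)."""
    ref = Counter(reference.lower().split())
    cand = Counter(" ".join(selected).lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


def greedy_oracle_labels(sentences: List[str], reference: str,
                         max_sentences: int = 3) -> List[int]:
    """Greedily add the sentence that most improves recall; stop when none helps."""
    chosen: List[int] = []
    best = 0.0
    while len(chosen) < max_sentences:
        candidates = [
            (unigram_recall([sentences[j] for j in chosen + [i]], reference), i)
            for i in range(len(sentences)) if i not in chosen
        ]
        if not candidates:
            break
        score, index = max(candidates, key=lambda pair: pair[0])
        if score <= best:
            break
        chosen.append(index)
        best = score
    return [1 if i in chosen else 0 for i in range(len(sentences))]


# Hypothetical article sentences and gold highlight.
article = ["the pm resigned", "weather was mild", "a note was left"]
highlight = "pm resigned and left a note"
print(greedy_oracle_labels(article, highlight))
```

A greedy search does not guarantee the globally best subset, but it avoids enumerating every combination of sentences.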
We used a modified script of Hermann et al. (2015) to extract titles and image captions, and we associated them with the corresponding articles. All articles are associated with their titles. The number of image captions varies from 0 to 414 per article, with an average of 3 image captions. 40% of the CNN articles have at least one image caption.
All sentences, including titles and image captions, were padded with zeros to a sentence length of 100. All input documents were padded with zeros to a maximum document length of 126. For each document, we consider a maximum of 10 image captions. We experimented with various numbers (1, 3, 5, 10 and 20) of image captions on the validation set and found that our model performed best with 10 image captions. We refer the reader to the supplementary material for more implementation details to replicate our results.

Comparison Systems
We compared the output of our model against the standard baseline of simply selecting the first three sentences from each document as the summary. We refer to this baseline as LEAD in the rest of the paper.
We also compared our system against the sentence extraction system of Cheng and Lapata (2016). We refer to this system as POINTERNET as the neural attention architecture in Cheng and Lapata (2016) resembles the one of Pointer Networks.³ It does not exploit any external information.⁴ ⁵
Automatic Evaluation To automatically assess the quality of our summaries, we used ROUGE (Lin and Hovy, 2003), a recall-oriented metric, to compare our model-generated summaries to manually-written highlights.⁶ Previous work has reported ROUGE-1 (R1) and ROUGE-2 (R2) scores to assess informativeness, and ROUGE-L (RL) to assess fluency. In addition to R1, R2 and RL, we also report ROUGE-3 (R3) and ROUGE-4 (R4), which capture higher-order n-gram overlap, to assess informativeness and fluency simultaneously.
We report our results on both full length (three sentences with the top scores as the summary) and fixed length (first 75 bytes and 275 bytes as the summary) summaries. For full length summaries, our decision of selecting three sentences is guided by the fact that there are 3.11 sentences on average in the gold highlights of the training set. We conduct our ablation study on the validation set with full length ROUGE scores, but we report both fixed and full length ROUGE scores for the test set.
³ The architecture of POINTERNET is closely related to our model without external information.
⁴ Adding external information to POINTERNET is an interesting direction of research but we do not pursue it here. It requires decoding with multiple types of attentions and this is not the focus of this paper.
⁵ We are unable to compare our results to the extractive system of Nallapati et al. (2017) because they report their results on the DailyMail dataset and their code is not available. The abstractive systems of Chen et al. (2016) and Tan and Wan (2017) report their results on the CNN dataset; however, their results are not comparable to ours as they report the full-length F1 variants of ROUGE to evaluate their abstractive summaries. We report ROUGE recall scores, which are more appropriate for evaluating our extractive summaries.
⁶ We used pyrouge, a Python package, to compute all our ROUGE scores with parameters "-a -c 95 -m -n 4 -w 1.2."
We experimented with two types of external information: title (TITLE) and image captions (CAPTION). In addition, we experimented with the first sentence (FS) of the document as external information. Note that the latter is not external information; it is a sentence in the document. However, we wanted to explore the idea that the first sentence of the document plays a crucial part in generating summaries (Rush et al., 2015; Nallapati et al., 2016). XNET with FS acts as a baseline for XNET with title and image captions.
We report the performance of several variants of XNET on the validation set in Table 1. We also compare them against the LEAD baseline and POINTERNET. These two systems do not use any additional information. Interestingly, all the variants of XNET significantly outperform LEAD and POINTERNET. When the title (TITLE), image captions (CAPTION) and the first sentence (FS) are used separately as additional information, XNET performs best with TITLE as its external information. Our result demonstrates the importance of the title of the document in extractive summarization (Edmundson, 1969; Kupiec et al., 1995; Mani, 2001). The performance with TITLE and CAPTION is better than that with FS. We also tried possible combinations of TITLE, CAPTION and FS. All XNET models are superior to the ones without any external information. XNET performs best when TITLE and CAPTION are jointly used as external information (55.4%, 21.8%, 11.8%, 7.5%, and 49.2% for R1, R2, R3, R4, and RL respectively). It is better than the LEAD baseline by 3.7 points on average and than POINTERNET by 1.8 points on average, indicating that external information is useful to identify the gist of the document. We use this model for testing purposes.
Our final results on the test set are shown in Table 2. For the shortest summaries (75 bytes), LEAD and POINTERNET are superior to XNET. This result could be because LEAD (always) and POINTERNET (often) include the first sentence in their summaries, whereas XNET is better able to select sentences from various document positions. This is not captured by smaller summaries of 75 bytes, but it becomes more evident with longer summaries (275 bytes and full length) where XNET performs best across all ROUGE scores. We note that POINTERNET outperforms LEAD for 75-byte summaries, then its performance drops behind LEAD for 275-byte summaries, but then it outperforms LEAD for full length summaries on the metrics R1, R2 and RL. This shows that POINTERNET, with its attention over sentences in the document, is capable of exploring more than the first few sentences in the document, but it is still behind XNET, which is better at identifying salient sentences in the document. XNET performs significantly better than POINTERNET by 0.8 points for 275-byte summaries and by 1.9 points for full length summaries, on average for all ROUGE scores.

Human Evaluation
We complement our automatic evaluation results with human evaluation. We randomly selected 20 articles from the test set.
Annotators were presented with a news article and summaries from four different systems. These include the LEAD baseline, POINTERNET, XNET and the human authored highlights. We followed the guidelines in Cheng and Lapata (2016), and asked our participants to rank the summaries from best (1st) to worst (4th) in order of informativeness (does the summary capture important information in the article?) and fluency (is the summary written in well-formed English?). We did not allow any ties and we only sampled articles with nonidentical summaries. We assigned this task to five annotators who were proficient English speakers. Each annotator was presented with all 20 articles. The order of summaries to rank was randomized per article. An example of summaries our subjects ranked is provided in the supplementary material.
The results of our human evaluation study are shown in Table 3. As one might imagine, HUMAN gets ranked 1st most of the time (41%). However, it is closely followed by XNET, which ranked 1st 28% of the time. In comparison, POINTERNET and LEAD were mostly ranked at 3rd and 4th places. We also carried out pairwise comparisons between all models in Table 3 for their statistical significance using a one-way ANOVA with post-hoc Tukey HSD tests (p < 0.01). It showed that XNET is significantly better than LEAD and POINTERNET, and it does not differ significantly from HUMAN. On the other hand, POINTERNET does not differ significantly from LEAD and it differs significantly from both XNET and HUMAN. The human evaluation results corroborate our empirical results in Table 1 and Table 2: XNET is better than LEAD and POINTERNET in producing informative and fluent summaries.
NewsQA was especially designed to present lexical and syntactic divergence between questions and answers. It contains 119,633 questions posed by crowdworkers on 12,744 CNN articles previously collected by Hermann et al. (2015). In a similar manner, SQuAD associates 100,000+ questions with a Wikipedia article's first paragraph, for 500+ previously chosen articles. WikiQA was collected by mining web-searching query logs and then associating them with the summary section of the Wikipedia article presumed to be related to the topic of the query. A similar collection procedure was followed to create MSMarco, with the difference that each candidate answer is a whole paragraph from a different browsed website associated with the query.
We follow the widely used setup of leaving out unanswered questions (Trischler et al., 2016) and adapt the format of each dataset to our task of answer sentence selection by labeling a candidate sentence with 1 if any answer span is contained in that sentence. In the case of MSMarco, each candidate paragraph comes associated with a label, hence we treat each one as a single long sentence. Since SQuAD keeps the official test dataset hidden and MSMarco does not provide labels for its released test set, we report results on their official validation sets. For validation, we set apart 10% of each official training set.

Comparison Systems
We compared the output of our model against the ISF (Trischler et al., 2016) and LOCALISF baselines. Given an article, the sentence with the highest ISF score is selected as an answer for the ISF baseline and the sentence with the highest local ISF score for the LOCALISF baseline. We also compare our model against a neural network (PAIRCNN) that encodes (question, candidate) in an isolated manner as in previous work (Yin et al., 2016). The architecture uses the sentence encoder explained in earlier sections to learn the question and candidate representations. The distribution over labels is given by $p(y_t \mid q) = p(y_t \mid s_t, q) = \mathrm{softmax}(g(s_t, q))$ where $g(s_t, q) = \mathrm{ReLU}(W_{sq} \cdot [s_t; q] + b_{sq})$. In addition, we also compare our model against AP-CNN (dos Santos et al., 2016), ABCNN (Yin et al., 2016), L.D.C (Wang and Jiang, 2017), KV-MemNN (Miller et al., 2016), and COMPAGGR, a state-of-the-art system.
We experiment with several variants of our model. XNET is the vanilla version of our sentence extractor conditioned only on the query q as external information (Eq. (3)). XNET+ is an extension of XNET which uses ISF, IDF and local ISF scores in addition to the query q as external information (Eq. (4)). We also experimented with a baseline XNETTOPK where we choose the top k sentences with the highest ISF scores, and then among them choose the one with the highest probability according to XNET. In our experiments, we set k = 5. Finally, we experimented with an ensemble network LRXNET which combines the XNET score, the COMPAGGR score and other word-overlap-based scores (tweaked and optimized for each dataset separately) for each sentence using a logistic regression classifier. It uses ISF and LocalISF scores for NewsQA, IDF and ISF scores for SQuAD, sentence length, IDF and ISF scores for WikiQA, and word overlap and ISF score for MSMarco. We refer the reader to the supplementary material for more implementation and optimization details to replicate our results.
Table 4 (caption, partially recovered): comparison with baselines including ABCNN (Yin et al., 2016), L.D.C (Wang and Jiang, 2017), KV-MemNN (Miller et al., 2016), and COMPAGGR. (WGT) WRD CNT stands for the (weighted) word count baseline. See text for more details.
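The XNETTOPK baseline described above amounts to a hard two-stage filter, which can be sketched as follows. The function name and the toy scores are hypothetical; the model probabilities are taken as given.

```python
from typing import List


def xnet_topk_select(isf_scores: List[float], xnet_probs: List[float], k: int = 5) -> int:
    """Keep the k sentences with the highest ISF scores, then return the index
    of the one with the highest model probability among them."""
    if len(isf_scores) != len(xnet_probs):
        raise ValueError("need one ISF score and one probability per sentence")
    by_isf = sorted(range(len(isf_scores)), key=lambda i: isf_scores[i], reverse=True)
    shortlist = by_isf[:k]
    return max(shortlist, key=lambda i: xnet_probs[i])


# Hypothetical scores for six candidate sentences.
isf = [0.1, 2.0, 1.5, 0.0, 0.3, 0.2]
probs = [0.99, 0.20, 0.60, 0.10, 0.30, 0.10]
print(xnet_topk_select(isf, probs, k=2))
```

In this toy case the sentence with the highest model probability (index 0) is never considered when k = 2, because the filter is a hard constraint rather than a soft feature.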

Evaluation Metrics
We consider metrics that evaluate systems that return a ranked list of candidate answers: mean average precision (MAP), mean reciprocal rank (MRR), and accuracy (ACC).
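For reference, the three metrics can be computed as in the sketch below, using their standard definitions over ranked candidate lists. It assumes every question has at least one correct candidate, in line with the setup of leaving out unanswered questions; function names and toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_labels(scores: List[float], labels: List[int]) -> List[int]:
    """Reorder binary labels by descending candidate score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def average_precision(ranked: List[int]) -> float:
    """Mean of precision@rank taken at the rank of each correct candidate."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(ranked: List[int]) -> float:
    """1 / rank of the first correct candidate."""
    for rank, label in enumerate(ranked, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def evaluate(questions: List[Tuple[List[float], List[int]]]) -> Dict[str, float]:
    """MAP, MRR and top-1 accuracy averaged over questions."""
    ranked = [rank_labels(scores, labels) for scores, labels in questions]
    n = len(ranked)
    return {
        "MAP": sum(average_precision(r) for r in ranked) / n,
        "MRR": sum(reciprocal_rank(r) for r in ranked) / n,
        "ACC": sum(1 for r in ranked if r[0] == 1) / n,
    }


# Two hypothetical questions: (candidate scores, gold labels).
toy = [([0.9, 0.1, 0.5], [0, 1, 1]), ([0.2, 0.8], [0, 1])]
print(evaluate(toy))
```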
Results Table 4 gives the results for the test sets of NewsQA and WikiQA, and the original validation sets of SQuAD and MSMarco. Our first observation is that XNET outperforms PAIRCNN, supporting our claim that it is beneficial to read the whole document in order to make decisions, instead of only observing each candidate in isolation. Secondly, we can observe that ISF is indeed a strong baseline that outperforms XNET. This means that just "reading" the document using a vanilla version of XNET is not sufficient, and help is required through a coarse filtering. Indeed, we observe that XNET+ outperforms all baselines except for COMPAGGR. Our ensemble model LRXNET can ultimately surpass COMPAGGR on the majority of the datasets.
This consistent behavior validates the machine reading capabilities and the improved document representation with external features of our model for answer selection. Specifically, the combination of document reading and word overlap features needs to be done in a soft manner, using a classification technique. Using it as a hard constraint, with XNETTOPK, does not achieve the best result. We believe that the ISF score is often a better indicator of answer presence in the vicinity of a certain candidate than in the candidate itself. As such, XNET+ is capable of using this feature in datasets with richer context.
It is worth noting that the improvement gained by LRXNET over the state-of-the-art follows a pattern. For the SQuAD dataset, the results are comparable (less than 1%). However, the improvement for WikiQA reaches ∼3% and then the gap shrinks again for NewsQA, with an improvement of ∼1%. This could be explained by the fact that each sample of SQuAD is a paragraph, compared to an article summary for WikiQA, and to an entire article for NewsQA. Hence, we further strengthen our hypothesis that a richer context is needed to achieve better results, in this case expressed as document length, but as the length of the context increases the limitation of sequential models to learn from long rich sequences arises.
Interestingly, our model lags behind COMPAGGR on the MSMarco dataset. It turns out this is due to contextual independence between candidates in the MSMarco dataset, i.e., each candidate is a stand-alone paragraph in this dataset, in contrast to contextually dependent candidate sentences from a document in the NewsQA, SQuAD and WikiQA datasets. As a result, our models (XNET+ and LRXNET) with document reading abilities perform poorly. This can be observed by the fact that XNET and PAIRCNN obtain comparable results. COMPAGGR performs better because comparing each candidate independently is a better strategy.

Conclusion
We describe an approach to model documents while incorporating external information that informs the representations learned for the sentences in the document. We implement our approach through an attention mechanism of a neural network architecture for modeling documents.
Our experiments with extractive document summarization and answer selection tasks validate our model in two ways: first, we demonstrate that external information is important to guide document modeling for natural language understanding tasks. Our model uses image captions and the title of the document for document summarization, and the query with word overlap features for answer selection, and it outperforms its counterparts that do not use this information. Second, our external attention mechanism successfully guides the learning of the document representation for the relevant end goal. For answer selection, we show that inserting the query with word overlap features using our external attention mechanism outperforms state-of-the-art systems that naturally also have access to this information.