Entity Commonsense Representation for Neural Abstractive Summarization

A major proportion of a text summary includes important entities found in the original text. These entities build up the topic of the summary. Moreover, they hold commonsense information once they are linked to a knowledge base. Based on these observations, this paper investigates the usage of linked entities to guide the decoder of a neural text summarizer to generate concise and better summaries. To this end, we leverage an off-the-shelf entity linking system (ELS) to extract linked entities and propose Entity2Topic (E2T), a module easily attachable to a sequence-to-sequence model that transforms a list of entities into a vector representation of the topic of the summary. Currently available ELS's are still not sufficiently effective, possibly introducing unresolved ambiguities and irrelevant entities. We resolve the imperfections of the ELS by (a) encoding entities with selective disambiguation, and (b) pooling entity vectors using firm attention. By applying E2T to a simple sequence-to-sequence model with attention mechanism as the base model, we see significant performance improvements of at least 2 ROUGE points on the Gigaword (sentence to title) and CNN (long document to multi-sentence highlights) summarization datasets.


Introduction
Text summarization is the task of generating a shorter and concise version of a text while preserving the meaning of the original text. The task can be divided into two subtasks based on the approach: extractive and abstractive summarization. Extractive summarization creates summaries by pulling out snippets of text from the original text and combining them to form a summary. Abstractive summarization generates summaries from scratch without the restriction to use the available words from the original text. Due to the limitations of extractive summarization on incoherent texts and unnatural methodology (Yao et al., 2017), the research trend has shifted towards abstractive summarization. Sequence-to-sequence models with attention mechanism (Bahdanau et al., 2014) have found great success in generating abstractive summaries, both from a single sentence (Chopra et al., 2016) and from a long document with multiple sentences (Chen et al., 2016). However, when generating summaries, it is necessary to determine the main topic and to sift out unnecessary information that can be omitted. Sequence-to-sequence models have the tendency to include all the information, relevant or not, that is found in the original text. This may result in unconcise summaries that concentrate wrongly on irrelevant topics. The problem is especially severe when summarizing longer texts.
* Amplayo and Lim are co-first authors with equal contribution. Names are arranged alphabetically.
In this paper, we propose to use entities found in the original text to infer the summary topic, mitigating the aforementioned problem. Specifically, we leverage linked entities extracted by employing a readily available entity linking system. The importance of using linked entities in summarization is intuitive and can be explained by looking at Figure 1 as an example. First (O1 in the Figure), aside from auxiliary words to construct a sentence, a summary is mainly composed of linked entities extracted from the original text. Second (O2), we can depict the main topic of the summary as a probability distribution of relevant entities from the list of entities. Finally (O3), we can leverage entity commonsense learned from a separate large knowledge base such as Wikipedia.
To this end, we present a method to effectively apply linked entities in sequence-to-sequence models, called Entity2Topic (E2T). E2T is a module that can be easily attached to any sequence-to-sequence based summarization model. The module encodes the entities extracted from the original text by an entity linking system (ELS), constructs a vector representing the topic of the summary to be generated, and informs the decoder about the constructed topic vector. Due to the imperfections of current ELS's, the extracted linked entities may be too ambiguous and coarse to be considered relevant to the summary. We solve this issue by using entity encoders with selective disambiguation and by constructing topic vectors using firm attention.
We experiment on two datasets, Gigaword and CNN, with varying lengths. We show that applying our module to a sequence-to-sequence model with attention mechanism significantly increases its performance on both datasets. Moreover, when compared with the state-of-the-art models for each dataset, the model obtains a comparable performance on the Gigaword dataset where the texts are short, and outperforms all competing models on the CNN dataset where the texts are longer. Furthermore, we provide analysis on how our model effectively uses the extracted linked entities to produce concise and better summaries.

Usefulness of linked entities in summarization
In the next subsections, we present detailed arguments, with empirical and previously examined evidence, on the observations and possible issues when using linked entities extracted by an entity linking system (ELS) for generating abstractive summaries. For this purpose, we use the development sets of the Gigaword dataset provided in (Rush et al., 2015) and of the CNN dataset provided in (Hermann et al., 2015) as the experimental data for quantitative evidence and refer the readers to Figure 1 as the running example.

Observations
As discussed in Section 1, we find three observations that show the usefulness of linked entities for abstractive summarization. First, summaries are mainly composed of linked entities extracted from the original text. In the example, it can be seen that the summary contains four words that refer to different entities. In fact, all noun phrases in the summary mention at least one linked entity. In our experimental data, we extract linked entities from the original text and compare them to the noun phrases found in the summary. We report that 77.1% and 75.1% of the noun phrases on the Gigaword and CNN datasets, respectively, contain at least one linked entity, which confirms our observation.
Second, linked entities can be used to represent the topic of the summary, defined as a multinomial distribution over entities, as graphically shown in the example, where the probabilities refer to the relevance of the entities. Entities have been previously used to represent topics (Newman et al., 2006), as they can be utilized as a controlled vocabulary of the main topics in a document (Hulpus et al., 2013). In the example, we see that the entity "Jae Seo" is the most relevant because it is the subject of the summary, while the entity "South Korean" is less relevant because it is less important when constructing the summary.
Third, we can make use of the entity commonsense that can be learned as a continuous vector representation from a separate larger corpus (Ni et al., 2016; Yamada et al., 2017). In the example, if we know that the entities "Los Angeles Dodgers" and "New York Mets" are American baseball teams and "Jae Seo" is a baseball player associated with the teams, then we can use this information to generate more coherent summaries. We find that 76.0% of the extracted linked entities are covered by the pre-trained vectors 1 in our experimental data, which supports our third observation.

Possible issues
Despite its usefulness, linked entities extracted from ELS's have issues because of low precision rates (Hasibi et al., 2016) and design challenges in training datasets (Ling et al., 2015). These issues can be summarized into two parts: ambiguity and coarseness.
First, the extracted entities may be ambiguous. In the example, the entity "South Korean" is ambiguous because it can refer to both the South Korean person and the South Korean language, among others 2 . In our experimental data, we extract (1) the top 100 entities based on frequency, and (2) the entities extracted from 100 randomly selected texts, and check whether they have disambiguation pages in Wikipedia or not. We discover that 71.0% of the top 100 entities and 53.6% of the entities picked at random have disambiguation pages, which shows that most entities are prone to ambiguity problems.
Second, the linked entities may also be too common to be considered an entity. This may introduce errors and irrelevance to the summary. In the example, "Wednesday" is erroneous because it is wrongly linked to the entity "Wednesday Night Baseball". Also, "swap" is irrelevant because although it is linked correctly to the entity "Trade (Sports)", it is too common and irrelevant when generating the summaries. In our experimental data, we randomly select 100 data instances and tag the correctness and relevance of extracted entities into one of four labels: A: correct and relevant, B: correct and somewhat relevant, C: correct but irrelevant, and D: incorrect. Results show that 29.4%, 13.7%, 30.0%, and 26.9% are tagged with A, B, C, and D, respectively, which shows that there is a large amount of incorrect and irrelevant entities.

Our model
To solve the issues described above, we present Entity2Topic (E2T), a module that can be easily attached to any sequence-to-sequence based abstractive summarization model. E2T encodes the linked entities extracted from the text and transforms them into a single topic vector. This vector is ultimately concatenated to the decoder hidden state vectors. The module contains two submodules specifically for the issues presented by the entity linking systems: the entity encoding submodule with selective disambiguation and the pooling submodule with firm attention.
Overall, our full architecture is illustrated in Figure 2. It consists of an entity linking system (ELS), a sequence-to-sequence model with attention mechanism, and the E2T module. We note that our proposed module can be easily attached to more sophisticated abstractive summarization models (Zhou et al., 2017; Tan et al., 2017) that are based on the traditional encoder-decoder framework and consequently can produce better results. The code of the base model and the E2T are available online 3 .

Base model
As our base model, we employ a basic encoder-decoder RNN used in most neural machine translation (Bahdanau et al., 2014) and text summarization (Nallapati et al., 2016) tasks. We employ a two-layer bidirectional GRU (BiGRU) as the recurrent unit of the encoder. The BiGRU consists of a forward and a backward GRU, which, for an input of $n$ tokens, results in sequences of forward and backward hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_n)$ and $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_n)$, respectively. The forward and backward hidden states are concatenated to get the hidden state vectors of the tokens, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$. The final states of the forward and backward GRU are also concatenated to create the final text representation vector of the encoder, $s = [\overrightarrow{h}_n; \overleftarrow{h}_1]$. These values are calculated per layer, where the input $x_i$ of the second layer is the hidden state $h_i$ of the first layer. The final text representation vectors are projected by a fully connected layer and are passed to the decoder as the initial hidden states $s_0 = s$.
For the decoder, we use a two-layer unidirectional GRU with attention. At each time step $t$, the previous token $y_{t-1}$, the previous hidden state $s_{t-1}$, and the previous context vector $c_{t-1}$ are passed to a GRU to calculate the new hidden state $s_t = \mathrm{GRU}(y_{t-1}, s_{t-1}, c_{t-1})$. The context vector $c_t$ is computed using the additive attention mechanism (Bahdanau et al., 2014), which matches the current decoder state $s_t$ and each encoder state $h_i$ to get an importance score $g_{t,i}$. The scores are then passed to a softmax to obtain the attention weights $a_{t,i} = \exp(g_{t,i}) / \sum_{j} \exp(g_{t,j})$, which are used to pool the encoder states using a weighted sum. The final pooled vector is the context vector $c_t = \sum_{i} a_{t,i} h_i$.
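The softmax-and-weighted-sum pooling described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the importance scores are taken as already computed (in the model they come from the additive attention scorer), and the function names `softmax` and `attend` as well as the toy inputs are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of finite floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, states):
    """Pool encoder states with attention weights derived from the scores.

    scores: one importance score per encoder state (assumed precomputed).
    states: list of equal-length vectors, each a list of floats.
    Returns (weights, context_vector).
    """
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * h[j] for w, h in zip(weights, states)) for j in range(dim)]
    return weights, context

# Toy usage: three 2-dimensional encoder states with made-up scores.
weights, context = attend([2.0, 0.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
```

The state with the highest score dominates the context vector, while every other state still receives a non-zero weight.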

Figure 1 (running example), input text: The Los Angeles Dodgers acquired South Korean right-hander Jae Seo from the New York Mets on Wednesday in a four-player swap.
Finally, the previous token y t−1 , the current context vector c t , and the current decoder state s t are used to generate the current word y t with a softmax layer over the decoder vocabulary, as shown below.

Entity encoding submodule
After performing entity linking to the input text using the ELS, we receive a sequential list of linked entities, arranged based on their location in the text. We embed these entities to $d$-dimensional vectors $E = \{e_1, e_2, \ldots, e_m\}$ where $e_i \in \mathbb{R}^d$. Since these entities may still contain ambiguity, it is necessary to resolve them before applying them to the base model. Based on the idea that an ambiguous entity can be disambiguated using its neighboring entities, we introduce two kinds of disambiguating encoders below.
Globally disambiguating encoder. One way to disambiguate an entity is by using all the other entities, putting more importance on entities that are nearer. For this purpose, we employ an RNN-based model to globally disambiguate the entities. Specifically, we use a BiGRU over the entity list and concatenate the forward and backward hidden state vectors of each entity as the new entity vector, $e'_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.
Locally disambiguating encoder. Another way to disambiguate an entity is by using only the direct neighbors of the entity, putting no importance on entities that are far. To do this, we employ a CNN-based model to locally disambiguate the entities. Specifically, we apply the convolution operation using filter matrices $W_f \in \mathbb{R}^{h \times d}$ with filter size $h$ to a window of $h$ entities. We do this for different sizes of $h$. This produces new feature vectors $c_{i,h}$, obtained by applying a non-linear function $f(\cdot)$ to the output of the convolution. The convolution operation reduces the number of entities differently depending on the filter size $h$. To prevent loss of information and to produce the same number of feature vectors $c_{i,h}$, we pad the entity list dynamically such that when the filter size is $h$, the number of paddings on each side is $(h-1)/2$. The filter size $h$ therefore refers to the number of entities used to disambiguate a middle entity. Finally, we concatenate all feature vectors $c_{i,h}$ of the different filter sizes $h$ as the new entity vector $e'_i$.
The question of which disambiguating encoder is better has been a matter of debate; some argued that using only the local context is appropriate (Lau et al., 2013) while some claimed that additionally using global context also helps (Wang et al., 2015). The RNN-based encoder is good as it makes use of all entities; however, it may perform badly when there are many entities, as it introduces noise when using a far entity during disambiguation. The CNN-based encoder is good as it minimizes the noise by totally ignoring far entities when disambiguating; however, determining the appropriate filter sizes $h$ needs engineering. Overall, we argue that when the input text is short (e.g. a sentence), both encoders perform comparably, whereas when the input text is long (e.g. a document), the CNN-based encoder performs better.
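The padding scheme of the locally disambiguating encoder can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: it uses a single filter that yields one scalar feature per entity, `tanh` as the non-linear function, and, for even filter sizes (where $(h-1)/2$ is not an integer), it assumes the extra padding goes on the right. The names `pad_entities`, `windows`, `conv_feature`, and `local_encode` are hypothetical.

```python
import math

def pad_entities(entities, h, pad_vector):
    """Pad the entity list so that a width-h window yields one output per entity."""
    left = (h - 1) // 2
    right = (h - 1) - left  # equals left when h is odd (assumption for even h)
    return [pad_vector] * left + list(entities) + [pad_vector] * right

def windows(entities, h, pad_vector):
    """Return one window of h entity vectors for each original entity."""
    padded = pad_entities(entities, h, pad_vector)
    return [padded[i:i + h] for i in range(len(entities))]

def conv_feature(window, filt):
    """Apply one h-by-d filter to one window, followed by tanh."""
    total = sum(w * x
                for row_w, row_x in zip(filt, window)
                for w, x in zip(row_w, row_x))
    return math.tanh(total)

def local_encode(entities, filt, pad_vector):
    """Compute one feature per entity for a single filter of size len(filt)."""
    h = len(filt)
    return [conv_feature(win, filt) for win in windows(entities, h, pad_vector)]

# Toy usage: four 2-dimensional entity vectors and a made-up filter with h = 3.
entities = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
features = local_encode(entities, [[0.1, 0.1], [0.5, 0.5], [0.1, 0.1]], [0.0, 0.0])
```

For odd $h$, the entity being disambiguated sits in the middle of its window, and the number of output features equals the number of input entities regardless of the filter size.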
Selective disambiguation. It is obvious that not all entities need to be disambiguated. When a correctly linked and already adequately disambiguated entity is disambiguated again, it would make the entity very context-specific and might not be suitable for the summarization task. Our entity encoding submodule therefore uses a selective mechanism that decides whether to use the disambiguating encoder or not. This is done by introducing a selective disambiguation gate $d$. The final entity vector $\tilde{e}_i$ is calculated as the linear transformation of the original entity vector $e_i$ and the disambiguated entity vector $e'_i$. The full entity encoding submodule is illustrated in Figure 3. Ultimately, the submodule outputs the disambiguated entity vectors $\tilde{E} = \{\tilde{e}_1, \tilde{e}_2, \ldots, \tilde{e}_m\}$.
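One way to picture the selective gate is as a weight that interpolates between the original entity vector and its disambiguated counterpart. The sketch below is an illustration under assumptions, not the paper's exact parameterization: it assumes a sigmoid gate computed from the disambiguated vector, assumes both vectors already share the same dimensionality, and omits any learned projections. The names `selective_disambiguation`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def selective_disambiguation(e, e_dis, gate_weights, gate_bias):
    """Mix an entity vector with its disambiguated version using a gate d.

    e:      original entity vector (list of floats).
    e_dis:  disambiguated entity vector, same length as e (assumption).
    Returns (d, mixed). A gate d close to 1 favors the disambiguated vector,
    a gate close to 0 keeps the original entity vector.
    """
    d = sigmoid(sum(w * x for w, x in zip(gate_weights, e_dis)) + gate_bias)
    mixed = [d * b + (1.0 - d) * a for a, b in zip(e, e_dis)]
    return d, mixed

# Toy usage with made-up vectors and an untrained (all-zero) gate.
d, mixed = selective_disambiguation([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], 0.0)
```

Under this reading, a high gate value means the entity is mostly replaced by its context-dependent representation, and a low value means the entity is left largely as linked.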

Pooling submodule
The entity vectors $\tilde{E}$ are pooled to create a single topic vector $t$ that represents the topic of the summary. One possible pooling technique is to use soft attention (Xu et al., 2015) on the vectors to determine the importance value of each vector, which can be done by matching each entity vector with the text vector $s$ from the text encoder as the context vector. The entity vectors are then pooled using a weighted sum. One problem with soft attention is that it considers all entity vectors when constructing the topic vector. However, not all entities are important and necessary when generating summaries. Moreover, a number of these entities may be erroneous and irrelevant, as reported in Section 2.2. Soft attention gives non-negligible importance scores to these entities, thus adding unnecessary noise to the construction of the topic vector.
Our pooling submodule instead uses a firm attention mechanism to consider only the top $k$ entities when constructing the topic vector. This is done in a differentiable way as follows. Let $G$ be the vector of importance scores of the entity vectors. The function $K = \mathrm{top\_k}(G)$ gets the indices of the top $k$ vectors in $G$, and $P = \mathrm{sparse\_vector}(K, 0, -\infty)$ creates a sparse vector whose entries are 0 at the indices in $K$ and $-\infty$ otherwise 4 . The sparse vector $P$ is added to the original importance score vector $G$ to create a new importance score vector. In this new vector, the importance scores of non-top-$k$ entities are $-\infty$. When softmax is applied, this gives very small, negligible, close-to-zero values to non-top-$k$ entities. The value $k$ depends on the lengths of the input text and summary. Moreover, when $k$ increases towards infinity, firm attention becomes soft attention. We decide $k$ empirically (see Section 5).
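The masking trick behind firm attention can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: the importance scores are assumed to be finite and already computed, the mask uses a true floating-point negative infinity, and the function name `firm_attention` and the toy inputs are hypothetical.

```python
import math

def firm_attention(scores, vectors, k):
    """Pool vectors using softmax weights restricted to the top-k scores.

    scores:  finite importance scores, one per entity vector (the vector G).
    vectors: entity vectors, each a list of floats of equal length.
    Returns (weights, topic_vector).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    k = min(k, len(scores))
    # K: indices of the k largest scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # P: 0 at the top-k indices, -infinity everywhere else.
    mask = [-math.inf] * len(scores)
    for i in top:
        mask[i] = 0.0
    masked = [g + p for g, p in zip(scores, mask)]
    # Softmax; exp(-inf) evaluates to 0.0, so masked entities get zero weight here.
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(vectors[0])
    topic = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]
    return weights, topic

# Toy usage: four entities with made-up scores, keeping only the top two.
weights, topic = firm_attention(
    [3.0, 1.0, 2.0, 0.5],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    k=2,
)
```

Setting `k` to the number of entities (or larger) leaves the mask all zeros, so the function reduces to ordinary soft attention, which matches the limiting behavior described above.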

Extending from the base model
The Entity2Topic module extends the base model as follows. The final text representation vector $s$ is used as a context vector when constructing the topic vector $t$ in the pooling submodule. The topic vector $t$ is then concatenated to the decoder hidden state vectors $s_i$, i.e. $s'_i = [s_i; t]$. The concatenated vector is finally used to create the output vector.

Related work
Due to their recent success, neural network models have been used with competitive results on abstractive summarization. A neural attention model was first applied to the task, easily achieving state-of-the-art performance on multiple datasets (Rush et al., 2015). The model has been extended to instead use a recurrent neural network as decoder (Chopra et al., 2016). The model was further extended to use a full RNN encoder-decoder framework and further enhancements through lexical and statistical features (Nallapati et al., 2016). The current state-of-the-art performance is achieved by selectively encoding words as a process of distilling salient information (Zhou et al., 2017).
Neural abstractive summarization models have also been explored to summarize longer documents. Word extraction models have been previously explored, performing worse than sentence extraction models (Cheng and Lapata, 2016). Hierarchical attention-based recurrent neural networks have also been applied to the task, owing to the idea that there are multiple sentences in a document (Nallapati et al., 2016). Finally, distraction-based models were proposed to enable models to traverse the text content and grasp the overall meaning (Chen et al., 2016). The current state-of-the-art performance is achieved by a graph-based attentional neural model, considering the key factors of document summarization such as saliency, fluency and novelty (Tan et al., 2017).
Previous studies on the summarization tasks have only used entities in the preprocessing stage to anonymize the dataset (Nallapati et al., 2016) and to mitigate out-of-vocabulary problems (Tan et al., 2017). Linked entities for summarization are still not properly explored and we are the first to use linked entities to improve the performance of the summarizer.

Experimental settings
Datasets. We use two widely used summarization datasets with different text lengths. First, we use the Annotated English Gigaword dataset as used in (Rush et al., 2015). This dataset receives the first sentence of a news article as input and uses the headline title as the gold standard summary. Since the development dataset is large, we randomly selected 2000 pairs as our development dataset. We use the same held-out test dataset used in (Rush et al., 2015) for comparison. Second, we use the CNN dataset released in (Hermann et al., 2015). This dataset receives the full news article as input and uses the human-generated multiple sentence highlight as the gold standard summary. The original dataset has been modified and preprocessed specifically for the document summarization task (Nallapati et al., 2016). In addition to the previously provided datasets, we extract linked entities using Dexter 5 (Ceccarelli et al., 2013), an open source ELS that links text snippets found in a given text to entities contained in Wikipedia. We use the default recommended parameters stated on the website. We summarize the statistics of both datasets in Table 1.
Implementation. For both datasets, we further reduce the size of the input, output, and entity vocabularies to at most 50K as suggested in (See et al., 2017) and replace less frequent words with "<unk>". We use 300D GloVe 6 (Pennington et al., 2014) and 1000D wiki2vec 7 pre-trained vectors to initialize our word and entity vectors. For GRUs, we set the state size to 500. For CNN, we set $h = 3, 4, 5$ with 400, 300, 300 feature maps, respectively. For firm attention, $k$ is tuned by calculating the perplexity of the model starting with smaller values (i.e. $k = 1, 2, 5, 10, 20, \ldots$) and stopping when the perplexity of the model becomes worse than that of the previous model. Our preliminary tuning showed that $k = 5$ for the Gigaword dataset and $k = 10$ for the CNN dataset are the best choices. We use dropout (Srivastava et al., 2014) on all non-linear connections with a dropout rate of 0.5. We set the batch sizes of the Gigaword and CNN datasets to 80 and 10, respectively. Training is done via stochastic gradient descent over shuffled mini-batches with the Adadelta update rule, with an $l_2$ constraint (Hinton et al., 2012) of 3. We perform early stopping using a subset of the given development dataset. We use beam search of size 10 to generate the summary.
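The tuning procedure for $k$ described above amounts to a simple early-stopping search over an increasing list of candidates. The sketch below illustrates the loop only: `tune_k` is a hypothetical helper, and `toy_perplexity` is a synthetic stand-in for training a model with a given $k$ and measuring its perplexity, so its values are made up and are not results from the paper.

```python
def tune_k(candidates, perplexity_of):
    """Try candidate k values in order and stop once perplexity gets worse.

    candidates:    increasing values of k to try, e.g. [1, 2, 5, 10, 20].
    perplexity_of: callable mapping k to the perplexity of a model using that k
                   (assumed to be provided by the training pipeline).
    Returns (best_k, best_perplexity).
    """
    best_k, best_ppl = None, None
    for k in candidates:
        ppl = perplexity_of(k)
        if best_ppl is not None and ppl > best_ppl:
            break  # worse than the previous model: keep the previous k
        best_k, best_ppl = k, ppl
    return best_k, best_ppl

def toy_perplexity(k):
    """Synthetic curve with a minimum near k = 6 (illustrative values only)."""
    return (k - 6) ** 2 + 20

best_k, best_ppl = tune_k([1, 2, 5, 10, 20], toy_perplexity)
```

Because the search stops at the first increase in perplexity, larger candidates are never evaluated, which keeps the number of trained models small at the cost of possibly missing a later, better value.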
Baselines. For the Gigaword dataset, we compare our models with the following abstractive baselines: ABS+ (Rush et al., 2015) is a fine-tuned version of ABS which uses an attentive CNN encoder and an NNLM decoder, Feat2s (Nallapati et al., 2016) is an RNN sequence-to-sequence model with lexical and statistical features in the encoder, Luong-NMT (Luong et al., 2015) is a two-layer LSTM encoder-decoder model, RAS-Elman (Chopra et al., 2016) uses an attentive CNN encoder and an Elman RNN decoder, and SEASS (Zhou et al., 2017) uses BiGRU encoders and GRU decoders with selective encoding. For the CNN dataset, we compare our models with the following extractive and abstractive baselines: Lead-3 is a strong baseline that extracts the first three sentences of the document as summary, LexRank extracts texts using LexRank (Erkan and Radev, 2004), Bi-GRU is a non-hierarchical one-layer sequence-to-sequence abstractive baseline, Distraction-M3 (Chen et al., 2016) uses a sequence-to-sequence abstractive model with distraction-based networks, and GBA (Tan et al., 2017) is a graph-based attentional neural abstractive model. All baseline results used beam search and are gathered from previous papers. Also, we compare our final model BASE+E2T with the base model BASE and some variants of our model (without selective disambiguation, using soft attention).
6 https://nlp.stanford.edu/projects/glove/
7 https://github.com/idio/wiki2vec

Results
We evaluate all the competing models on both datasets using ROUGE F1 scores (Lin, 2004). We report the results on the Gigaword and the CNN dataset in Table 2 and Table 3, respectively. On the Gigaword dataset, where the texts are short, our best model achieves a comparable performance with the current state-of-the-art.
On the CNN dataset, where the texts are longer, our best model outperforms all the previous models. We emphasize that the E2T module is easily attachable to better models, and we expect E2T to improve their performance as well. Overall, E2T achieves a significant improvement over the baseline model BASE, with at least 2 ROUGE-1 points increase on the Gigaword dataset and 6 ROUGE-1 points increase on the CNN dataset. In fact, all variants of E2T gain improvements over the baseline, implying that leveraging linked entities improves the performance of the summarizer. Among the model variants, the CNN-based encoder with selective disambiguation and firm attention performs the best.
Automatic evaluation on the Gigaword dataset shows that the CNN and RNN variants of BASE+E2T have similar performance. To break the tie between both models, we also conduct human evaluation on the Gigaword dataset. We instruct two annotators to read the input sentence and rank the competing summaries from first to last according to their relevance and fluency: (a) the original summary GOLD, and from models (b) BASE, (c) BASE+E2T$_{cnn}$, and (d) BASE+E2T$_{rnn}$. We then compute (i) the proportion of every ranking of each model and (ii) the mean rank of each model. The results are reported in Table 4. The model with the best mean rank is BASE+E2T$_{cnn}$, followed by GOLD, then by BASE+E2T$_{rnn}$ and BASE, respectively. We also perform ANOVA and post-hoc Tukey tests to show that the CNN variant is significantly (p < 0.01) better than the RNN variant and the base model. The RNN variant does not perform as well as the CNN variant, contrary to the automatic ROUGE evaluation above. Interestingly, the CNN variant produces better (but with no significant difference) summaries than the gold summaries. We posit that this is due to the fact that the article title does not correspond to the summary of the first sentence.
Selective disambiguation of entities. We show the effectiveness of the selective disambiguation gate $d$ in selecting which entities to disambiguate or not. Table 6 shows a total of four different examples of two entities with the highest/lowest $d$ values. In the first example, sentence E1.1 contains the entity "United States" and is linked with the country entity of the same name; however, the correct linked entity should be "United States Davis Cup team", and therefore it is given a high $d$ value. On the other hand, sentence E1.2 is linked correctly to the country "United States", and thus is given a low $d$ value. The second example provides a similar scenario, where sentence E2.1 is linked to the entity "Gold" but should be linked to the entity "Gold medal". Sentence E2.2 is linked correctly to the chemical element. Hence, the former case received a high $d$ value while the latter case received a low $d$ value.
Entities as summary topic. Finally, we provide one sample for each dataset in Table 5 for case study, comparing our final model that uses firm attention (BASE$_{cnn+sd}$), a variant that uses soft attention (BASE$_{cnn+soft}$), and the baseline model (BASE). We also show the attention weights of the firm and soft models.
In the Gigaword example, we find three observations. First, the base model generated a less informative summary, not mentioning "mexico state" and "first edition". Second, the soft model produced a factually wrong summary, saying that "guadalajara" is a mexican state, while actually it is a city. Third, the firm model is able to solve the problem by focusing only on the five most important entities, eliminating possible noise such as "Unk" and less crucial entities such as "Country club". We can also see the effectiveness of the selective disambiguation in this example, where the entity "U.S. state" is corrected to mean the entity "Mexican state" which becomes relevant and is therefore selected.
In the CNN example, we also find that the baseline model generated a very erroneous summary. We argue that this is because the input text is long and the decoder is not guided as to which topics it should focus on. The soft model generated a much better summary; however, it focuses on the wrong topics, specifically on "Iran's nuclear program", making the summary less general. A quick read of the original article tells us that the main topic of the article is the two political parties arguing over the deal with Iran. However, the entity "nuclear" appeared a lot in the article, which makes the soft model wrongly focus on the "nuclear" entity. The firm model produced the more relevant summary, focusing on the political entities (e.g. "republicans", "democrats"). This is due to the fact that only the k = 10 most important elements are attended to create the summary topic vector.

Table 5: Examples from Gigaword and CNN datasets and corresponding summaries generated by competing models. The tagged part of text is marked bold and preceded with at sign (@). The red color fill represents the attention scores given to each entity. We only report the attention scores of entities in the Gigaword example for conciseness since there are 80 linked entities in the CNN example.
Gigaword Dataset Example
Original: western mexico @state @jalisco will host the first edition of the @UNK dollar @lorena ochoa invitation @golf tournament on nov. ##-## #### , in @guadalajara @country club , the @lorena ochoa foundation said in a statement on wednesday .
CNN Dataset Example
Gold: netanyahu says third option is " standing firm " to get a better deal . political sparring continues in u.s. over the deal with iran .
Baseline: netanyahu says he is a country of " UNK cheating " and that it is a country of " UNK cheating " netanyahu says he is a country of " UNK cheating " and that " is a very bad deal " he says he says he says the plan is a country of " UNK cheating " and that it is a country of " UNK cheating " he says the u.s. is a country of " UNK cheating " and that is a country of " UNK cheating "
Soft: benjamin netanyahu : " i think there 's a third alternative , and that is standing firm , " netanyahu tells cnn . he says he does not roll back iran 's nuclear ambitions . " it does not roll back iran 's nuclear program . "
Firm: new : netanyahu : " i think there 's a third alternative , and that is standing firm , " netanyahu says . obama 's comments come as democrats and republicans spar over the framework announced last week to lift western sanctions on iran .

Table 6 (excerpt; columns: text, d)
Linked entity: https://en.wikipedia.org/wiki/United_States
E1.1 (d = 0.719): andy roddick got the better of dmitry tursunov in straight sets on friday , assuring the @united states a #-# lead over defending champions russia in the #### davis cup final .
E1.2: sir alex ferguson revealed friday that david beckham 's move to the @united states had not surprised him because he knew the midfielder would not return to england if he could not come back to manchester united .

Conclusion
We proposed to leverage linked entities to improve the performance of sequence-to-sequence models on the neural abstractive summarization task. Linked entities are used to guide the decoding process based on the summary topic and commonsense learned from a knowledge base. We introduced Entity2Topic (E2T), a module that is easily attachable to any model using an encoder-decoder framework. E2T applies linked entities to the summarizer by encoding the entities with selective disambiguation and pooling them into one summary topic vector with a firm attention mechanism. We showed that by applying E2T to a basic sequence-to-sequence model, we achieve significant improvements over the base model and consequently achieve a comparable performance with more complex summarization models.