Centroid-based Text Summarization through Compositionality of Word Embeddings

Textual similarity is a crucial aspect of many extractive text summarization methods. A bag-of-words representation cannot capture the semantic relationships between concepts when comparing strongly related sentences that have no words in common. To overcome this issue, in this paper we propose a centroid-based method for text summarization that exploits the compositional capabilities of word embeddings. Evaluations on multi-document and multilingual datasets prove the effectiveness of the continuous vector representation of words compared to the bag-of-words model. Despite its simplicity, our method achieves good performance even in comparison to more complex deep learning models. Our method is unsupervised and can be adopted in other summarization tasks.


Introduction
The goal of text summarization is to produce a shorter version of a source text while preserving the meaning and the key contents of the original. This is a very complex problem, since it requires emulating the cognitive capacity of human beings to generate summaries. Thus, text summarization poses open challenges in both natural language understanding and generation. Due to the difficulty of this task, research work in the literature has focused on the extractive aspect of summarization, where the generated summary is a selection of relevant sentences from a document (or a set of documents) in a copy-paste fashion. A good extractive summarization method must satisfy and optimize two properties, coverage and diversity: the selected sentences should cover a sufficient number of topics from the original source text while avoiding redundant information in the summary. The diversity property is fundamental especially for multi-document summarization: in a news aggregator, for instance, a selection of overly similar sentences may compromise the quality of the generated summary.
An extractive method should define a sentence representation model, a technique for assigning a score to each sentence in the original source, and a ranking module to properly select the most relevant sentences by relying on a similarity function. Following this vision, several summarization methods proposed in the literature use the bag of words (BOW) as the representation model for the sentence scoring and selection modules (Erkan and Radev, 2004; Lin and Bilmes, 2011). Despite their proven effectiveness, these methods rely heavily on the notion of similarity between sentences, and a BOW representation is often not suitable to grasp the semantic relationships between concepts when comparing sentences. For example, consider the two sentences "Syd leaves Pink Floyd" and "Barrett abandons the band": in the BOW model their sparse vector representations are orthogonal, since they have no words in common, yet the two sentences are strongly related.
In an attempt to solve this issue, in this work we propose a novel and simple extractive summarization method based on the geometric meaning of the centroid vector of a (multi-)document, taking advantage of the compositional properties of word embeddings (Mikolov et al., 2013b). Empirically, we prove the effectiveness of word embeddings through a fair comparison with the BOW representation, limiting as much as possible the parameters and the complexity of the method. Surprisingly, the results achieved by our method on the gold-standard DUC-2004 dataset are comparable, and in some cases superior, to those obtained using more complex sentence representations derived from deep learning models.
In the following section we provide a brief description of word embeddings and text summarization methods. The centroid-based summarization method that uses word embeddings is described in Section 3, followed by experimental results in Section 4. Final remarks and a discussion about our future plans are reported in Section 5.

Word Embeddings
A word embedding is a continuous vector representation able to capture syntactic and semantic information about a word. Several methods have been proposed to create word embeddings that follow the Distributional Hypothesis (Harris, 1954). In our work we use two models, continuous bag-of-words and skip-gram (together commonly called word2vec), introduced by Mikolov et al. (2013a). These models learn a vector representation for each word using a neural network language model and can be trained efficiently on billions of words. Word2vec allows learning complex semantic relationships using simple vector operations, such as vec(king) − vec(man) + vec(woman) ≈ vec(queen) and vec(Barrett) − vec(singer) + vec(guitarist) ≈ vec(Gilmour). However, our method is general, and other approaches for building word embeddings can be used (Goldberg, 2015).
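The analogy operation amounts to a nearest-neighbor search over word vectors. The following sketch uses hypothetical, hand-picked three-dimensional embeddings purely for illustration; real word2vec vectors are learned from large corpora and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical toy embeddings; real word2vec vectors are learned, not hand-set.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = emb[a] - emb[b] + emb[c]
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    candidates = {w: v for w, v in emb.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("king", "man", "woman"))  # "queen" with these toy vectors
```

With gensim, the equivalent query is `model.wv.most_similar(positive=["king", "woman"], negative=["man"])` on a trained model.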

Text Summarization
Since the first method proposed by (Luhn, 1958), automatic text summarization has been widely addressed by the research community with the proposal of different methodologies as well as toolkits (Saggion and Gaizauskas, 2004). Good surveys are provided by (Jones, 2007; Saggion and Poibeau, 2013). Since our method exploits word embeddings as an alternative representation to BOW, here we focus on methods sharing this feature. Methods based on matrix factorization, such as Latent Semantic Analysis (LSA) (Ozsoy et al., 2011) and Non-negative Matrix Factorization (NMF) (Lee et al., 2009), aim to uncover the latent factors by producing dense and compact representations of sentences. Recently, riding the wave of the prominent results of modern Deep Learning (DL) models in many natural language processing tasks (Goodfellow et al., 2016), several groups have started to exploit deep neural networks for both abstractive (Rush et al., 2015; Nallapati et al., 2016) and extractive (Kågebäck et al., 2014; Cao et al., 2015; Cheng and Lapata, 2016) text summarization.

Centroid-based Method
The centroid-based method for extractive summarization was introduced by Radev et al. (2004). The centroid represents a pseudo-document that condenses the meaningful information of a document. The main idea is to project into the vector space the vector representations of both the centroid and each sentence of a document, and then to select the sentences closest to the centroid. The original method adopts the BOW model for the vector representations using the tf*idf weighting scheme (Salton and McGill, 1986), where the size of the vectors is equal to that of the document vocabulary. We adapt the centroid-based method by introducing a distributed representation of words, in which each word in a document is represented by a vector of real numbers of an established size. Formally, given a corpus of documents [D1, D2, . . . ] and its vocabulary V with size N = |V|, we define a matrix E ∈ R^{N,k}, the so-called lookup table, whose i-th row is the word embedding, of size k with k << N, of the i-th word in V. The values of the word embedding matrix E are learned using the neural network model introduced by (Mikolov et al., 2013b). The model can be trained on the collection of documents to be summarized or on a larger corpus. This is a peculiar advantage of Representation Learning (RL) (Bengio et al., 2013), which allows reusing external knowledge; this is especially useful for the summarization of documents in specific domains, where large amounts of data are not available.

[Table 1: Centroid words of the Donkey Kong (video game) article having tf-idf values greater than a topic threshold of 0.3. For each centroid word, the five closest words are shown, using the skip-gram model trained on the Wikipedia (en) content. The last column shows the words most similar to the centroid embedding computed using element-wise addition.]

After learning the lookup table, our summarization method consists of four steps: 1) preprocess the input document; 2) build the centroid embedding; 3) compute the sentence scores; 4) select the relevant sentences.

Preprocessing
The first step follows the common pipeline for the summarization task: split the document into sentences, convert all words to lower case and remove stopwords. Stemming is not performed, because we let the word embeddings discover the linguistic regularities of words with the same root (Mikolov et al., 2013c). For instance, the embeddings most similar to the words that compose the centroid vector, using the skip-gram model trained on the Wikipedia content, are reported in Table 1: the closest word to arcade is its plural arcades, whereas the two are orthogonal in the vector space according to the BOW representation.
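The preprocessing step can be sketched as follows. The regex-based sentence splitter and the small stopword list below are simplifications standing in for the nltk components the actual system relies on.

```python
import re

# Tiny illustrative stopword list; the real system uses the nltk stopword corpus.
STOPWORDS = {"the", "a", "an", "is", "was", "in", "of", "for",
             "and", "to", "as", "it", "by"}

def preprocess(document):
    """Split into sentences, lowercase, remove stopwords. No stemming."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    tokenized = []
    for s in sentences:
        tokens = [t for t in re.findall(r"[a-z0-9']+", s.lower())
                  if t not in STOPWORDS]
        tokenized.append(tokens)
    return sentences, tokenized

doc = "Donkey Kong is an arcade game. It was released by Nintendo in 1981."
raw, clean = preprocess(doc)
print(clean)  # [['donkey', 'kong', 'arcade', 'game'], ['released', 'nintendo', '1981']]
```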

Centroid Embedding
In order to build a centroid vector using word embeddings, we first select the meaningful words in the document. For simplicity, and for a fair comparison with the original method, we select those words having a tf*idf weight greater than a topic threshold. Then we compute the centroid embedding as the sum of the embeddings of the top-ranked words in the document, using the lookup table E:

C = Σ_{w ∈ D : tfidf(w) > t} E[idx(w)]    (1)

In eq. (1) we denote with C the centroid embedding of the document D, with t the topic threshold, and with idx(w) a function that returns the index of the word w in the vocabulary. The headers of Table 1 report the centroid words extracted from a Wikipedia article. The last column shows the words most similar to the centroid embedding computed using element-wise addition. It is important to underline that all five words closest to the centroid vector are semantically related to the main topic of the document, despite the size of the Wikipedia vocabulary (about 1 million words).
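A minimal sketch of this step, with hypothetical tf-idf weights and two-dimensional toy embeddings in place of the learned lookup table:

```python
import numpy as np

def centroid_embedding(tfidf, lookup, topic_threshold=0.3):
    """Sum the embeddings of words whose tf-idf weight exceeds the threshold.

    tfidf: word -> tf-idf weight; lookup: word -> embedding vector (toy values).
    """
    selected = [w for w, weight in tfidf.items() if weight > topic_threshold]
    return np.sum([lookup[w] for w in selected], axis=0), selected

lookup = {"arcade": np.array([1.0, 0.0]),
          "game":   np.array([0.8, 0.2]),
          "the":    np.array([0.0, 0.1])}
tfidf = {"arcade": 0.9, "game": 0.6, "the": 0.05}

C, words = centroid_embedding(tfidf, lookup, topic_threshold=0.3)
print(sorted(words), C)  # ['arcade', 'game'] [1.8 0.2]
```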

Sentence Scoring
For each sentence in the document, we create an embedding representation by summing the vectors, stored in the lookup table E, of the words in the sentence:

S_j = Σ_{w ∈ S_j} E[idx(w)]    (2)

In eq. (2) we denote with S_j the j-th sentence of the document D. The sentence score is then computed as the cosine similarity between the embedding of the sentence S_j and that of the centroid C of the document D:

sim(C, S_j) = (C · S_j) / (||C|| ||S_j||)    (3)

Figure 1 shows a visualization of the sentence and centroid embeddings of a Wikipedia article. We use the t-SNE method (van der Maaten and Hinton, 2008) to reduce the dimensionality of the vectors from 300 to 2. For each sentence, the position ID in the document is shown, and the sentences closest to the centroid embedding are marked in green. The words that compose the centroid are the same shown in Table 1.

[Table 2: The most relevant sentences of the Donkey Kong article selected with the centroid-based summarization method using word embeddings. For each sentence, the position ID in the document and the similarity score between the sentence and centroid embeddings are reported. The words that compose the centroid vector are marked in bold; the words most similar to the centroid ones are in italic.]

In Table 2 we report the sentences nearest to the centroid with the related cosine similarity values. As expected, the most relevant sentence (136) contains many words close to the centroid vector. However, the more interesting aspect concerns the last sentence (135), "The NES version was re-released as an unlockable game in Animal Crossing for the GameCube and as an item for purchase on the Wii's Virtual Console", with score 0.9308. Although this sentence does not contain any centroid word, it has a high similarity value, so it is close to the centroid embedding in the vector space. The reason is the presence of words, such as NES, GameCube and Wii, that are among the closest words to the centroid embedding (Table 1). This proves the effectiveness of the compositionality of word embeddings in encoding the semantic relations between words through dense vector representations.

Sentence Selection
The sentences are sorted in descending order of their similarity scores. The top-ranked sentences are iteratively selected and added to the summary until the length limit is reached. In order to avoid redundancy, at each iteration we compute the cosine similarity between the candidate sentence and each sentence already in the summary, and we discard the candidate if the similarity value is greater than a threshold. This procedure is reported in Algorithm 1. Similar sentence selection approaches are described in (Carbonell and Goldstein, 1998; Saggion and Gaizauskas, 2004).
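Algorithm 1 is not reproduced here; the following is a sketch of the greedy selection just described, with a character-count budget standing in for the task's length limit and toy embeddings as input:

```python
import numpy as np

def select(sentences, embeddings, centroid, limit, sim_threshold=0.95):
    """Greedy selection: rank by centroid similarity, skip near-duplicates,
    stop when the character budget `limit` is reached."""
    def cos(u, v):
        d = np.linalg.norm(u) * np.linalg.norm(v)
        return float(np.dot(u, v) / d) if d else 0.0
    order = sorted(range(len(sentences)),
                   key=lambda i: cos(embeddings[i], centroid), reverse=True)
    chosen, length = [], 0
    for i in order:
        if length + len(sentences[i]) > limit:
            break  # length limit reached
        if any(cos(embeddings[i], embeddings[j]) > sim_threshold for j in chosen):
            continue  # discard: too similar to a sentence already selected
        chosen.append(i)
        length += len(sentences[i])
    return [sentences[i] for i in chosen]

sentences = ["an old sentence", "an old sentence", "a new fact"]
embeddings = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
summary = select(sentences, embeddings, np.array([1.0, 0.2]), limit=100)
print(summary)  # the duplicate sentence is discarded
```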

Experiments
In this section we describe the benchmarks conducted on two text summarization tasks. The main goal is to compare the centroid-based method using two different representations (bag of words and word embeddings). In Section 4.1 and Section 4.2 we report the experimental results carried out on the Multi-Document and the Multilingual Single-Document summarization tasks, respectively. For the evaluation we use ROUGE (Lin, 2004), a set of recall-based metrics that compare the automatic and human summaries on the basis of their n-gram overlap. In our experiment, we adopt both ROUGE-1 and ROUGE-2.
Baselines For the comparison, we propose several baselines. First, for a fair comparison we adapt the centroid method proposed by Radev et al. (2004) (C BOW). In the original work, the sentence scores are a linear combination of the centroid score, the positional value and the first-sentence overlap, where the centroid score is the sum of the tf*idf weights of the words occurring both in the sentence and in the centroid. In our experiment, we apply both our sentence scoring and our selection algorithm. LEAD simply chooses the first 665 bytes of the most recent article in each cluster. SumBasic is a simple probabilistic method proposed by (Nenkova and Vanderwende, 2005), commonly used as a baseline in summarization evaluations. Peer65 is the winning system of DUC-2004 Task 2. To compare our method with others that also use compact and dense representations, we use the method proposed by (Lee et al., 2009), which ranks sentences by generic relevance using NMF. Another method often used in summarization evaluations is LexRank (Erkan and Radev, 2004), which uses the TextRank algorithm (Mihalcea and Tarau, 2004) to establish a ranking between sentences. Finally, we compare our method with the one proposed by (Cao et al., 2015).

Implementation Our system is written in Python, relying on the nltk, scikit-learn and gensim libraries for text preprocessing, building the sentence-term matrix and importing the word2vec model. We train the word embeddings on the DUC-2004 corpus using the original word2vec implementation. We test both the continuous bag-of-words (C CBOW) and the skip-gram (C SKIP) neural architectures proposed in (Mikolov et al., 2013a), using the same parameters but varying the embedding sizes. Moreover, we compare our method using the model trained on part of the Google News dataset (C GNEWS), which consists of about 100 billion words. In the preprocessing step, each cluster of documents is split into sentences and stopwords are removed.
We do not perform stemming, as explained in Section 3. To find the best parameter configuration, we run a grid search with the following setting: embedding size in [100, 200, 300, 400, 500], and topic and similarity thresholds in [0, 0.5] and [0.5, 1] respectively, both with a step of 0.01.
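The grid search can be sketched as below; `evaluate` is a hypothetical placeholder for the full summarize-and-score (ROUGE) loop.

```python
import itertools

# Parameter ranges from the paper's grid-search setting.
sizes = [100, 200, 300, 400, 500]
topic_ths = [round(0.01 * i, 2) for i in range(0, 51)]      # [0, 0.5], step 0.01
sim_ths = [round(0.5 + 0.01 * i, 2) for i in range(0, 51)]  # [0.5, 1], step 0.01

def evaluate(size, topic_th, sim_th):
    """Placeholder: would generate summaries and return a ROUGE score."""
    return 0.0

best = max(itertools.product(sizes, topic_ths, sim_ths),
           key=lambda cfg: evaluate(*cfg))
print(len(sizes) * len(topic_ths) * len(sim_ths))  # 13005 configurations
```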

Results and Discussion
The results of the experiment are shown in Table 3. We report the best scores of our method using the three different word2vec models, along with their parameters. For all word embedding models, our method outperforms the original centroid one. In detail, with the skip-gram model we obtain increments of 1.05% and 1.71% with respect to the BOW model using ROUGE-1 and ROUGE-2, respectively. Moreover, our simple method with skip-gram performs better than the more complex models based on RNNs. This proves the effectiveness of the compositional capability of word embeddings in encoding the information of word sequences by applying a simple sum of word vectors, as already shown in (Wieting et al., 2015). Although our method with the model pre-trained on Google News does not achieve the best score, it is interesting to note the flexibility of word embeddings in reusing external knowledge. Regarding the comparison between the BOW and embedding representations, the experiment shows different behaviors of the similarity threshold. In particular, the use of word2vec requires a higher threshold because word embeddings are dense vectors, unlike the sparse BOW representation: the embeddings of sentences are closer to each other in the vector space, so the cosine similarities between them take higher values. The topic threshold also shows different trends, since the word embeddings require a higher threshold value to make our method effective. In order to analyze this aspect, we ran another experiment setting the topic threshold to 0; the results are reported in Table 4. They show that the BOW representation is more stable and obtains the best ROUGE-2 score, while the performance obtained by word2vec decreases considerably. This means that word embeddings are more sensitive to noise and require an accurate choice of the meaningful words that compose the centroid vector.
Summaries Overlap Although the different methods achieve similar ROUGE scores, they do not necessarily generate similar summaries; an example is reported in Table 6. In this section we conduct a further analysis by comparing the summaries generated by the best four configurations of the centroid method reported in Table 3. We adopt the criterion presented in (Hong et al., 2014), where the different summaries are compared in terms of sentence and word overlap using the Jaccard coefficient. Due to space constraints, we report this comparison in Table 5. The analysis shows that similar ROUGE scores can nevertheless lead to different summaries. In particular, the summaries using BOW differ considerably from those generated using word2vec, and this is true even across different embedding models. On the other hand, the models trained on the DUC-2004 corpus (CBOW and SKIP) tend to generate more similar summaries. This analysis suggests that a combination of various models trained on different corpora could result in good performance.
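The word-overlap criterion reduces to the Jaccard coefficient between the word sets of two summaries; a minimal sketch:

```python
def jaccard(a, b):
    """Jaccard coefficient between two summaries viewed as sets of words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

s1 = "the ioc ordered an investigation"
s2 = "the ioc ordered a probe"
print(round(jaccard(s1, s2), 2))  # 3 shared words out of 7 distinct -> 0.43
```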

Multilingual Document Summarization
Task Description We carried out an experiment on Multilingual Single-document Summarization (MSS). Our main goal is to prove empirically the effectiveness of word embeddings in the document summarization task across different languages. For this purpose, we evaluate our method on the MSS task proposed at MultiLing 2015 (Giannakopoulos et al., 2015), a special session at SIGDIAL 2015. Since 2011, the aim of the MultiLing community has been to promote cutting-edge research in automatic summarization by providing datasets and by introducing several pilot tasks to encourage further developments in single- and multi-document summarization and in summarizing human dialogs in online forums and customer call centers. The goal of the MSS 2015 task is to generate a single-document summary from a selection of some of the best-written Wikipedia articles, in at least one of the 38 languages defined by the organizers of the task. The dataset is divided into a training and a test set, both consisting of 30 documents for each of the 38 languages. For both sets, the body of the articles and the related abstracts, with their character length limits, are provided. Since the Wikipedia abstracts are summaries written by humans, they are useful for performing automatic evaluations. We evaluate our method on five different languages: English, Italian, German, Spanish and French.

BOW -Bag of Words baseline
The controversy centers on the payment of nearly dlrs 400,000 in scholarships to relatives of IOC members by the Salt Lake bid committee which won the right to stage the 2002 games. Pound said the panel would investigate allegations that "there may or may not have been payments for the benefit of members of the IOC or their families connected with the Salt Lake City bid." Samaranch said he was surprised at the allegations of corruption in the International Olympic Committee made by senior Swiss member Marc Hodler.

CBOW -Continuous Bag of Words trained on DUC-2004 dataset
Marc Hodler, a senior member of the International Olympic Committee executive board, alleged malpractices in the voting for the 1996 Atlanta Games, 2000 Sydney Olympics and 2002 Salt Lake Games. The IOC, meanwhile, said it was prepared to investigate allegations made by Hodler of bribery in the selection of Olympic host cities. The issue of vote-buying came to the fore in Lausanne because of the recent disclosure of scholarship payments made to six relatives of IOC members by Salt Lake City officials during their successful bid to play host to the 2002 Winter Games.

SKIP -Skip-gram trained on DUC-2004 dataset
Marc Hodler, a senior member of the International Olympic Committee executive board, alleged malpractices in the voting for the 1996 Atlanta Games, 2000 Sydney Olympics and 2002 Salt Lake Games. The IOC, meanwhile, said it was prepared to investigate allegations made by Hodler of bribery in the selection of Olympic host cities. Saying "if we have to clean, we will clean," Juan Antonio Samaranch responded on Sunday to allegations of corruption in the Olympic bidding process by declaring that IOC members who were found to have accepted bribes from candidate cities could be expelled.

GNEWS -Skip-gram trained on Google News dataset
The International Olympic Committee has ordered a top-level investigation into the payment of nearly dlrs 400,000 in scholarships to relatives of IOC members by the Salt Lake group which won the bid for the 2002 Winter Games. The mayor of the Japanese city of Nagano, site of the 1998 Winter Olympics, denied allegations that city officials bribed members of the International Olympic Committee to win the right to host the games. Swiss IOC executive board member Marc Hodler said Sunday he might be thrown out of the International Olympic Committee for making allegations of corruption within the Olympic movement.

Model Configuration In order to learn word embeddings for the different languages, we exploit five Wikipedia dumps, one for each chosen language (en, it, de, es, fr). We extract the plain text from the Wiki markup using Wikiextractor (https://github.com/attardi/wikiextractor/wiki), a Wikimedia parser written in Python. Each article is converted from UTF-8 to ASCII encoding using the Unidecode Python package. Since in the previous evaluation we observed similar behavior between the continuous bag-of-words and skip-gram models, in this evaluation we adopt only the skip-gram one, using the same training parameters (-hs 1 -min-count 10 -window 8 -negative 5 -iter 5) for all five languages. Table 7 reports the statistics of the resulting corpora and vocabularies. For the MSS task, we performed the tuning of parameters using only the training set. To find the best topic and similarity threshold parameters, we run a grid search as explained in Section 4.1, performed for each language separately using both the BOW and skip-gram representations. The resulting parameter configurations are in line with those of the previous experiment on DUC-2004. In detail, the topic thresholds are in the range [0.1, 0.2] for the BOW model and in the range [0.3, 0.5] for word embeddings, while the similarity thresholds are slightly higher than in the multi-document experiment: about 0.7 and 0.95 for BOW and skip-gram, respectively. This is due to the fact that overly similar sentences are rare, especially in well-written documents such as Wikipedia articles. The best parameter configuration for each language is used to generate summaries for the documents in the test set. Also for this task, each document is preprocessed with sentence segmentation and stopword removal, without stemming. We adopt the same automatic evaluation metrics used by the participating systems in the MSS 2015 task: ROUGE-1, -2 and -SU4. ROUGE-SU4 computes the score between the generated and human summaries considering the overlap of skip-bigrams with a maximum gap of four words, as well as unigrams. Finally, the generated summary for each document must comply with a document-specific length constraint, rather than a unique length limit for the whole collection as in the multi-document experiment.
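As a sketch of what ROUGE-SU4 counts, the skip-bigrams of a token sequence with at most four intervening words can be enumerated as follows; this is a simplified illustration of the matched units, not the official ROUGE implementation.

```python
from itertools import combinations

def skip_bigrams(tokens, max_gap=4):
    """Ordered word pairs with at most `max_gap` words between them."""
    return {(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2)
            if j - i <= max_gap + 1}

sb = skip_bigrams("the ioc ordered a top level investigation".split())
# ("the", "level") has 4 words in between (allowed);
# ("the", "investigation") has 5 (not counted).
print(("the", "level") in sb, ("the", "investigation") in sb)  # True False
```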

Results and Discussion
The results for each language are shown in Table 8, where we report the ROUGE-1 and -2 scores for each chosen language. LEAD and C BOW are the same baselines used in the multi-document experiment: the former uses the initial text of each article, truncated to the length of the Wikipedia abstract; the latter is the centroid-based method with the BOW representation. Our method using word embeddings learned with the skip-gram model is labeled C W2V. For each metric and language we also report the WORST and BEST scores obtained by the 23 systems participating in the MSS 2015 task. Finally, the ORACLE scores can be considered an upper-bound approximation for extractive summarization methods: they are obtained with a covering algorithm (Davis et al., 2012) that selects sentences from the original text covering the words in the summary while respecting the length limit. We highlight in bold the scores of our method when it outperforms the baseline C BOW; the superscripts † and ‡ indicate a better performance of our method with respect to the WORST and BEST scores, respectively.
Both centroid-based methods outperform the simple LEAD baseline for all languages. Our method always achieves better scores than the BOW model, except on the ROUGE-2 metric for English. This confirms the effectiveness of word embeddings as an alternative sentence representation able to capture the semantic similarities between the centroid words and the sentences when summarizing single documents too. Moreover, our method substantially outperforms the lowest-scoring systems participating in the MSS 2015 task for the English and German languages, and for English it obtains a ROUGE-1 score even better than that of the best system in MSS 2015. On the other hand, our method fails in summarizing Italian documents, and it achieves the worst ROUGE-2 scores for the Spanish and French languages. The reason may lie in the size of the Wikipedia dumps used to learn the word embeddings for the different languages. As shown in Table 7, the sizes of the various corpora, as well as the ratios between the number of words and the dimension of the vocabularies, differ considerably. The English version of Wikipedia consists of nearly 2 billion words, against about 300 million words for the Italian one. Thus, according to the distributional hypothesis (Harris, 1954), we expect better performance for our method when summarizing English or German articles than for the other languages, whose word embeddings are learned from a smaller corpus. Our results, and in particular the ROUGE-SU4 scores reported in Figure 2, support this hypothesis.

Conclusion
In this paper, we propose a centroid-based method for extractive summarization which exploits the compositional capability of word embeddings. One of the advantages of our method lies in its simplicity; indeed, it can be used as a baseline when experimenting with new, more articulate semantic representations in summarization tasks. Moreover, following the idea of representation learning, it is possible to infuse knowledge by training the word embeddings on external sources. Finally, the proposed method is fully unsupervised, so it can be adopted in other summarization tasks, such as query-based document summarization. As future work, we plan to evaluate the centroid-based summarization method using a topic model, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) or Non-negative Matrix Factorization (NMF) (Berry et al., 2007), in order to extract the meaningful words used to compute the centroid embedding, as well as to carry out a comprehensive comparison of different sentence representations using more complex neural language models (Le and Mikolov, 2014; Zhang and LeCun, 2015; Józefowicz et al., 2016). Finally, the combination of distributional and relational semantics (Fried and Duh, 2014; Verga and McCallum, 2016; Rossiello, 2016) applied to extractive text summarization is a promising direction that we want to investigate.