Summarization Based on Embedding Distributions

In this study, we consider a summarization method using the document-level similarity based on embeddings, or distributed representations of words, where we assume that an embedding of each word can represent its “meaning.” We formalize our task as the problem of maximizing a submodular function defined by the negative summation of the nearest neighbors’ distances on embedding distributions, each of which represents a set of word embeddings in a document. We prove the submodularity of our objective function and show that our problem is asymptotically related to the KL-divergence between the probability density functions that correspond to a document and its summary in a continuous space. An experiment using a real dataset demonstrated that our method performed better than the existing method based on sentence-level similarity.


Introduction
Document summarization aims to rephrase a document in a short form called a summary while keeping its "meaning." In the present study, we aim to characterize the meaning of a document using embeddings or distributed representations of words in the document, where an embedding of each word is represented as a real-valued vector in a Euclidean space that corresponds to the word (Mikolov et al., 2013a; Mikolov et al., 2013b).
Many previous studies have investigated summarization (Lin and Bilmes, 2010; Lin and Bilmes, 2011; Lin and Bilmes, 2012; Sipos et al., 2012; Morita et al., 2013), but to the best of our knowledge, only one (Kågebäck et al., 2014) considered a direct summarization method using embeddings, where the summarization problem was formalized as maximizing a submodular function defined by the summation of cosine similarities based on sentence embeddings. Essentially, this method assumes that meanings are linear, since the objective function is characterized by the summation of sentence-level similarities. However, this assumption is not always valid in real documents: a combination of two other sentences may summarize a document better than the pair consisting of the best and second-best sentences in terms of similarity.
In this study, we consider a summarization method based on document-level similarity, where we assume the non-linearity of meanings. First, we examine an objective function defined by a cosine similarity based on document embeddings instead of sentence embeddings. Unfortunately, contrary to our intuition, this similarity is not submodular, as we show later. Thus, we propose a valid submodular function based on embedding distributions, each of which represents a set of word embeddings in a document, as the document-level similarity. Our objective function is calculated based on the nearest neighbors' distances on embedding distributions, which can be proved to be asymptotically related to KL-divergence in a continuous space. Several studies (Lerman and McDonald, 2009; Haghighi and Vanderwende, 2009) have addressed summarization using KL-divergence, but they calculated KL-divergence based on word distributions in a discrete space. In other words, our study is the first attempt to summarize by asymptotically estimating KL-divergence based on embedding distributions in a continuous space. In addition, those methods involve the inference of complex models, whereas our method is quite simple but still powerful.

Preliminaries
We treat a document as a bag-of-sentences and a sentence as a bag-of-words. Formally, let $D$ be a document; we refer to an element $s \in D$ as a sentence and to an element $w \in s$ as a word. We denote the size of a set $S$ by $|S|$. Note that $D$ and $s$ are defined as multisets. For example, we can define a document such as $D := \{s_1, s_2\}$ with $s_1 := \{just, do, it\}$ and $s_2 := \{never, say, never\}$, which correspond to the two sentences "Just do it" and "Never say never," respectively. From the definition, we have $|s_1| = 3$ and $|s_2| = 3$.

Submodularity
Submodularity is a property of set functions, which is similar to the convexity or concavity of continuous functions.
We formally define submodularity as follows.
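Definition 1 (Submodularity). Given a set $X$, a set function $f : 2^X \to \mathbb{R}$ is called submodular if for any two sets $S_1$ and $S_2$ such that $S_1 \subset S_2 \subset X$ and element $x \in X \setminus S_2$,
\[ f(S_1 \cup \{x\}) - f(S_1) \ge f(S_2 \cup \{x\}) - f(S_2). \]
For simplicity, we write $f_S(x) := f(S \cup \{x\}) - f(S)$ for the marginal value of $x$ with respect to $S$. A set function $f$ is called monotone if $f_S(x) \ge 0$ for any set $S \subset X$ and element $x \in X \setminus S$.
If a set function $f$ is monotone submodular, we can approximate the optimal solution efficiently by a simple greedy algorithm, which iteratively selects $x^* = \mathrm{argmax}_{x \in X \setminus S_i} f_{S_i}(x)$, where ties are broken arbitrarily, and substitutes $S_{i+1} = S_i \cup \{x^*\}$ in the $i$-th iteration, beginning with $S_0 = \emptyset$. This algorithm is quite simple, but it is guaranteed to find a near optimal solution within a factor of $1 - 1/e \approx 0.63$ of the optimum (Calinescu et al., 2007).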
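The greedy procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: the stopping rule (a fixed number of elements `k`), the function names, and the toy coverage function `coverage` are assumptions introduced only for the example. Counting distinct covered items is a standard example of a monotone submodular function.

```python
def greedy_maximize(ground_set, f, k):
    """Pick up to k elements, each time adding the one with the largest marginal value."""
    selected = []
    remaining = list(ground_set)
    for _ in range(min(k, len(remaining))):
        current = f(frozenset(selected))
        # Marginal value f_S(x) = f(S ∪ {x}) − f(S); ties go to the first candidate.
        best = max(remaining, key=lambda x: f(frozenset(selected) | {x}) - current)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: each element covers a set of items.
COVERS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}


def coverage(subset):
    """Number of distinct items covered by the chosen elements."""
    covered = set()
    for name in subset:
        covered |= COVERS[name]
    return len(covered)


summary = greedy_maximize(COVERS.keys(), coverage, k=2)
print(summary, coverage(summary))
```

In this toy instance the marginal value of "b" shrinks as the selected set grows (it adds one new item to {"a"} but none to {"a", "c"}), which is the diminishing-returns behavior that Definition 1 formalizes.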

Embedding
An embedding or distributed representation of a word is a real-valued vector in an $m$-dimensional Euclidean space $\mathbb{R}^m$, which expresses the "meaning" of the word. We denote an embedding of a word $w$ by $\vec{w} \in \mathbb{R}^m$. For any two words $w_1$ and $w_2$, if the meaning of $w_1$ is similar to that of $w_2$, then $\vec{w}_1$ is expected to be near to $\vec{w}_2$.
A recent study (Mikolov et al., 2013a) showed that a simple log-bilinear model can learn high quality embeddings to obtain a better result than recurrent neural networks, where the concept of embeddings was originally proposed in studies of neural language models (Bengio et al., 2003). In the present study, we use the CW Vector and W2V Vector, which are also used in the previous study (Kågebäck et al., 2014).

Proposed Method
In this study, we focus on a summarization task as sentence selection in a document. The optimization framework in our task is the same as in the previous study and formalized in Algorithm 1, where $w_s$ represents the pre-defined weight or cost of a sentence $s$, e.g., sentence length, and $r$ is its scaling factor. This algorithm, called modified greedy, was proposed by Lin and Bilmes (2010) and interestingly performed better than the state-of-the-art abstractive approach, as shown by Lin and Bilmes (2011). Note that we have omitted the notation of $D$ from $f$ for simplicity because $D$ is fixed in an optimization process.
Data: Document $D$, objective function $f$, and summary size.
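The body of Algorithm 1 is not reproduced in this text. The sketch below shows one plausible reading of a cost-scaled greedy selection in the spirit of the modified greedy algorithm of Lin and Bilmes (2010): candidates are ranked by marginal value divided by cost raised to the power r, a candidate is added only if it fits in the budget, and the result is finally compared with the best single sentence. The function and parameter names (`modified_greedy`, `cost`, `budget`) and the toy objective `distinct_words` are assumptions made for illustration.

```python
def modified_greedy(sentences, f, cost, budget, r=1.0):
    """Cost-scaled greedy selection followed by a best-singleton check (a sketch)."""
    chosen = []
    spent = 0
    candidates = list(sentences)
    while candidates:
        base = f(chosen)
        # Rank by marginal value divided by cost ** r; cost must be positive.
        best = max(candidates, key=lambda s: (f(chosen + [s]) - base) / cost(s) ** r)
        if spent + cost(best) <= budget:
            chosen.append(best)
            spent += cost(best)
        candidates.remove(best)
    # Fall back to the best single affordable sentence if it scores higher.
    affordable = [s for s in sentences if cost(s) <= budget]
    if affordable:
        single = max(affordable, key=lambda s: f([s]))
        if f([single]) > f(chosen):
            return [single]
    return chosen


def distinct_words(selection):
    """Hypothetical toy objective: number of distinct words covered."""
    return len({word for sentence in selection for word in sentence})


s0 = ("the", "room", "was", "clean")
s1 = ("clean", "room")
s2 = ("great", "location")
s3 = ("the", "location", "was", "great", "and", "quiet")

# Sentence length plays the role of the weight w_s; the budget is the summary size.
result = modified_greedy([s0, s1, s2, s3], distinct_words, cost=len, budget=6, r=1.0)
print(result)
```

Setting r = 0 ignores sentence cost during ranking, while larger values of r favor short sentences more strongly.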

Similarity Based on Document Embeddings
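First, we examine an objective function $f^{\mathrm{Cos}}$ defined by a cosine similarity based on document embeddings. An embedding of a document $D$ is defined as $v_D := \sum_{s \in D} \sum_{w \in s} \vec{w}$. We formalize the objective function $f^{\mathrm{Cos}}$ as follows.
\[ f^{\mathrm{Cos}}(C) := \frac{v_C \cdot v_D}{\|v_C\| \, \|v_D\|} \]
Note that the optimal solution does not change if we use an average embedding $v_D / \sum_{s \in D} |s|$ instead of $v_D$, because the cosine similarity is invariant to positive scaling. The next theorem shows that a solution of $f^{\mathrm{Cos}}$ by Algorithm 1 is not guaranteed to be near optimal.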
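To make the failure of submodularity concrete, the sketch below evaluates the cosine objective on a small hand-made instance. The three one-word sentences and their 2-dimensional embeddings are hypothetical values chosen for illustration, not data from this study, and the helper names are assumptions. Submodularity would require the marginal value of a sentence to be no larger for the bigger set, but here it is larger.

```python
import numpy as np


def doc_embedding(sentences):
    """Sum of all word vectors over all sentences (v_D in the text)."""
    return np.sum([np.sum(s, axis=0) for s in sentences], axis=0)


def f_cos(summary, document):
    """Cosine similarity between the summary embedding and the document embedding."""
    v_c, v_d = doc_embedding(summary), doc_embedding(document)
    return float(v_c @ v_d / (np.linalg.norm(v_c) * np.linalg.norm(v_d)))


# Hypothetical document: three one-word sentences, one row per word.
a = np.array([[1.0, 0.1]])
b = np.array([[0.0, 1.0]])
c = np.array([[-1.0, 0.1]])
D = [a, b, c]

# Marginal value of c with respect to S1 = {b} and S2 = {a, b}, where S1 ⊂ S2.
gain_small = f_cos([b, c], D) - f_cos([b], D)
gain_large = f_cos([a, b, c], D) - f_cos([a, b], D)
print(gain_small, gain_large)
```

Adding c to {b} pulls the summary embedding away from the document embedding, whereas adding c to {a, b} cancels the horizontal component contributed by a, so the gain is larger for the larger set.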

Similarity Based on Embedding Distributions
We propose a valid submodular objective function $f^{\mathrm{NN}}$ based on embedding distributions. The key observation is that for any two embedding distributions $A$ and $B$, when $A$ is similar to $B$, each embedding in $A$ should be near to some embedding in $B$. In order to formalize this idea, we define the nearest neighbor of a word $w$ in a summary $C$ as
\[ n(w, C) := \mathrm{argmin}_{v \in s : s \in C} \; d(\vec{w}, \vec{v}), \]
where $d$ is the Euclidean distance in the embedding space. We denote the distance of $w$ to its nearest neighbor $n := n(w, C)$ by $N(w, C) := d(\vec{w}, \vec{n})$. Finally, we define $f^{\mathrm{NN}}$ as follows:
\[ f^{\mathrm{NN}}(C) := - \sum_{s \in D} \sum_{w \in s} g(N(w, C)), \]
where $g$ is a non-decreasing scaling function. The function $f^{\mathrm{NN}}$ represents the negative value $-\delta$ of dissimilarity $\delta$ between a document and summary based on embedding distributions. Note that we can use sentence embeddings instead of word embeddings as embedding distributions, although we focus on word embeddings in this section. The next theorem shows the monotone submodularity of our objective function, which means that a solution of $f^{\mathrm{NN}}$ by Algorithm 1 is guaranteed to be near optimal.
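Theorem 2. $f^{\mathrm{NN}}$ is monotone submodular.
Proof. (Monotonicity) First, we prove the monotonicity. For simplicity, we write $C_s := C \cup \{s\}$ and denote the marginal value of a sentence $s$ by $f^{\mathrm{NN}}_C(s) := f^{\mathrm{NN}}(C_s) - f^{\mathrm{NN}}(C)$, which equals $\sum_{t \in D} \sum_{w \in t} (g(N(w, C)) - g(N(w, C_s)))$. Since $C \subset C_s$, obviously $N(w, C) \ge N(w, C_s)$ holds.
Therefore, we obtain $f^{\mathrm{NN}}_C(s) \ge 0$ from the non-decreasing property of $g$.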
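The nearest-neighbor objective and the monotonicity argument above can be checked numerically with a short sketch. It assumes Euclidean distance, a linear scaling function g(x) = x by default, and randomly generated 5-dimensional vectors standing in for word embeddings; the function names and the data are illustrative assumptions, not the implementation used in this study.

```python
import numpy as np


def nn_distances(doc_words, summary_words):
    """N(w, C) for every document word: distance to the nearest summary word."""
    diffs = doc_words[:, None, :] - summary_words[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)


def f_nn(summary, document, g=lambda x: x):
    """Negative sum of scaled nearest-neighbor distances; summary must be non-empty."""
    distances = nn_distances(np.vstack(document), np.vstack(summary))
    return -float(np.sum(g(distances)))


# Hypothetical document: four sentences with 3, 4, 2, and 5 word vectors.
rng = np.random.default_rng(0)
document = [rng.normal(size=(n, 5)) for n in (3, 4, 2, 5)]

C = [document[0]]
C_s = C + [document[2]]  # C ∪ {s}
gain = f_nn(C_s, document) - f_nn(C, document)
print(gain)  # non-negative: adding a sentence can only shrink nearest-neighbor distances
```

Passing `g=np.exp` gives the exponential scaling. With `g=np.log`, a document word that also appears in the summary has distance 0 and a logarithm of negative infinity, so some additional handling would be needed; the text here does not specify one.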
The objective function $f^{\mathrm{NN}}$ is simply a heuristic for small documents, but the next theorem shows that $f^{\mathrm{NN}}$ is asymptotically related to an approximation of KL-divergence in a continuous space if $g$ is a logarithmic function. This result implies that we can use mathematical techniques of a continuous space for different NLP tasks by mapping a document into a continuous space based on word embeddings.
Theorem 3. Suppose that we have a document $D$ and two summaries $C_1$ and $C_2$ such that $|C_1| = |C_2|$, which are samples drawn from some probability density functions $p$, $q$, and $r$, i.e., $D \sim p$, $C_1 \sim q$, and $C_2 \sim r$, respectively. If the scaling function $g$ of $f^{\mathrm{NN}}$ is a logarithmic function, the order relation of the expectations of $f^{\mathrm{NN}}(C_1)$ and $f^{\mathrm{NN}}(C_2)$ is asymptotically the same as that of the KL-divergences $D_{\mathrm{KL}}(p \,\|\, r)$ and $D_{\mathrm{KL}}(p \,\|\, q)$.
Proof. Let $m$ be the dimension of the embeddings. Using a divergence estimator based on nearest neighbor distances in (Pérez-Cruz, 2009; Wang et al., 2009)
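The remainder of the proof is not reproduced in this text. The following is a sketch of how the argument presumably proceeds, assuming the standard 1-nearest-neighbor form of the divergence estimator. The symbols $n$ (number of word embeddings in $D$), $k$ (number of embeddings in each summary), $\rho$, and $\nu$ are introduced here for illustration, and the original derivation may differ in detail.

```latex
% Assumed 1-nearest-neighbor KL estimator for samples w_1..w_n ~ p and k samples ~ q.
% \rho(\vec{w}_i): distance from \vec{w}_i to its nearest neighbor among the other document words.
% \nu_q(\vec{w}_i): distance from \vec{w}_i to its nearest neighbor in the sample from q,
%                   i.e. N(w_i, C_1) in the notation of this section.
\[
\hat{D}(p \,\|\, q)
  = \frac{m}{n} \sum_{i=1}^{n} \ln \frac{\nu_q(\vec{w}_i)}{\rho(\vec{w}_i)}
  + \ln \frac{k}{n-1}
\]
% \rho and the last term do not depend on which summary is used (both have k points),
% so they cancel in the difference. With g = \ln:
\[
\hat{D}(p \,\|\, r) - \hat{D}(p \,\|\, q)
  = \frac{m}{n} \sum_{i=1}^{n} \bigl( \ln N(w_i, C_2) - \ln N(w_i, C_1) \bigr)
  = \frac{m}{n} \bigl( f^{\mathrm{NN}}(C_1) - f^{\mathrm{NN}}(C_2) \bigr)
\]
```

Under these assumptions the sign of $f^{\mathrm{NN}}(C_1) - f^{\mathrm{NN}}(C_2)$ matches the sign of the estimated $D_{\mathrm{KL}}(p \,\|\, r) - D_{\mathrm{KL}}(p \,\|\, q)$, which is the order relation stated in Theorem 3.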

Experiments
We compared our two proposed methods DocEmb and EmbDist with two state-of-the-art methods SenEmb and TfIdf. The first two methods DocEmb and EmbDist represent Algorithm 1 with our proposed objective functions f Cos and f NN , respectively. TfIdf represents Algorithm 1 with an objective function based on the sum of cosine similarities of tf-idf vectors that correspond to sentences, which was proposed in (Lin and Bilmes, 2011). SenEmb uses a cosine similarity measure based on embeddings instead of tf-idf vectors in the same framework as TfIdf, which was proposed in (Kågebäck et al., 2014).
We conducted an experiment with almost the same setting as in the previous study, which used the Opinosis dataset (Ganesan et al., 2010). This dataset is a collection of user reviews on 51 different topics such as hotels, cars, and products; thus, it is more appropriate for evaluating summarization of user-generated content than the well-known DUC datasets, which consist of formal news articles. Each topic in the collection comprises 50-575 sentences and includes four or five gold standard summaries created by human authors, each of which comprises 1-3 sentences.
We ran an optimization process to choose sentences within 100 words by setting the summary size to 100 and the weights to $w_s = |s|$ for any sentence $s$. As for TfIdf and SenEmb, we set the cluster size of k-means to $k = |D|/5$ and chose the best values for the threshold coefficient $\alpha$, the trade-off coefficient $\lambda$, and the scaling factor $r$, as in (Lin and Bilmes, 2011). Note that our functions DocEmb and EmbDist have only one parameter $r$, and we similarly chose the best value of $r$. Regarding DocEmb, EmbDist, and SenEmb, we used the best embeddings from the CW Vector and W2V Vector for each method, and created document and sentence embeddings by averaging word embeddings with tf-idf weights, since this performed better in this experiment. In the case of EmbDist, we used a variant of $f^{\mathrm{NN}}$ based on distributions of sentence embeddings. In addition, we examined three scaling functions: logarithmic, linear, and exponential, i.e., $\ln x$, $x$, and $e^x$, respectively.
We calculated the ROUGE-N metric (Lin, 2004), which is a widely used evaluation metric for summarization methods. ROUGE-N is based on the co-occurrence statistics of N-grams, and ROUGE-1 in particular has been shown to have the highest correlation with human summaries (Lin and Hovy, 2003). ROUGE-N is similar to the BLEU metric for machine translation, but ROUGE-N is a recall-based metric while BLEU is a precision-based metric.
Table 1 shows the results obtained for ROUGE-N (N ≤ 4) using DocEmb, EmbDist, SenEmb, and TfIdf. ApxOpt represents the approximation results of the optimal solution in our problem, where we optimized ROUGE-1 with the gold standard summaries by Algorithm 1. The obtained results indicate that our proposed method EmbDist with exponential scaling performed the best for ROUGE-1, which is the best metric in terms of correlation with human summaries. The W2V Vector was the best choice for EmbDist.
Furthermore, the other proposed method DocEmb performed better than the state-of-the-art methods SenEmb and TfIdf, although DocEmb is not theoretically guaranteed to obtain a near optimal solution. These results imply that our methods based on the document-level similarity can capture more complex meanings than the sentence-level similarity. On the other hand, TfIdf with tf-idf vectors performed the worst for ROUGE-1. A possible reason is that the wide variety of expressions used by users made it difficult to calculate similarities. This also suggests that embedding-based methods are naturally robust to user-generated content.
In the case of N ≥ 2, TfIdf performed the best for ROUGE-2 and ROUGE-3, while EmbDist with logarithmic scaling was better than TfIdf for ROUGE-4. According to (Lin and Hovy, 2003), the higher order ROUGE-N is worse than ROUGE-1 since it tends to score grammaticality rather than content. Conversely, Rankel et al. (2013) report that there is a dataset where the higher order ROUGE-N correlates well with human summaries. We may need to conduct human judgments to decide which metric is the best in this dataset for a more accurate comparison. However, it is still important that our simple objective functions can obtain results that are competitive with the state-of-the-art methods.
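As a reminder of what the ROUGE-N scores discussed in this section measure, the recall-based N-gram overlap can be sketched as follows. This is a simplified illustration with hypothetical example sentences and function names; the official ROUGE toolkit offers further options (such as stemming and stop-word removal) that are not reproduced here, and the exact settings used in this study are not stated in this text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, references, n=1):
    """Clipped n-gram matches divided by the total n-gram count of the references."""
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())
        # Each reference n-gram is matched at most as often as it occurs in the candidate.
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
    return matched / total if total else 0.0


# Hypothetical candidate summary and gold standard summaries.
candidate = "the room was clean and quiet".split()
references = ["the room was very clean".split(), "clean and quiet room".split()]

score = rouge_n(candidate, references, n=1)
print(score)
```

Because the denominator counts reference N-grams, the score rewards covering the content of the gold standard summaries rather than penalizing extra words in the candidate, which is why the summary length is capped separately.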

Conclusion
In this study, we proposed simple but powerful summarization methods using the document-level similarity based on embeddings, or distributed representations of words. Our experimental results demonstrated that the proposed methods performed better than the existing state-of-the-art methods based on the sentence-level similarity. This implies that the document-level similarity can capture more complex meanings than the sentence-level similarity.
Recently, Kusner et al. (2015) independently discovered a similar definition to our objective function $f^{\mathrm{NN}}$ through a different approach. They constructed a dissimilarity measure based on a framework using Earth Mover's Distance (EMD) developed in the image processing field (Rubner et al., 1998; Rubner et al., 2000). EMD is a consistent measure of distance between two distributions of points. Interestingly, their heuristic lower bound of EMD is exactly the same as $-f^{\mathrm{NN}}$ with a linear scaling function, i.e., $g(x) = x$. Moreover, they showed that this bound appears to be tight in real datasets. This suggests that our intuitive framework can theoretically connect the two well-known measures, KL-divergence and EMD, based on the scaling of distance. Note that, to the best of our knowledge, there is currently no known study that considers such a theoretical relationship.
In future research, we will explore other scaling functions suitable for our problem or different problems. A promising direction is to consider a relative scaling function to extract a biased summary of a document. This direction should be useful for query-focused summarization tasks.