NMF Ensembles? Not for Text Summarization!

Non-negative Matrix Factorization (NMF) has been used for text analytics with promising results. Instability of results, arising from stochastic variations during initialization, makes a case for the use of ensemble methods. However, our extensive empirical investigation indicates otherwise. In this paper, we establish that the ensemble summary of a single document using NMF is no better than the best base model summary.


Introduction
Non-negative Matrix Factorization (NMF) has demonstrated promise in text analytics tasks like topic modeling (Suh et al., 2017; Qiang et al., 2018; Belford et al., 2018), document summarization (Lee et al., 2009; Khurana and Bhatnagar, 2019) and document clustering (Shahnaz et al., 2006; Shinnou and Sasaki, 2007). The method finds favour due to the presence of non-negative elements in the resultant factor matrices, which enhance intuitive understanding of the underlying latent semantic structure of the text (Lee and Seung, 1999).
Recent applications of ensemble methods for NMF based topic modeling have shown considerable promise (Suh et al., 2017; Qiang et al., 2018; Belford et al., 2018). These observations drive our motivation for exploring NMF ensembles for the document summarization task.

NMF for Text Summarization
Consider a (pre-processed) document D consisting of n sentences (S_1, S_2, ..., S_n) and m terms (t_1, t_2, ..., t_m), represented by a Boolean term-sentence matrix A of size m × n. NMF decomposition of A results in two non-negative factor matrices W and H, where W is the m × r term-topic (feature) matrix and H is the r × n topic-sentence (coefficient) matrix, with r ≪ min{m, n}.
Columns in W correspond to document topics, represented as τ_1, τ_2, ..., τ_r in the latent semantic space, and columns in H represent sentences in D. Element w_ij in W signifies the contribution of term t_i to topic τ_j, and element h_ij in H denotes the strength of topic τ_i in sentence S_j. Deft manipulation of the elements of the two factor matrices yields distinctive sentence scores (Lee et al., 2009; Khurana and Bhatnagar, 2019). Top scoring sentences are selected to generate a summary of the desired length.

Instability of NMF for Text Summarization: Even though NMF based automatic text summarization is unsupervised and carries the advantages of language, domain and collection independence, it has been used only sporadically for summarization. The reason can be linked to stochastic variations in the factor matrices due to random initialization.
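This decomposition and a generic score manipulation can be sketched as follows. The scoring rule shown is one simple illustrative possibility, not the specific schemes of Lee et al. (2009) or Khurana and Bhatnagar (2019), and the data is a random stand-in:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Stand-in Boolean term-sentence matrix A (m=40 terms x n=10 sentences).
A = (rng.random((40, 10)) > 0.7).astype(float)

r = 3  # number of latent topics, r << min(m, n)
nmf = NMF(n_components=r, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(A)   # m x r term-topic matrix
H = nmf.components_        # r x n topic-sentence matrix

# One simple scoring rule: weight each topic by its overall term mass,
# then score sentence j as the weighted sum of its topic strengths h_ij.
topic_weight = W.sum(axis=0)      # overall strength of each topic
scores = topic_weight @ H         # one score per sentence

k = 3  # desired summary length (number of sentences)
summary_idx = np.argsort(scores)[::-1][:k]  # top scoring sentences
```

Because W and H are non-negative, the scores are non-negative and directly interpretable as accumulated topic strength, which is the intuitive appeal noted above.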
Repeated NMF factorization of the input term-sentence matrix results in different sentence scores, generating different summaries. This ambivalence renders the resulting NMF summaries dubitable. In the authors' opinion, this has retarded development in this line of research. Lee et al. (2009) suggested a simplistic fix to this problem by using static initialization for W and H. Our experiment on the DUC2002 data-set with varying initial seed values for W and H shows that the best initialization value is document specific (Fig. 1). Hence, fixed initialization of factor matrices is not a prudent idea.
Another fix for the problem is to use NNDSVD (Non-negative Double Singular Value Decomposition (Boutsidis and Gallopoulos, 2008)) based initialization for the NMF factor matrices. Our earlier work (Khurana and Bhatnagar, 2019) establishes that this initialization method improves the summary quality over fixed initialization for several benchmark data-sets.
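A minimal sketch of NNDSVD-based initialization, with stand-in data. Note that classic NNDSVD is SVD-based and deterministic; scikit-learn's implementation uses a randomized SVD internally, so we pin its random_state to recover fully repeatable factors:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
# Stand-in Boolean term-sentence matrix (30 terms x 8 sentences).
A = (rng.random((30, 8)) > 0.7).astype(float)

def factorize(A, r=3):
    # init="nndsvd" seeds W and H from an SVD of A instead of random noise;
    # with a pinned random_state the factorization is fully repeatable.
    nmf = NMF(n_components=r, init="nndsvd", random_state=0, max_iter=500)
    W = nmf.fit_transform(A)
    return W, nmf.components_

W1, H1 = factorize(A)
W2, H2 = factorize(A)
# Repeated factorizations agree, so sentence scores (and summaries) are stable.
stable = np.allclose(W1, W2) and np.allclose(H1, H2)
```

This stability is exactly what random initialization lacks, and is why NNDSVD removes the ambivalence in repeated summaries.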

NMF Ensembles
Random initialization has been exploited by clustering and topic modeling researchers to create ensembles (Greene et al., 2008; Belford et al., 2018; Qiang et al., 2018). Ensembling is a machine learning technique that combines multiple varying base models to construct a consensus model, which is expected to perform better than the individual base models.
Effectiveness of NMF ensembles in text analytics, specifically in topic modeling (Belford et al., 2018; Qiang et al., 2018), motivated the current research. We initiated the study with the aim of leveraging the stochastic variations in NMF factors, and the resulting diverse base summaries, to combine them into a stable ensemble summary.
Extrapolating from earlier studies, we expected that an NMF ensemble summary, smoothed over multifarious summaries obtained from randomly initialized NMF factors, would accomplish higher ROUGE scores. However, our investigations establish that NMF ensembles are not effective. Rather, despite the heavy overhead of creating multiple base models and combining them, NMF ensembles often perform worse than the best base model for single document extractive summarization.

NMF Ensembles for Text Summarization
Ensemble methods are employed in supervised, semi-supervised and unsupervised learning settings. In all scenarios, they comprise two phases.
In the first phase, diverse base models are generated. Diversity among models is recognized to be the key factor for improvement (Kuncheva and Hadjitodorov, 2004), and is commonly sourced from variations in the base algorithm, algorithmic parameters, or the data itself.
In the second phase, multiple base models are combined using a consensus (aka integration) function. The wide variety of choices for creating diversity and combining base models gives rise to numerous possibilities for creating ensembles (Zhou, 2012).

Generation of diverse base models
Repeated application of NMF on the term-sentence matrix leads to the generation of multiple base models. In the present context, we achieve diversity using two methods: (i) repeated factorization of A using NMF with random initial seed values for W and H, and (ii) repeated factorization varying the number of latent topics into which the document is decomposed, as suggested in (Greene et al., 2008). The latter strategy implicitly embeds the variations that arise out of random initialization.
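The two diversity-generation methods can be sketched as follows (stand-in data; variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(42)
# Stand-in Boolean term-sentence matrix (40 terms x 12 sentences).
A = (rng.random((40, 12)) > 0.7).astype(float)
r = 3  # base number of latent topics

# (i) Diversity from random initialization: same rank, different seeds.
seed_models = []
for seed in range(10):
    nmf = NMF(n_components=r, init="random", random_state=seed, max_iter=300)
    W = nmf.fit_transform(A)
    seed_models.append((W, nmf.components_))

# (ii) Diversity from the number of latent topics, varied over [r, 2r];
# unpinned random initialization is implicitly embedded here as well.
rank_models = []
for topics in range(r, 2 * r + 1):
    nmf = NMF(n_components=topics, init="random", max_iter=300)
    W = nmf.fit_transform(A)
    rank_models.append((W, nmf.components_))
```

Each base model then yields its own sentence scores, and hence its own base summary.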
In practice, variation in the choice of NMF solvers and initialization methods is also a source of diversity. We, however, refrain from following this direction because of weak scientific grounds.

Combining NMF base models
We examine six combining methods, in increasing order of complexity, to generate consensus summaries. The first three are simple aggregation methods, where sentence scores are combined directly. The next two methods are based on rank manipulation of the scored sentences. Finally, we use Stacking, which is a sophisticated combining method (Zhou, 2012; Belford et al., 2018).

i. Average: We calculate the consensus score of each sentence in the document by averaging sentence scores over all base models, and use it for summary sentence selection.

ii. Median: Since the average is sensitive to outliers, we calculate the median score of each sentence across all base models and use it for summary sentence selection.

iii. Quartile: We obtain the consensus score by considering the third quartile of the sentence scores across all base models, and use it for summary sentence selection.

iv. Voting: We rank sentences based on their scores. The consensus rank of a sentence is its most frequent rank (majority) amongst the base models. Top ranking sentences are selected for the summary.

v. Ranking: We count the number of times a sentence appears among the top k scoring sentences (k is the desired number of sentences). Sentences are then ranked based on the frequency of appearing among the top-k, and top scoring sentences are included in the summary.

vi. Stacking: Stacking is a well established combining method, which combines base level models to create a meta-training set (Zhou, 2012). Subsequently, the ensemble model is trained on this meta-training set. We stack the topic-term (W^T) matrices (base level models) as the meta-training set for producing the stacked ensemble (Belford et al., 2018). The stacked matrix is factorized using NMF with NNDSVD initialization to obtain the ensembled topic-term matrix, which along with A is used for scoring sentences using the term-oriented sentence selection method NMF-TR, proposed in (Khurana and Bhatnagar, 2019). Finally, the top-k scoring sentences are included in the summary.
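The simple aggregation and ranking-based combiners can be sketched as follows (made-up base-model scores; Voting and Stacking are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_sents, k = 20, 10, 3
# Stand-in sentence scores: one row per base model, one column per sentence.
S = rng.random((n_models, n_sents))

# (i)-(iii) Simple aggregation: combine raw scores directly.
avg_scores = S.mean(axis=0)
med_scores = np.median(S, axis=0)
q3_scores = np.percentile(S, 75, axis=0)

# (v) Ranking: count how often each sentence lands in a model's top-k.
topk = np.argsort(S, axis=1)[:, ::-1][:, :k]          # top-k per base model
freq = np.bincount(topk.ravel(), minlength=n_sents)   # appearance counts
rank_summary = np.argsort(freq)[::-1][:k]

# Consensus summary (here from averaged scores): top-k sentence indices.
avg_summary = np.argsort(avg_scores)[::-1][:k]
```

In each case the consensus score or rank replaces the individual base-model scores before the usual top-k sentence selection.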

Performance Evaluation
In this section, we present the extensive experimentation carried out to investigate the performance of NMF ensembles for extractive summarization. First, we evaluate the performance based on the combining methods described in Sec. 2 and the size of the ensemble. Next, we study the effect of diversity in base models on ensemble performance. Finally, we test the statistical significance of our results. All experiments are performed on the DUC2001 data-set consisting of 308 documents and the DUC2002 data-set. We report macro averaged ROUGE recall scores (R-1: ROUGE-1, R-2: ROUGE-2, R-L: ROUGE-L) (Lin, 2004).
In the interest of brevity, all results are reported as performance gain (+) or loss (-) over NMF-TR scores as the baseline method. Table 1 shows the ROUGE scores of this baseline for the DUC2001 and DUC2002 data-sets, as reported in (Khurana and Bhatnagar, 2019).

Examining Combining Methods
The primary objective of this experiment is to examine the comparative performance of the model integration methods. However, the size of an ensemble is a crucial factor that determines the quantum of performance gain. Oversized ensembles have obvious computational and memory overheads, while undersized ensembles run the risk of little performance gain and reduced stability. We therefore evaluate the performance of each combining method on ensembles of varying sizes. We compute macro-averaged ROUGE recall scores, each averaged over ten executions to combat random variations. Table 2 shows the performance differential for the six combining methods and ten different ensemble sizes on the DUC2002 data-set. A cursory glance is sufficient to conclude that there are more negative entries than positive ones. This degradation in performance is most unexpected. Macro-level analyses in the bottom row and rightmost column consolidate the surprise. The bottom row 'Total' shows the number of times the ensemble improves summary quality across all combining methods. Even for ensemble size 100, there is only a ≈50% chance (9/18) of improving the summary quality across all methods. The rightmost column 'Total' shows the number of times a combining method improves the summary across all sizes. It suggests that even simple combining methods improve only marginally, even for large ensembles.
To confirm the trend, we repeated the same experiment with the DUC2001 data-set (Table 3). Apparently, there is a better chance of improvement for this data-set using NMF ensembles, but the gain is meagre (less than 0.5 in each case) and does not justify the computational overhead.
Consolidating the observations from Tables 2 & 3, none of the combining methods yields noticeably better quality summaries than the baseline method. Further, increasing the size of the ensemble also does not hold promise.
Table 3: Performance differential in macro-averaged ROUGE recall scores w.r.t. NMF-TR for different ensemble sizes and combining methods. The leftmost column gives the size of the ensemble. '-' indicates that the integration method is not meaningful.

Diversity in Base Models
Since summary scores are sensitive to the number of latent topics into which a document is decomposed, varying the number of latent topics while decomposing the term-sentence matrix is a potential source of diversity in the NMF base models. We explore two different ways to accomplish this.
Selecting latent topics from a range: We create 100 base models with the number of latent topics randomly chosen from the range [r, 2r], where r is determined using the method proposed by Aliguliyev (2009). We expect that random initialization and variation in the number of topics would inject diversity into the base models. We do not test this method with stacking, because stacking requires the stacked matrices to have the same number of columns. Results for this experiment (Table 4) show no improvement in the quality of the consensus summary.
Varying latent topics over a range: Suspecting that repetition in the number of latent topics in the previous experiment curbs diversity in the ensemble, we attempt to create diversity by generating base models with all values in the range [r, 2r]. Hence, the size of the ensemble for this experiment is document specific. Here too, it is not possible to create a stacking ensemble, because of the different number of latent topics in each base model. Results for this diversity creation method are presented in Table 5. This method also fails to infuse diversity and shows no promise.

Comparison with best summary
With no success in injecting diversity into the base models, we proceeded to perform a deeper analysis to diagnose the cause of degradation. We wanted to answer the question: 'How many base models are responsible for pulling down the score of the consensus summary?'
To answer this, we evaluated all base summaries and noted their ROUGE scores. The score of the best base summary was compared against that of the ensemble summary, and the comparison was translated into a win (if the ensemble summary score is higher or equal) and a loss otherwise. This exercise was done for all integration methods and ensemble sizes 30 and 100.
The results shown in Table 6 are almost startling. For example, 23/510 for ensemble size 30 means that out of 533 total documents, for only 23 documents was the ensemble summary at least as good as the best base summary; for the remaining 510 documents, the ensemble summary score was worse than that of the best base summary. Thus NMF ensemble summaries fail miserably to improve quality over the best base summary.
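The win/loss bookkeeping described above can be sketched as follows (the function name and the per-document scores are illustrative, not the paper's data):

```python
import numpy as np

def win_loss(base_scores, ensemble_scores):
    """Count wins (ensemble summary at least as good as the best base
    summary) and losses (ensemble summary strictly worse), per document."""
    wins = losses = 0
    for base, ens in zip(base_scores, ensemble_scores):
        if ens >= max(base):
            wins += 1
        else:
            losses += 1
    return wins, losses

# Toy illustration: per-document base-summary ROUGE scores (made up) and
# the corresponding consensus-summary scores.
base = [np.array([0.40, 0.55, 0.48]), np.array([0.30, 0.35]), np.array([0.50])]
ens = [0.55, 0.33, 0.52]
result = win_loss(base, ens)  # -> (2, 1)
```

A win here is deliberately lenient (ties count as wins), which makes the reported 23/510 outcome all the more striking.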

Statistical Significance of Combining methods
We investigate the statistical significance of our results for all combining methods. We employ the bootstrap approach recommended in (Dror et al., 2018), and test the null hypothesis H0: the NMF ensemble method performs no worse than the baseline NMF-TR, against the alternative hypothesis. The tests reveal that the combining methods are not statistically significantly better than the baseline method.
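A simplified one-sided paired bootstrap in the spirit of Dror et al. (2018) can be sketched as follows (the function name and the toy scores are ours, not from the experiments):

```python
import numpy as np

def bootstrap_pvalue(baseline, ensemble, n_boot=10_000, seed=0):
    """One-sided paired bootstrap test on per-document score gains.

    Returns the share of bootstrap resamples whose mean gain (ensemble
    minus baseline) is non-positive; a large value means there is no
    evidence that the ensemble is better than the baseline.
    """
    rng = np.random.default_rng(seed)
    delta = np.asarray(ensemble) - np.asarray(baseline)  # per-document gains
    samples = rng.choice(delta, size=(n_boot, len(delta)), replace=True)
    return float(np.mean(samples.mean(axis=1) <= 0.0))

# Toy illustration: the ensemble scores slightly below the baseline on
# every document, so the test finds no evidence that it is better.
p = bootstrap_pvalue([0.40, 0.42, 0.38], [0.39, 0.41, 0.37])  # -> 1.0
```

In practice the per-document scores would be ROUGE recall values for the baseline and ensemble summaries of each document.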

Discussion and Conclusion
Extensive empirical investigation shows that leveraging the stochastic variations due to random initialization of NMF factor matrices for extractive document summarization is not straightforward. We experimented with the different NMF solvers available in scikit-learn (Pedregosa et al., 2011) and found no change in the results. In the absence of any concrete explanation for the degraded performance, we forward two plausible reasons. First, apparently simple combining methods fail to tease apart the differences in term-topic and topic-sentence strengths in the latent space. Possible future investigations in this direction include projection of these matrices into a higher dimension, and drawing from cluster ensemble research to design more sophisticated combining methods.
The second reason is related to the mechanics of sentence ranking and selection for extractive document summarization. Combining scores from base models alters the sentence ranking, and probably a less important sentence gets pulled up into the summary. This is most likely to happen with the lowest ranked sentence in the summary, and a single bad sentence can lower the summary score substantially. Achieving stable ranks in ensemble methods could be another direction of research.