Machine Learning Approach to Evaluate MultiLingual Summaries

This paper introduces a new method for evaluating MultiLing text summaries. The method relies on a machine learning approach that combines multiple features to build models that predict the human score (Overall Responsiveness) of a new summary. We tried several single and “ensemble learning” classifiers to build the best model. We tested our method at the summary level, where each text summary is evaluated separately. The correlation between the built models and the human score is higher than the correlation between the baselines and the human score.


Introduction
The evaluation of summarization systems is an important step in their development cycle. It accelerates development by supporting error analysis, system optimization and comparison between systems. The evaluation of a text summary covers its content, its linguistic quality or both. Whatever the type of evaluation, assessing the output of a summarization system is difficult because there is rarely a single good summary. In the extreme case, two summaries of the same document set may use completely different words and/or sentences with different structures. Several metrics have been proposed to evaluate the content, the linguistic quality and the overall responsiveness of monolingual text summaries. We can cite ROUGE (Lin and Hovy, 2003), BE (Hovy et al., 2006), AutoSummENG (Giannakopoulos et al., 2008), BEwTE (Tratz and Hovy, 2008), etc. Some of those metrics, such as ROUGE and AutoSummENG, can assess multilingual text summaries. However, those metrics evaluate only the content of multilingual text summaries.
To encourage research on automatic multilingual multi-document summarization systems, a new task, dubbed MultiLing Pilot (Giannakopoulos et al., 2011), was introduced at the TAC 2011 conference. Later, two workshops, the 2013 ACL MultiLing Pilot (Giannakopoulos, 2013) and MultiLing 2015 at SIGdial 2015 (Giannakopoulos et al., 2015), were organised with the same purpose as MultiLing Pilot 2011. The summarization systems participating in the MultiLing task were assessed using automatic content metrics such as ROUGE-1, ROUGE-2 and MeMoG, and a manual metric named Overall Responsiveness, which covers the content and the linguistic quality of a text summary. However, the manual evaluation of both the content and the linguistic quality of multilingual multi-document summarization systems is an arduous and costly process. In addition, automatically evaluating only the content of a summary is not enough, because a summary should also have good linguistic quality. For this reason, automatic metrics that evaluate both the content and the linguistic quality of summaries in several languages should be developed. In this context, we propose a new method based on a machine learning approach for evaluating the overall quality of automatic text summaries. This method predicts the human score (Overall Responsiveness) of English and Arabic text summaries by combining multiple content and linguistic quality features.
The rest of the paper is organized as follows. Section 2 introduces the main metrics that have been proposed to evaluate text summaries. Section 3 explains the methodology adopted in our work. Section 4 presents the experiments and results for summary level evaluation. Finally, Section 5 gives the main conclusions and possible future work.

Related Works
Summary evaluation started as a monolingual evaluation task. Several manual and automatic metrics have been developed to evaluate the content and the linguistic quality of a text summary. Manual evaluation is expensive and time-consuming, so there is a need to assess text summaries automatically. One of the standards in automatic evaluation is ROUGE (Lin and Hovy, 2003). It measures the overlapping content between a candidate summary and reference summaries. ROUGE scores are obtained by counting the word N-grams shared by the candidate and the references. Later, Giannakopoulos et al. (2008) introduced the AutoSummENG metric, which is based on the statistical extraction of textual information from the summary. The extracted information is a set of relations between the N-grams of the summary. The N-grams and the relations are represented as a graph where the nodes are the N-grams and the edges represent the relations between them. The similarity is calculated by comparing the graph of the candidate summary with the graph of each reference summary. In a subsequent work, (Giannakopoulos and Karkaletsis, 2010) presented the Merged Model Graph (MeMoG), another variation of AutoSummENG based on N-gram graphs. This variation calculates the merged graph of all reference summaries and then compares the candidate summary graph to this merged graph. Afterwards, the SIMetrix (Summary Input similarity Metrics) measure was developed by (Louis and Nenkova, 2013); it assesses a candidate summary by comparing it with the source documents. SIMetrix computes ten similarity measures based on the comparison between the source documents and the candidate summary. Among these measures we cite the cosine similarity, the Jensen-Shannon divergence, the Kullback-Leibler divergence, etc.
Recently, (Giannakopoulos and Karkaletsis, 2013) proposed the NPowER (N-gram graph Powered Evaluation via Regression) metric, which combines AutoSummENG and MeMoG. They build a linear regression model that predicts a manual (human) score. All the above metrics (ROUGE, AutoSummENG, NPowER and SIMetrix) are used in monolingual and multilingual summary evaluation. Some of those metrics were adapted to multilingual evaluation, while others (e.g. AutoSummENG) supported multilingual evaluation from the beginning.
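To make the N-gram overlap behind ROUGE concrete, the following Python sketch computes a recall-oriented ROUGE-N score. It is an illustration only and not the official ROUGE toolkit: the function names `ngrams` and `rouge_n_recall` are hypothetical, and lowercasing plus whitespace tokenization is an assumption made for brevity (no stemming or stop-word handling).

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the word n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n):
    """Share of reference n-grams that also occur in the candidate.

    Matches are clipped: an n-gram seen twice in a reference counts at most
    as many times as it occurs in the candidate.
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = 0
    total = 0
    for reference in references:
        ref_counts = ngrams(reference.lower().split(), n)
        matched += sum(min(count, cand_counts[gram])
                       for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0
```

For the candidate "the cat sat on the mat" and the single reference "the cat lay on the mat", this sketch gives a ROUGE-1 recall of 5/6 and a ROUGE-2 recall of 3/5.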

Proposed Method
From Table 1, we observe that for Arabic the correlation between ROUGE-2 and Overall Responsiveness is very low. In addition, almost no correlation exists between MeMoG, AutoSummENG, NPowER and Overall Responsiveness for Arabic. This may be due to the complexity of the Arabic language structure. For English, the correlation between the automatic metrics and Overall Responsiveness is better than for Arabic, but it is still low. This motivated us to combine those automatic metrics in order to predict Overall Responsiveness, with the expectation that the combination gives a better correlation than any single metric. In addition, the Overall Responsiveness score is a real number between 1 and 5 that assesses both the content and the linguistic quality of a text summary. This means that we should combine multiple features related to the content and the linguistic quality of a summary. For this reason, we have added multiple syntactic features. A predictive model is then built for each language by combining these features.
The basic idea of the proposed evaluation methodology is to predict the human score (Overall Responsiveness) (Dang and Karolina, 2008) of a candidate summary in Arabic or English. This prediction is obtained by extracting features from the candidate summary itself and from comparing the candidate summary with the source documents or with reference summaries. To obtain the predictive model for each language, the extracted features are combined using a linear regression algorithm. In the following subsections, we first give the list of used features and then describe the combination scheme.

Used features
In the proposed method, we use several classes of features related to the content and the linguistic quality of a text summary. The feature classes are:
• ROUGE scores: ROUGE scores are designed to evaluate the content of a text summary. They are based on the overlap of word N-grams between a candidate summary and one or more reference summaries. According to (Conroy and Dang, 2008), ROUGE variants that take large contexts into account may capture linguistic qualities of the summary, such as some grammatical phenomena. That is, ROUGE variants that use bigrams, trigrams or longer N-grams can capture some grammatical phenomena from the well-formedness of the reference sentences. For this reason, the ROUGE feature class includes ROUGE-1 (R1), ROUGE-2 (R2), ROUGE-3 (R3), ROUGE-4 (R4) and ROUGE-5 (R5), which calculate the overlap of unigrams, bigrams, trigrams, 4-grams and 5-grams, respectively.
• AutoSummENG, MeMoG and NPowER scores: These three scores, which are based on N-gram graphs (Giannakopoulos and Karkaletsis, 2010), are used to assess the content and the readability of a summary. To calculate these scores, three parameters must be set: the minimum length of N-grams, the maximum length of N-grams and the window size between two N-grams. In our experiments, we have used three configurations for each score. The first configuration sets the minimum N-gram length to 1, the maximum N-gram length to 2 and the window size to 3. The second configuration sets the minimum N-gram length to 3, the maximum N-gram length to 3 and the window size to 3. Finally, the third one sets the minimum N-gram length to 4, the maximum N-gram length to 4 and the window size to 3. Because Overall Responsiveness scores evaluate both the content and the linguistic quality of a summary, we have chosen the first configuration to assess the content and the two other configurations to capture some grammatical phenomena from the well-formedness of the reference sentences. We have assumed that, for these scores too, configurations that take large contexts into account may capture the linguistic qualities of the summary.
• SIMetrix scores: we have used the following six scores calculated by SIMetrix (Louis and Nenkova, 2013): the Kullback-Leibler (KL) divergence between the source documents (SDs) and the candidate summary (CS) (KLInputSummary), the KL divergence between the CS and the SDs (KLSummaryInput), the unsmoothed version of the Jensen-Shannon divergence between the SDs and the CS (unsmoothedJSD) and the smoothed one (smoothedJSD), the probability of the unigrams of the CS given the SDs (unigramProb), and the multinomial probability of the CS given the SDs (multinomialProb).
• Syntactic features: the syntactic structure of sentences is an important factor that can determine the linguistic quality of texts. (Schwarm and Ostendorf, 2005) and (Feng et al., 2010) used syntactic features to gauge the readability of text as an assessment of reading level, while (Kate et al., 2010) used syntactic features to predict the linguistic quality of natural-language documents. We implement some of these features using the Stanford parser (Klein and Manning, 2003). We calculate the number and the average number of noun phrases (NP), verbal phrases (VP) and prepositional phrases (PP). The average number for each phrase type is the ratio between the number of phrases of that type and the total number of sentences.
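As an illustration of the divergence-based features in the list above, the following Python sketch computes an unsmoothed Jensen-Shannon divergence between the word distribution of the source documents and that of a candidate summary. It is a minimal sketch and not the SIMetrix implementation: the function names are hypothetical, whitespace tokenization is an assumption, and the smoothing that SIMetrix applies for its smoothed scores is not reproduced. Note that a plain KL divergence is undefined when a word has zero probability in the second distribution, which is why only the Jensen-Shannon variant is shown.

```python
import math
from collections import Counter


def word_distribution(text):
    """Relative frequency of each lowercased, whitespace-separated word."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_divergence(p, q):
    """KL(p || q) in bits; q must be non-zero wherever p is non-zero."""
    return sum(pw * math.log2(pw / q[word]) for word, pw in p.items() if pw > 0)


def js_divergence(p, q):
    """Unsmoothed Jensen-Shannon divergence, between 0 and 1 bit."""
    vocabulary = set(p) | set(q)
    # The mixture m is non-zero wherever p or q is, so both KL terms are defined.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocabulary}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With this definition, two texts with identical word distributions have a divergence of 0, and two texts with no word in common have a divergence of 1 bit; a lower value indicates a summary whose vocabulary is closer to that of the source documents.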

Combination scheme
Before building a predictive model, we first calculate the values of all the features. Then, we select the relevant ones using the "wrapper method" (Kohavi and John, 1997). This method evaluates subsets of features, which makes it possible to detect interactions between features. It evaluates the performance of each subset of features and returns the best one. This does not mean that the other features are not good; it means that the combination of features in the best subset gives the best performance. To build the predictive model (combination scheme) for a language, we have used several basic (single) algorithms and "ensemble learning" algorithms, implemented in the Weka environment (Witten et al., 2011), using a regression method. For basic algorithms we use "GaussianProcesses", "LinearRegression" and "SMOReg". For "ensemble learning" algorithms, we use "Bagging" (Breiman, 1996), "AdditiveRegression" (Friedman, 1999), "Stacking" (Wolpert, 1992) and "Vote" (Kuncheva, 2004).
After testing the algorithms, we adopt the one that produces the best predictive model. Each model is validated with two methods: 10-fold cross-validation and a supplied test set.
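The wrapper idea described above can be sketched in Python as follows. This is an illustration under stated assumptions and not the Weka procedure used in the experiments: the search strategy (greedy forward selection), the learner (ordinary least squares with a tiny ridge term for numerical stability), the selection criterion (cross-validated RMSE) and all function names are choices made for this example.

```python
import random


def fit_ols(X, y):
    """Least-squares weights (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    for i in range(p):
        A[i][i] += 1e-8  # assumption: tiny ridge term keeps the system solvable
    for c in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    w = [0.0] * p
    for c in reversed(range(p)):  # back substitution
        w[c] = (b[c] - sum(A[c][k] * w[k] for k in range(c + 1, p))) / A[c][c]
    return w


def predict(w, row):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))


def cv_rmse(X, y, cols, k=10, seed=0):
    """RMSE of the learner restricted to `cols`, estimated by k-fold CV."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    sq_err = 0.0
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w = fit_ols([[X[i][c] for c in cols] for i in train],
                    [y[i] for i in train])
        sq_err += sum((predict(w, [X[i][c] for c in cols]) - y[i]) ** 2
                      for i in fold)
    return (sq_err / len(y)) ** 0.5


def forward_wrapper(X, y, k=10):
    """Greedily add the feature that most lowers the cross-validated RMSE."""
    selected, best = [], float("inf")
    remaining = list(range(len(X[0])))
    while remaining:
        score, col = min((cv_rmse(X, y, selected + [c], k), c)
                         for c in remaining)
        if score >= best:  # no remaining feature improves the subset
            break
        selected.append(col)
        remaining.remove(col)
        best = score
    return selected, best
```

The point of the wrapper approach is visible in `forward_wrapper`: each candidate subset is scored by actually training and validating the learner on it, so a feature is kept only if it helps in combination with those already selected.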

Corpus
In this article, we use the TAC 2011 MultiLing Pilot corpus (Giannakopoulos et al., 2011) and the MultiLing 2013 corpus (Giannakopoulos, 2013). Both corpora contain the source documents, peer summaries, model summaries, and automatic and manual evaluation results. The first corpus is available in 7 languages, of which we use only the Arabic and English documents. For Arabic, there are seven participating systems and two baseline systems, while for English there are eight participating systems and two baseline systems. For each language, the source documents are divided into ten collections of newspaper articles. Each collection includes ten articles related to the same topic and has three model (human) summaries. Each summarization system is invited to generate a summary for each collection of documents. The MultiLing 2013 corpus is available in 10 languages, of which we again use only the Arabic and English documents. For each language, there are eight participating systems, two baseline systems and 15 collections of newspaper articles. Each collection includes ten articles related to the same topic, and each summarization system is invited to generate a summary for each collection.

Experiments and results
We have tested our method at the summary level (micro-evaluation). At this level, each summary produced by each summarization system is treated as a separate entry. It is worth mentioning that this evaluation level is more difficult than system level evaluation (i.e. where the average quality of a summarization system is measured), even for monolingual summary evaluation (Ellouze et al., 2013; Ellouze et al., 2016). For each language, we have tested several single and "ensemble learning" classifiers available in the Weka environment and based on a regression method, such as GaussianProcesses, LinearRegression, Vote, Bagging, etc.
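To illustrate what an "ensemble learning" regressor such as Bagging does, the following Python sketch fits one base model per bootstrap sample and averages their predictions. It is a toy illustration and not the Weka Bagging classifier used in the experiments: the base learner (a one-feature least-squares line), the number of models and the function names are assumptions made for this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # degenerate sample (a single distinct x): predict the mean
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def bagging_fit(xs, ys, n_models=10, seed=0):
    """Fit one base model per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in sample],
                               [ys[i] for i in sample]))
    return models


def bagging_predict(models, x):
    """Average the predictions of all base models."""
    return sum(slope * x + intercept for slope, intercept in models) / len(models)
```

Averaging over bootstrap samples mainly reduces the variance of the base learner, which is one plausible reason for an ensemble to behave more stably than a single model on a small training set.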
We validate our models using 10-fold cross-validation and a supplied test set. For the cross-validation method, we have calculated the features from the "MultiLing 2013" corpus. For the supplied test set method, we have used the "MultiLing 2013" corpus as the training set and the "MultiLing Pilot TAC'2011" corpus as the testing set. We have chosen to train our models on the "MultiLing 2013" corpus because it contains more summaries (150 summaries for Arabic and 149 for English). To evaluate the proposed method, we compute the Pearson (Pearson, 1895), Spearman (Spearman, 1910) and Kendall (Kendall, 1938) correlations between the manual scores (Overall Responsiveness) and the scores produced by the proposed method. Furthermore, we report the "Root Mean Squared Error" (RMSE) of each model. This measure is based on the difference between the manual scores (Overall Responsiveness) and the predicted scores.
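The four evaluation measures named above can be sketched in plain Python as follows. This is an illustration rather than the code used to produce the reported figures; the function names are hypothetical, and two choices are assumptions: tied values receive average ranks in the Spearman computation, and the Kendall coefficient is the tau-b variant, which corrects for ties (relevant because human scores on a 1 to 5 scale contain many ties).

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall rank correlation with the tau-b correction for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: ignored by tau-b
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt(
        (untied + tied_x_only) * (untied + tied_y_only))


def rmse(actual, predicted):
    """Root mean squared error between manual and predicted scores."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))
```

For example, with manual scores [1, 2, 3, 4] and predicted scores [1, 3, 2, 4], the Spearman correlation is 0.8 and the Kendall correlation is 2/3, because one of the six pairs of summaries is ranked in the opposite order.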

Arabic Summary Evaluation
We begin with the experiments performed on Arabic. The selected features for the Arabic models are: AutoSummENG 443 (the third configuration: minimum N-gram length 4, maximum N-gram length 4, window size 3), unsmoothedJSD, unigramProb, multinomialProb, ROUGE-3 and the number of NP phrases in the summary. The Pearson, Spearman and Kendall correlations and the root mean squared error (RMSE) obtained with each classifier for Arabic are presented in Table 2. Table 2 shows the performance of the selected features in building the predictive models using several single and ensemble learning classifiers for the Arabic language. In the case of the cross-validation method, the results show that the model built with the "ensemble learning" classifier "Bagging" produced the best Kendall (0.239) and Spearman (0.335) correlations, "AdditiveRegression" produced the best Pearson correlation (0.337), and "GaussianProcesses" produced the lowest RMSE (0.696). In the case of the supplied test set method, Table 2 indicates that the best "ensemble learning" classifier is "Stacking", which provides a model with a Kendall correlation of 0.171 and a Spearman correlation of 0.232, while "GaussianProcesses" produced the best Pearson correlation (0.224) and the lowest RMSE. Another notable observation is that the correlation obtained with cross-validation is higher than with the supplied test set, whereas the RMSE obtained with the supplied test set is lower than with cross-validation. This means that the error between the predicted values and the actual values is smaller with the supplied test set. The decrease in correlation between the cross-validation method and the supplied test set method needs to be studied further in future work.
We now compare the performance of the best obtained model with the baseline metrics adopted by the MultiLing workshop, such as ROUGE-2 and MeMoG. We also add the best variant of each of three other well-known metrics: AutoSummENG, NPowER and SIMetrix. Table 3 details the correlations and RMSEs of the baseline metrics and of our experiments.
From Table 3, the model built from the combination of selected features has the best correlation and RMSE compared to the baselines. Table 3 shows the gap between the baseline metrics and the model built from the selected features. In addition, we notice a decrease in correlation with both validation methods (cross-validation and supplied test set) when we remove one of the classes of features. Moreover, removing the SIMetrix scores from the selected features has a large effect on the correlation of the model with Overall Responsiveness when the supplied test set is used for validation.
Besides, we note that the correlation of the best model with Overall Responsiveness is low, although it is higher than the correlation of the baselines. This may be due to the small number of observations for Arabic. We need a larger set of observations to determine the best combination of features and to obtain a better correlation. Furthermore, this may be due to the complexity of the structure of Arabic, an agglutinative language in which agglutination (Grefenstette et al., 2005) occurs when articles, prepositions and conjunctions are attached to the beginning of words and pronouns are attached to the end of words. This phenomenon can greatly influence the comparison of the candidate summary with the reference summaries, especially when a word appears in the candidate summary without agglutination while it appears in a reference summary in an agglutinated form, and vice versa.

English Summary Evaluation
We now turn to the experiments performed on English. The selected features for the English models are NPowER 123 (the first configuration: minimum N-gram length 1, maximum N-gram length 2, window size 3), AutoSummENG 443, the number of NP phrases in the text summary and the average number of PP per sentence in the text summary. The Pearson, Spearman and Kendall correlations and the root mean squared error (RMSE) obtained with each classifier for English are presented in Table 4. Table 4 shows the performance of the selected features in building the predictive models using several single and ensemble learning classifiers for the English language. For the cross-validation method, the results show that the model built with the "ensemble learning" classifier "Bagging" produced the best Kendall (0.393), Spearman (0.537) and Pearson (0.529) correlations and the lowest RMSE (0.652).
For the supplied test set validation method, Table 4 indicates that the best "ensemble learning" classifier in terms of correlation and RMSE is also "Bagging". This classifier has the best correlations (e.g. Kendall: 0.322) and the lowest RMSE (0.754). Again, we note that the correlation obtained with cross-validation is higher than with the supplied test set. The decrease in correlation between the cross-validation method and the supplied test set method may be caused by the change of human evaluators and/or of evaluation guidelines between MultiLing 2011 and MultiLing 2013.
We now compare the performance of the best obtained model with the baseline metrics adopted by the MultiLing workshop, such as ROUGE-2 and MeMoG. We also add the best variant of each of three other well-known metrics: AutoSummENG, NPowER and SIMetrix. Table 5 details the correlations and RMSEs of the baseline metrics, the other well-known metrics and our best model.
Table 5 shows the gap between the baseline metrics and our experiments with both validation methods. We have retained the model built with the "Bagging" classifier for both validation methods. We also observe that eliminating one of the used classes of features decreases the correlation of the best model (built from the selected features) with Overall Responsiveness and increases the RMSE. Furthermore, eliminating the syntactic feature class greatly decreases the correlation with both validation methods. A surprising observation is that eliminating the AutoSummENG score increases the correlation instead of decreasing it. In general, we have noted the effect of syntactic features in the best model for both languages (Arabic and English).

Conclusion
We have presented a method for evaluating the Overall Responsiveness of text summaries in both Arabic and English. This method is based on a combination of ROUGE scores, AutoSummENG scores, MeMoG scores, NPowER scores, SIMetrix scores and a variety of syntactic features. We have combined these features using a regression method. Before building the linear regression model, we select the relevant features using the "Wrapper subset evaluator" method. The selected features include automatic metrics and syntactic features. In general, automatic features that take large contexts into account are selected (AutoSummENG 443, ROUGE-3, etc.). This is consistent with the hypothesis of (Conroy and Dang, 2008) that scores which take large contexts into account may capture linguistic qualities of the summary.
To evaluate our method, we have compared the correlation of the best model (built with the selected features) and of the baselines with manual Overall Responsiveness. We have tested two methods of validating predictive models: 10-fold cross-validation and a supplied test set. The results show that, in both languages, the correlation of the best model with Overall Responsiveness is low, although it is higher than the correlation of the baselines. This may be due to the small number of observations per language. We need a larger set of observations to determine the best combination of features and to obtain a better correlation. Moreover, we note that the correlation obtained with cross-validation is higher than with the supplied test set. The decrease in correlation between the cross-validation method and the supplied test set method needs to be studied further in future work.