AllSummarizer system at MultiLing 2015: Multilingual single and multi-document summarization

In this paper, we evaluate our automatic text summarization system in a multilingual context. We participated in both the single document and multi-document summarization tasks of the MultiLing 2015 workshop. Our method involves clustering the document sentences into topics using a fuzzy clustering algorithm. Then each sentence is scored according to how well it covers the various topics. This is done using statistical features such as TF, sentence length, etc. Finally, the summary is constructed from the highest scoring sentences, while avoiding overlap between the summary sentences. This makes the method language-independent, but preprocessed data (tokenization, stemming, etc.) must be provided first.


Introduction
A document summary can be regarded as domain-specific or general-purpose, using specificity as the classification criterion (Hovy and Lin, 1998). We can also look at this criterion from the language angle: language-specific or language-independent summarization. Language-independent systems can handle more than one language. They can be partially language-independent, which means they use language-related resources, so adding a new language is not easy. Conversely, they can be fully language-independent.
Recently, multilingual summarization has received the attention of the summarization community, for instance at the Text Analysis Conference (TAC). The TAC 2011 workshop included a task called "MultiLing task", which aims to evaluate language-independent summarization algorithms on a variety of languages. In the task's pilot, there were seven languages covering news texts: Arabic, Czech, English, French, Greek, Hebrew and Hindi, where each system had to participate in at least two languages. The MultiLing 2013 workshop is a community-driven initiative for testing and promoting multilingual summarization methods. It aims to evaluate the application of (partially or fully) language-independent summarization algorithms on a variety of languages. There were three tasks: "Multi-document multilingual summarization" (Giannakopoulos, 2013), "Multilingual single document summarization" (Kubina et al., 2013) and "Multilingual summary evaluation". The multi-document task uses the seven previous languages along with three new ones: Chinese, Romanian and Spanish. The single document task introduces 40 languages.
This paper describes our method (Aries et al., 2013), which uses sentence clustering to define topics and then trains on these topics to score each sentence. We explain each task in the system (AllSummarizer), especially the preprocessing task, which is language-dependent. Then, we discuss how we fixed the summarization hyper-parameters (threshold and features) for each language. The next section (Section 5) discusses the experiments conducted in the MultiLing workshop. Finally, we conclude by discussing possible improvements.

Related works
Clustering has been used for summarization in many systems, using documents, sentences or words as units. The resulting clusters are used to extract the summary. Some systems use just the biggest cluster to score sentences and select the top ones. Others take a representative sentence from each cluster, in order to cover all topics. There are also systems, like ours, which score sentences according to all clusters. "CIST" (Liu et al., 2011; Li et al., 2013) is a system which uses a hierarchical Latent Dirichlet Allocation (hLDA) topic model to cluster sentences into sub-topics. A sub-topic containing more sentences is more important, and therefore those containing just one or two sentences can be neglected. The sentences are scored using the hLDA model combined with some traditional features. The system participated in the multi-document summarization task, where all documents of the same topic are merged into one big text document.
Likewise, "UoEssex" uses a clustering method (K-Means) to group similar sentences. The biggest cluster is used to extract the summary, while the other clusters are ignored. Then, the sentences are scored using their cosine similarities to the cluster's centroid. The use of the biggest cluster is justified by the assumption that a single cluster will give a coherent summary.
The scoring functions of these two systems are based on statistical features like frequencies of words, cosine similarity, etc.
On the contrary, systems like "CLASSY" (Conroy et al., 2011), "SIEL IIITH" and that of El-Haj and Rayson (2013) are corpus-based summarizers, which can make it hard to introduce new languages. "CLASSY" uses naïve Bayes to estimate the probability that a term may be included in the summary. The classifier was trained on DUC 2005-2007 data. As background for each language, Wikinews is used to compute the Dunning G-statistic. "SIEL IIITH" uses a probabilistic Hyperspace Analogue to Language model. Given a word, it estimates the probability of observing another word with it in a window of size K, using a sufficiently large corpus. El-Haj and Rayson (2013) calculate the log-likelihood of each word using a corpus of word frequencies and the MultiLing'13 dataset. The score of each sentence is the sum of its words' log-likelihoods.
In our method (Aries et al., 2013), we use a simple fuzzy clustering algorithm. We assume that a sentence can express many topics, and therefore it can belong to many clusters. Also, we believe that a summary must take into consideration topics other than the main one (the biggest cluster). To score sentences, we use a scoring function based on Naïve Bayes classification. It uses the clusters for training rather than a corpus, in order to avoid the problem of language dependency.

System overview
One of the problems of multilingual summarization is the lack of resources such as labeled corpora used for learning. Learning algorithms have been used either to select the sentences that should be in the summary, or to estimate the features' weights. Both cases need a training corpus for the language and the domain we want to adapt the summarizer to. To design a language-neutral summarization system, we either adapt a system to the input languages (partly language-neutral), or design a system that can process any language (fully language-neutral).
Our sentence extraction method can be applied to any language without modification, provided the preprocessing step is available for the input language. To do this, we had to find a way to train our system other than using a corpus (which is language and topic dependent). The idea was to find different topics in the input text using the similarity between sentences. Then, we train the system using a scoring function based on the Bayes classification algorithm and a set of features, to find the probability of a feature given the topic. Finally, we calculate for each sentence a score that reflects how well it can represent all the topics.
In our previous work (Aries et al., 2013), our system used only two features of the same nature (TF: uni-grams and bi-grams). Adding new features can affect the final result (summary). Also, our clustering method relies on the clustering threshold, which has to be estimated somehow. To handle multi-document summarization, we just fuse all documents of the same topic and consider them as one document. Figure 1 represents the general architecture of AllSummarizer.

Preprocessing
This is the language-dependent part, which can be found in many information retrieval (IR) works. In our system, we are interested in four preprocessing tasks:
• Normalizer: in this step, we can delete special characters. For Arabic, we can delete diacritics (Tashkiil) if we don't need them in the process (which is our case).
• Stemmer: the role of this task is to delete suffixes and prefixes so we can get the stem of a word.
• Stop-words eliminator: it is used to remove the stop words, which are words that add no meaning to the text.
In this work, normalization is used just for Arabic and Persian, to delete diacritics (Tashkiil). Concerning stop-word elimination, we use precompiled word lists available on the web. Table 1 shows each language and the tools used in the remaining pre-processing tasks.

Topics clustering
Each text contains many topics, where a topic is a set of sentences having some sort of relationship with each other. In our case, this relationship is the cosine similarity between each pair of sentences. That is, sentences that have many terms in common are considered to be in the same topic. Given two sentences X and Y, the cosine similarity between them is expressed by equation 1.
$$\cos(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \, \sqrt{\sum_i y_i^2}} \quad (1)$$
Where $x_i$ ($y_i$) denotes the frequency of each term in the sentence X (Y).
To generate topics, we use a simple algorithm (see algorithm 1) which uses cosine similarity and a clustering threshold th to cluster n sentences.
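As an illustration of this step, the sketch below computes the cosine similarity on raw term counts and then groups sentences with a threshold `th`. The grouping rule is an assumption made for the example (one candidate cluster per sentence, containing every sentence whose similarity to it exceeds `th`, with duplicate clusters dropped); the function names are hypothetical and the rule is not guaranteed to match algorithm 1 step for step. It does keep the fuzzy property that a sentence may belong to several clusters.

```python
import math
from collections import Counter


def cosine_similarity(x, y):
    """Cosine similarity between two sentences given as lists of terms."""
    fx, fy = Counter(x), Counter(y)
    dot = sum(fx[t] * fy[t] for t in fx.keys() & fy.keys())
    norm = math.sqrt(sum(v * v for v in fx.values())) * math.sqrt(
        sum(v * v for v in fy.values())
    )
    return dot / norm if norm else 0.0


def cluster_sentences(sentences, th):
    """Assumed rule: sentence i seeds a cluster holding itself and every
    sentence j with similarity(i, j) > th; identical clusters are kept once."""
    clusters = []
    for i, si in enumerate(sentences):
        members = frozenset(
            j
            for j, sj in enumerate(sentences)
            if j == i or cosine_similarity(si, sj) > th
        )
        if members not in clusters:
            clusters.append(members)
    return clusters
```

Clusters are returned as sets of sentence indices, so the same index can appear in more than one cluster.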

Scoring function
A summary is a short text that is supposed to represent most of the information in the source text and cover most of its topics. Therefore, we assume that a sentence s i can be in the summary when it is most probable to represent all topics (clusters) c j ∈ C using a set of features f k ∈ F . We used Naïve Bayes, assuming independence between different classes and different features (a sentence can have multiple classes). So, the score of a sentence s i is the product over classes of the product over features of its score in a specific class and feature (see equation 2).
The score of a sentence s i in a specific class c j and feature f k is the sum of the probabilities of the feature's observations when s i ∈ c j (see equation 3). We add one to the sum to avoid multiplying by a feature score of zero.
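Written out from the verbal description alone, the two scores can be sketched as follows. The notation ($Score$ for the scores, $P$ for the probability of an observation) is assumed for this reconstruction and may differ from the original typesetting:

```latex
Score(s_i) = \prod_{c_j \in C} \prod_{f_k \in F} Score(s_i, c_j, f_k) \quad (2)

Score(s_i, c_j, f_k) = 1 + \sum_{\phi \in s_i} P(f_k = \phi \mid s_i \in c_j) \quad (3)
```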
Where φ is an observation of the feature f k in the sentence s i . For example, assuming the feature f 1 is term frequency, and we have a sentence: "I am studying at home.". The sentence after pre-processing would be: s 1 = {"studi"(stem of "study"), "home"}. So, φ may be "studi" or "home", or any other term. If we take another feature f 2 which is sentence position, the observation φ may take 1st, 2nd, 3rd, etc. as values.
Each feature divides the sentences into several categories. For example, if we have a text written with just three characters, a, b and c, and the feature is the characters of the text, then we will have three categories. Each category has a probability of occurring in a cluster, which is the number of its appearances in this cluster divided by all cluster's terms, as shown in equation 4.
Where f is a given feature, φ and φ′ are observations (categories) of the feature f, and C is the set of clusters.
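A minimal sketch of this scoring scheme is given below. It rests on two assumptions that are not fixed by the text: the denominator of the observation probability is taken to be the total number of observations of the feature inside the same cluster, and features are modelled as plain functions that return the list of observations of a sentence. All function names are hypothetical.

```python
from collections import Counter


def unigrams(sentence):
    """Observations of the unigram feature: the terms themselves."""
    return list(sentence)


def bigrams(sentence):
    """Observations of the bigram feature: pairs of consecutive terms."""
    return list(zip(sentence, sentence[1:]))


def observation_probs(cluster, feature):
    """Probability of each observation in a cluster: its count divided by
    the number of observations in that cluster (assumed denominator)."""
    counts = Counter(phi for sentence in cluster for phi in feature(sentence))
    total = sum(counts.values())
    return {phi: n / total for phi, n in counts.items()} if total else {}


def score_sentence(sentence, clusters, features):
    """Product over clusters and features of (1 + sum of the probabilities
    of the sentence's observations in that cluster)."""
    score = 1.0
    for cluster in clusters:
        for feature in features:
            probs = observation_probs(cluster, feature)
            score *= 1.0 + sum(probs.get(phi, 0.0) for phi in feature(sentence))
    return score
```

Here a sentence is a tuple of pre-processed terms and a cluster is a list of such tuples. Observations that never occur in a cluster contribute a probability of zero, which is why the added one matters: a sentence unrelated to one cluster keeps the score it earned in the others.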

Unigram term frequency
This feature is used to calculate the sentence pertinence depending on its terms. Each term is considered as a category.

Bigram term frequency
This feature is similar to unigram term frequency, but instead of one term we use two consecutive terms.

Sentence position
We want to use sentence positions in the original texts as a feature. The position feature used by Osborne (2002) divides the sentences into three sets: those in the first 8 paragraphs, those in the last 3 paragraphs, and the others in between, following the assumption that the first and last sentences are more important than the others. Three categories of sentence positions seem too few to express the diversity between the clusters. Instead, we divided the position space into 10 categories. So, if we have 20 sentences, we will have 2 sentences per category.
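One way to obtain such categories is equal-width binning of the sentence index, sketched below. The binning rule and the function name are assumptions for illustration; the text only states that the position space is split into 10 categories.

```python
def position_category(index, n_sentences, n_categories=10):
    """Map a 0-based sentence index to one of n_categories equal-width bins."""
    return index * n_categories // n_sentences
```

With 20 sentences, indices 0 and 1 fall in category 0, indices 2 and 3 in category 1, and so on up to category 9, which gives the 2 sentences per category mentioned above.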

Sentence length
Another feature applied in our system is the sentence length (number of words), which was originally used to penalize short sentences. According to its length, a sentence can be put in one of three categories: sentences shorter than 6 words, those longer than 20 words, and those in between (Osborne, 2002).
As with sentence position, three categories is a small number. Therefore, we used each length as a category. Suppose we have 4 sentences whose lengths are 5, 6, 5 and 7; then we will have 3 categories of lengths: 5, 6 and 7.
In our work, we use two types of sentence length:
• Real length (RLeng): the length of the sentence without removing stop-words.
• Pre-processed length (PLeng): the length of the sentence after pre-processing.
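The two length features can be illustrated with a few lines of code. This is only a sketch with hypothetical function names; it assumes that sentences are already tokenized, once before and once after pre-processing.

```python
from collections import Counter


def length_features(raw_tokens, preprocessed_tokens):
    """Return the two length observations of one sentence."""
    return {"RLeng": len(raw_tokens), "PLeng": len(preprocessed_tokens)}


def length_categories(sentences):
    """One category per distinct length, with the number of sentences in each."""
    return Counter(len(tokens) for tokens in sentences)
```

For the earlier example sentence "I am studying at home.", the raw tokens are the five words and the pre-processed tokens are "studi" and "home", so RLeng is 5 and PLeng is 2.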

Summary extraction
To extract sentences, we sort them in decreasing order of their scores. Then we extract the first non-similar sentences until we reach the desired size (see algorithm 2).

Algorithm 2: extraction method
Data: input text
Result: a summary
add the first sentence to the summary;
foreach sentence in the text do
    calculate cosine similarity between this sentence and the last accepted one;
    if the similarity is under the threshold then
        add this sentence to the summary;
    end
    if the sum of the summary size and the current sentence's size is above the maximum size then
        delete this sentence from the summary;
    end
end

Summarization parameters
In this section, we describe how the summarization parameters have been chosen. The first parameter is the clustering threshold, which leads to a few huge clusters if it is small, and to many small clusters if it is large. The clustering threshold is used with the sentences' similarities to decide whether two sentences are similar or not. Our idea is to use statistical measures over those similarities to estimate the clustering threshold. Eight measures have been used:
• The median
• The mean
• The mode, which can be divided into two: lower mode and higher mode, since we can have many modes.
• The variance
Where |s| is the number of different terms in a sentence s, |D| is the number of different terms in the document D, and n is the number of sentences in this document.
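The measures named explicitly in the list can be computed from the pairwise similarities with Python's standard `statistics` module, as in the sketch below. Two choices here are assumptions: the population variance is used, and the modes are taken over the raw similarity values without rounding. The measures based on |s|, |D| and n are left out because their formulas are not given in the text, and the function name is hypothetical.

```python
import statistics


def threshold_candidates(similarities):
    """Candidate clustering thresholds derived from pairwise similarities."""
    modes = statistics.multimode(similarities)  # all most-frequent values
    return {
        "median": statistics.median(similarities),
        "mean": statistics.mean(similarities),
        "lower_mode": min(modes),
        "higher_mode": max(modes),
        "variance": statistics.pvariance(similarities),  # population variance
    }
```

Each returned value is one candidate for th; the choice among them is made per language, as described next.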
The second parameter is the features' set, which is the combination of at least one of the five features described in section 3.4. We want to know which features are useful and which are not for a given language.
To fix the clustering threshold and the set of features, we used the training sets provided by the workshop organizers. For each document (or topic in multi-document), we generated summaries using the 8 measures of th and different combinations of the scoring features. Then, we calculated the average ROUGE-2 score for each language. The threshold measure and the set of features that maximize this average are used as parameters for that language. Table 2 shows an example of the 10 languages and their parameters used for both tasks: MSS and MMS. We have to point out that the parameters based on the average are not always the best choice for individual documents (or topics in multi-document). For example, in MSS, there is a document which gives a ROUGE-2 score of 0.28 when we use the parameters based on average scores. When we use the mean as threshold and just TFB as feature for the same document, we get a ROUGE-2 score of 0.31.
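The selection procedure amounts to an exhaustive search over threshold measures and non-empty feature subsets, which can be sketched as follows. The callback `rouge2(document, measure, subset)` is a hypothetical stand-in for "summarize the document with these parameters and evaluate the summary with ROUGE-2"; it is not part of any real library.

```python
from itertools import combinations
from statistics import mean


def best_parameters(documents, measures, features, rouge2):
    """Return (average score, measure, feature subset) maximizing the
    average ROUGE-2 over the training documents of one language."""
    best = None
    for measure in measures:
        for size in range(1, len(features) + 1):
            for subset in combinations(features, size):
                avg = mean(rouge2(doc, measure, subset) for doc in documents)
                if best is None or avg > best[0]:
                    best = (avg, measure, subset)
    return best
```

With 5 features there are 2^5 − 1 = 31 non-empty subsets, so 8 threshold measures give 248 configurations per language. Because the criterion is an average, the selected configuration can be worse than another one on a particular document, which is the situation described in the example above.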

Experiments
We participated in all of the workshop's languages, whether in the single document or the multi-document task. To compare our system to the other participating systems, we followed these steps (for every evaluation metric):
• For each system, calculate the average score over all the languages it used.
• For our system, calculate the average score over the languages used by the other system. For example, the BGU-SCE-M team uses Arabic, English and Hebrew; we calculate the average of the scores of these languages for this system and ours.
• Then, calculate the relative improvement using the averages: $\frac{\text{our system} - \text{other system}}{\text{other system}}$.
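The comparison steps above can be sketched in a few lines. Scores are assumed to be stored as dictionaries mapping a language to a metric value; this data layout and the function names are illustrative only.

```python
def relative_improvement(ours, other):
    """(our system - other system) / other system."""
    return (ours - other) / other


def compare(our_scores, other_scores):
    """Average both systems over the languages covered by the other system,
    then return the relative improvement of our system."""
    langs = list(other_scores)
    ours = sum(our_scores[lang] for lang in langs) / len(langs)
    other = sum(other_scores[lang] for lang in langs) / len(langs)
    return relative_improvement(ours, other)
```

A positive value means our system scores higher on average over the shared languages, and a negative value means the other system does.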

Evaluation metrics
In the "Single document summarization" task, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) (Lin, 2004) is used to evaluate the participating systems. It allows us to evaluate automatic text summaries against human-made abstracts. The principle of this method is to compare the N-grams of two summaries, counting the number of matches between them based on the recall measure. Five metrics are used: ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4 and ROUGE-SU4.
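To make the N-gram recall principle concrete, here is a simplified sketch of ROUGE-N recall against a single reference summary. It is not the official ROUGE toolkit used in the workshop: it ignores stemming, multiple references and the skip-bigram variant (ROUGE-SU4), and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Share of reference n-grams also found in the candidate, where each
    match is counted at most as often as it occurs in the candidate."""
    ref = ngrams(reference, n)
    cand = ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

For the candidate "the cat sat" and the reference "the cat sat down", three of the four reference unigrams are matched, so the unigram recall is 0.75, and two of the three reference bigrams are matched.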

Single document summarization
Besides our system (AllSummarizer), there are two more systems which participated in all 38 languages (EXB and CCS). Table 3 shows the comparison between our system and the other systems in single document task, using the relative improvement.
Looking at these results, our system took the fifth place out of seven participants. It outperforms the Lead baseline. It took the last place out of three participants in all 38 languages.

Multi-document summarization
Besides our system (AllSummarizer), there are 4 systems that participated in all 10 languages. Table 4 shows a comparison between our system and the other systems in the multi-document task, using the relative improvement. We also used the parameters fixed for single document summarization, to see whether the same parameters are applicable to both single and multi-document summarization.
Looking at the results, our system took the seventh place out of ten participants. When we use the single document parameters, the results do not outperform those obtained with the parameters fixed for multi-document summarization. This shows that we can't use the same parameters for both single and multi-document summarization.

Conclusion
Our intention is to create a method which is language and domain independent. So, we consider the input text as a set of topics, where a sentence can belong to many topics. We calculated how well a sentence can represent all the topics. Then, the score is used to reorder the sentences and extract the first non-redundant ones. We tested our system using the average score over all languages, in single and multi-document summarization. Compared to other systems, it gives fair results, but more improvements have to be made in the future. We have to point out that our system participated in all languages. Also, it is easy to add new languages when tokenization and stemming are available.
We fixed the parameters (threshold and features) based on the average ROUGE-2 score of all training documents. Further investigations must be done to estimate these parameters for each document based on statistical criteria. We want to investigate the effect of the preprocessing step and the clustering methods on the resulting summaries. Finally, readability remains a challenge for extractive methods, especially when we want to use a multilingual method.