Phrase Generalization: a Corpus Study in Multi-Document Abstracts and Original News Alignments

Content can be expressed at different levels of specificity, varying the amount of detail presented to the reader. The need to transform specific content into more general form naturally arises in summarization, where people and machines need to convey the gist of a text within imposed space constraints. Completely removing sentences and phrases is one way to reduce the level of detail. The bulk of work on summarization content selection and compression deals with these tasks. In this paper, we present a corpus study on a more subtle and under-studied phenomenon: noun phrase generalization. Based on multi-document news and abstract alignments at the phrase level, we arrive at a five-category classification scheme and find that the most common category requires semantic interpretation and inference. The others rely on lexical substitution or deletion of details from the original expression. We provide a systematic analysis, elucidating the capabilities needed for automating the generation of more general or more specific references.


Introduction
Summarization involves a number of complex transformations to condense the gist of a text into a short summary. One of these transformations is changing the amount of detail in the original news texts. Removing entire sentences is one of the fairly well-understood ways of changing the amount of detail. Which sentences to remove can be decided in a system's content selection module by a number of competitive approaches (Gillick and Favre, 2009; Lin and Bilmes, 2011; Kulesza and Taskar, 2011). Similarly, one can perform sentence compression, removing words or phrases from a sentence in the original text to form a summary sentence (Knight and Marcu, 2000; Riezler et al., 2003; Turner and Charniak, 2005; McDonald, 2006; Galley and McKeown, 2007; Cohn and Lapata, 2008), or perform sentence selection and compression jointly (Berg-Kirkpatrick et al., 2011).
In this paper, we focus on a much finer level and study changes of specificity at the phrase level. The existence of these changes has been documented in prior work (Jing and McKeown, 2000; Marsi and Krahmer, 2010). Jing and McKeown (2000) analyzed 30 single document articles and their summaries and characterized the transformations performed on the original text to form a summary. They did not give statistics about the relative frequency of each transformation operation but list "add descriptions or names for people and organizations" and "substitute phrases with more general or specific information" as two of the summarization operations. In a more recent study, Marsi and Krahmer (2010) analyzed the phrase alignment between original spoken news in a Dutch television news program and the subtitles for the same broadcast. They aligned the transcript and the subtitles and analyzed the transformations performed at the phrase level. The authors distinguished five mutually exclusive similarity relations in the corpus: equals (the aligned phrases are identical), restates (the aligned phrases convey the same information but with different wording), specifies (the subtitle phrase is more specific than the transcript phrase), generalizes (the subtitle phrase is more general than the transcript phrase), and intersects (the aligned phrases share some informational content, but each also expresses some information not expressed in the other). The second most frequent class is generalizes (see footnote 1). In about 14% of the aligned phrases, the subtitle contained a more general phrase than the original. Only a small percentage of specifies pairs is present: in about 3% of the phrases the subtitles were more specific than the transcripts.
Here we present an analysis of generalization operations that occur in abstracts produced for clusters of topically related news articles in Brazilian Portuguese. In the vast majority of cases these require transformations at the phrase level. We observed five types of generalization: interpretation, detail removal, class, role, and whole. Named entity (NE) generalizations, in particular, belong to four categories: detail removal (removing some of the information contained in the original article, similar to compression at the phrase level), role (substituting a reference by name with a reference by the role the entity plays in the described events), class (substituting a reference with a superordinate concept, e.g., replacing "swimmer" with "athlete"), and whole (a reference to a member of a group or area is substituted by a reference to the whole, e.g., replacing "Jamaica" with "the Caribbean"). In each category, we identified a set of syntactic-semantic operations related to each type of named entity (person, organization, location and sports event). Such operations include substitutions and phrase reductions. Their automation would require the development of capabilities that are not available to current systems.
The remainder of this paper is structured as follows. In Section 2, we introduce our corpus, explaining the manual alignment between the human abstract and the multiple news text inputs, and the pre-processing of such alignments. In Section 3, we describe the analysis of the alignment pairs containing generalization and the categorization of each instance according to the five-class typology of transformations. Then, in Section 4, we focus on the generalization of phrases containing named entities. Specifically, we describe the syntactic and semantic properties of such phrases, considering both the different types of generalization and the different types of entity. In Section 5, we discuss what we learned and close with a discussion of perspectives for automatic summarization.
1 Equals is the most common relation between aligned phrases, accounting for 67% of the alignments.

The Corpus
We used the CSTNews (Cardoso et al., 2011) corpus of multi-document abstracts and the associated news articles. The corpus comprises 50 clusters of news texts in Brazilian Portuguese from a range of categories: daily news (14), world (14), domestic politics (10), sports (10), economy (1), and science (1). There are 140 documents in total in the corpus.
Each cluster contains two or three news articles on the same topic, with 42 sentences per cluster on average. There are six manual multi-document abstracts for each cluster. The abstract-writers were instructed to produce abstracts of length equal to 30% of the longest article in the cluster. The resulting abstracts were on average seven sentences (132 words) long. CSTNews has annotated versions of the source texts and summaries at different linguistic levels, e.g., intra- and inter-textual discourse relations, classification of temporal expressions, semantic annotation of nouns and verbs, and subtopic segmentation. The corpus also contains alignments between each human abstract and the source texts at the sentence level. Each sentence in the abstract is associated with all of the sentences in the original articles that support the information expressed in the abstract.
For our work, we use the existing manual annotations, pairing sentences from the abstract with their corresponding sentences in the original article (Camargo et al., 2013). The annotators identified 1,007 alignments, involving 334 summary sentences and 877 document sentences: 99.4% of the summary sentences were aligned to some document sentence and 42.43% of the document sentences were aligned to some summary sentence.
In addition, for each pair of summary-original sentences, annotators included labels describing the sub-sentential relations between the sentences in the pair. Among other tags, the annotators labeled when a summary sentence contained parts that were more general or more specific than the semantically corresponding part in the document sentence. They however did not mark the exact spans of text involved in the generalization.
The alignment in (1) shows an example of a summary and document sentence that share information and in which one can observe changes in the specificity of reference. The summary sentence has more general content, referring to "many states" and "the operation" while the document sentence has a list of Brazilian states and the name of the police investigation (shown in bold).
(1) Summary:
Overall, 13% of summary-document pairs involved a generalization or a specification operation. There are 80 pairs tagged as containing generalization and 47 pairs tagged as containing specification (Camargo et al., 2013). The label describes the change that occurred to transform the document sentence into the summary sentence, i.e., generalization means some information is expressed in more general terms in the summary sentence than it was in the original document sentence.

Pre-Processing Steps
With the aim of categorizing the type of every generalization case in the summary-document alignments, we performed two manual pre-processing steps: (i) expansion and revision of the alignments with generalization, and (ii) delimitation of the generalization cases and indexing of the textual spans involved in each case.
Abstracts contained both generalizations and specifications of entities. Assuming that the underlying process involved in modifying the reference is the same in both cases, we augment our corpus of generalizations by "inverting" the 47 specification alignments to obtain 47 examples of generalization, as illustrated in (2). The pair is from a news article about the schedule of the Brazilian men's volleyball team. It was originally tagged as specification, since the summary sentence contains more detail than the original; it details that the team's aim is to win "the gold medal". We swap the direction of the relation between the sentences and consider the resulting sentences as examples of generalization.
In this way, we obtained a set of 127 pairs of aligned sentences with differences in the specificity of reference. Next, each alignment was manually revised by the first author: 12 of them were excluded because the author did not find clear portions of the summary sentence that generalize information expressed in the original document. An example of a sentence that was excluded is given in (3). The final set consists of a total of 115 aligned pairs.
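The inversion step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alignment` record, its field names, and the two toy sentences are hypothetical and do not reflect the CSTNews annotation format; only the counts (80, 47, and 12) come from the text above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one summary-document sentence pair."""
    summary: str
    document: str
    label: str  # "generalization" or "specification"


def to_specific_general(pair: Alignment) -> Tuple[str, str]:
    """Return (specific_sentence, general_sentence) for an alignment."""
    if pair.label == "generalization":
        # The summary sentence is the more general side.
        return pair.document, pair.summary
    if pair.label == "specification":
        # Inverted case: the summary sentence is the more specific side.
        return pair.summary, pair.document
    raise ValueError(f"unexpected label: {pair.label!r}")


def build_generalization_set(pairs: List[Alignment]) -> List[Tuple[str, str]]:
    """Map every alignment to a (specific, general) pair."""
    return [to_specific_general(p) for p in pairs]


# Toy sentences for illustration only; they are not corpus data.
toy = [
    Alignment(summary="Many states were affected.",
              document="Three named states were affected.",
              label="generalization"),
    Alignment(summary="The team wants to win the gold medal.",
              document="The team wants to win.",
              label="specification"),
]
pairs = build_generalization_set(toy)

# Counts reported in the text: 80 + 47 candidates, 12 excluded on revision.
n_candidates = 80 + 47
n_final = n_candidates - 12
```

After inversion, every pair has the same orientation (specific first, general second), so the later categorization does not need to know which side was the summary.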

Typology of Transformations
Further, we iteratively analyzed the types of the 136 cases of generalization to come up with categories that cover all examples in the corpus. We converged on a classification scheme with five categories: (i) Interpretation, i.e., generalization based on sophisticated inferences over the source text and additional information, such as transforming "200 people were injured" to "the human toll was high"; (ii) Detail removal, i.e., generalization by omitting details of a specific textual segment; (iii) Role, i.e., replacement of person entities by their title or role; (iv) Class, i.e., substitution of a subordinate concept by a superordinate one; and (v) Whole, i.e., concepts representing parts are replaced by concepts that indicate the whole. The typology reveals that humans carry out a variety of inferences based on rich world and domain knowledge to produce generic information. Table 1 shows the distribution of the categories divided by clause and phrase levels.
Interpretation is the most frequent category in the corpus (45.6%) and the only one that occurs at both the clause and phrase levels. However, 83.8% of the cases (52 out of 62) occur at the clause level and involve propositional generalizations. We show an example in (5), which relies on an inference that "stocking food, water, flashlights and candles" is a preparedness activity against a hurricane. Detail removal is the second most frequent, with 32 instances (23.5%), followed by Role, with 18 instances (13.2%). The distribution of cases in Class and Whole is quite similar, 13 (9.6%) and 11 (8.1%), respectively. Next, we turn to generalizations that occur at the phrase level, specifically to those involving named entities.
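The category shares quoted above follow directly from the raw counts. The short Python check below recomputes them; the counts are the ones reported in the text, while the variable names are ours.

```python
# Instance counts per category, as reported in the text.
counts = {
    "Interpretation": 62,
    "Detail removal": 32,
    "Role": 18,
    "Class": 13,
    "Whole": 11,
}

total = sum(counts.values())

# Share of each category as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} of {total} ({share}%)")
```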

Named Entity Generalization
We first computed the number of cases that involve named entities or general NPs per category. Table 2 shows the results. Looking briefly at the 33 common noun pairs, we found that Interpretation tends to be associated with numbers (25%). The substitution of "about 300 buildings" with "many buildings" illustrates this. Interpretation also results from different inferences, e.g., when a cause (e.g., "the fog") is replaced by its effect (e.g., "the bad weather"). The Role case where "the 16 children and 14 adults" was replaced with "the 30 hostages" is the only one involving generation of a numeric expression. Detail removal occurs by deleting noun adjuncts (shown in italics) (e.g., "a university campus") or complements (e.g., "the inspection of income tax declarations").
According to Table 3, there are three types of Whole generalizations for locations that solely involve names: (i) island-to-region, such as the replacement of "Haiti" and "Dominican Republic" with "the Caribbean"; (ii) city-to-state, such as "Maceió", which was substituted by "Alagoas"; and (iii) city-to-country, such as the replacement of "Boston" with "United States". There is also one particular type of Detail removal that deletes names from phrases of the form pre-modifier + name (e.g., "the capital Kingston") to produce mentions whose head is the modifying noun of the specific phrase ("the capital"). Location names, specifically multiword expressions (MWEs) (e.g., "International Airport of São Paulo") that are made up of a place (possibly an MWE itself, such as "São Paulo") and additional information (e.g., "international"), are also replaced with common nouns ("the airport"). The replacement of such proper names with common nouns results from removal of all the details about the referent description.
Organization names are mostly generalized by means of common nouns that express class or whole. The substitution of "Brazil" with "the country" illustrates the Class category. The Whole generalization occurs through member-to-organization substitution, as in the replacement of "the Military Police Shock Troop" with "the police". The only case of name generalization is the substitution of "the Archdiocese of Los Angeles" with "the Catholic Church". There are also cases where names followed by acronyms in parentheses, such as "National Institute of Social Security (INSS)", are reduced to the acronym only.
It can be seen that document mentions of people have different head types: full name, first name, last name, and acronym. With the exception of acronyms, the heads usually occur with two types of pre-modifiers (shown in italics): titles (e.g., "president of the Senate, Renan Calheiros") and roles (e.g., "the goalkeeper Vieri"). In general, the document mentions are replaced with common nouns only. The substitution of the first name "João Pedro" with "the senator" illustrates this. The summary writers also chose the modifying noun (shown in italics) from phrases of the form pre-modifier + name (e.g., "the goalkeeper Vieri") for generalization, deleting the last or full name (e.g., "the goalkeeper").
The reduction of a full name by deleting the surname (shown in italics) (e.g., "Renan Calheiros"), yielding phrases containing the first name only (e.g., "Renan"), is another common type of operation. The case that belongs to the Whole category is the only one involving two different types of named entity. In particular, "President of the United States, George Bush" was substituted by "Washington", in a person-to-place operation.

Discussion
This study provides an initial characterization of phrase generalizations that arise in summarization. Our results should be validated on a larger sample of summarization data. Nevertheless, our findings can be seen as a good start for understanding the phenomenon. One of the practical outcomes of our work is the generalization typology, which can be applied to the analysis of other data.
Interpretation is the most common category, resulting from inferences over propositions and covering a variety of operations. Its automatic treatment would be a major endeavor in natural language processing research because it is at the intersection of semantic interpretation and text generation.
Another challenge for summarization systems is how to deal with mentions of numbers, which form a special class of the interpretation transformation. We found that references to date, time, and general quantities accounted for 25% (8 out of 33 instances) of common noun phrase alignments in our corpus. In only one case was the numeric expression transformed into an alternative numeric expression. All other phrases involving numbers were lexicalized alternatively. The task of a system would then be to identify which references to numbers should be generalized and how to generate the generalization of numbers.
In our study, 61% of the generalizations involve operations over specific mentions of named entities. These have been studied computationally in the past, to predict the appropriate form of the name in references to people (Siddharthan et al., 2011) and to exploit the person name repetition in the summary to find the salience of entities (Dunietz and Gillick, 2014). Neither of these prior studies analyzed reference to named entities by common noun, which we provide in the analysis of our data, nor did they look at non-person references. In fact, substituting names with generic nouns was the most common operation in our data and it calls for the development of new capabilities, both to decide which entities should be mentioned generically and how to generate the reference itself.
Moreover, specific mentions of sports events, locations and organizations do not include modification in 88% (22 out of 25) of the pairs. Specific mentions of people have an accompanying description in around half the cases (57.6%). The occurrence of a pre-modifying word that identifies the person's title or role provides more details about the referent. Thus, such mentions have a higher level of specificity than others with a name only. Moreover, only a few generic phrases contain a name, and, when they do, the names are of particular types, e.g., first names in the case of people and acronyms for organizations.
On the operations concerning named entities, we provide some insights for substitution and reduction approaches to obtain general phrases.
Substitution is the most common operation (76.5% of the 51 cases), and automating it would require structured knowledge that includes at least three relationships: (i) is-a, to express the rough notion of "a kind of"; (ii) part-whole, to express island-to-region, city-to-state, city-to-country, member-to-organization, and person-to-place; and (iii) instance-role, for entities of the person category. Since such knowledge is very particular to some domains, especially global and local sports, politics, and geography, we believe that it would be possible to model it in handcrafted lexicons. It could also be derived automatically for some types of reference (McKinlay and Markert, 2011; Mitchell et al., 2015). In addition, modules to decide when substitution is necessary or appropriate would be needed.
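To make the idea of a handcrafted lexicon concrete, the Python sketch below stores the three relationships as a lookup table. It is a minimal illustration, not a proposed system: the `LEXICON` table and the `generalize` function are hypothetical names, the entries are limited to substitution examples observed in our corpus, and their assignment to relations follows the description above. The sketch does not address the separate decision of when a substitution is appropriate.

```python
from typing import Dict, Optional, Tuple

# Hypothetical handcrafted lexicon: (specific mention, relation) -> general phrase.
# Entries are restricted to examples discussed in this paper.
LEXICON: Dict[Tuple[str, str], str] = {
    ("Brazil", "is-a"): "the country",
    ("Haiti", "part-whole"): "the Caribbean",             # island-to-region
    ("Dominican Republic", "part-whole"): "the Caribbean",
    ("Maceió", "part-whole"): "Alagoas",                  # city-to-state
    ("Boston", "part-whole"): "United States",            # city-to-country
    ("the Military Police Shock Troop", "part-whole"): "the police",
    ("João Pedro", "instance-role"): "the senator",
}


def generalize(mention: str, relation: str) -> Optional[str]:
    """Look up a more general phrase; return None if the lexicon has no entry."""
    return LEXICON.get((mention, relation))


print(generalize("Maceió", "part-whole"))
print(generalize("João Pedro", "instance-role"))
print(generalize("Kingston", "part-whole"))  # no entry in this toy lexicon
```

Keying on the relation as well as the mention keeps the choice of generalization type explicit, since the same entity could in principle be generalized along more than one relation.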
Phrase reduction (i.e., deletion of words or phrases) occurs in 23.5% of the 51 cases. Although it is less frequent, detail removal includes cases where specific phrases could be converted into general ones automatically in a more feasible way. This observation is based on the fact that summary phrases are made up of linguistic material that came from the document phrases. Thus, we can conceive of phrase reduction as a task similar to sentence compression, where the operations are learned by analyzing pairs of sentences, one from the source text and the other from human-written abstracts, such that they both have the same content. Specifically, four reduction rules could be defined: (i) removing the name from a phrase of the form pre-modifier + location name, yielding a common noun mention; (ii) removing the name from a phrase of the form organization name + parenthetical acronym, generating a mention with the acronym only; (iii) deleting the name from a phrase of the form title/role + person name, producing a common noun mention; and (iv) removing the surname from a person's full name, generating a first name mention.
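The four reduction rules can be sketched as a small rule-based function. The Python code below is an illustration under strong assumptions: mentions are assumed to be already segmented into an entity type, a name, and an optional pre-modifier (a step that would itself require named entity recognition), and `Mention`, `ACRONYM`, and `reduce_mention` are hypothetical names. For pre-modified mentions it keeps the modifying noun and drops the name, as in "the capital Kingston" reduced to "the capital".

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    """Hypothetical pre-segmented mention of a named entity."""
    entity_type: str                   # "location", "organization", or "person"
    name: str                          # e.g. "Kingston", "Renan Calheiros"
    premodifier: Optional[str] = None  # e.g. "the capital", "the goalkeeper"


# Organization name followed by an acronym in parentheses.
ACRONYM = re.compile(r"^(?P<name>.+?)\s*\((?P<acronym>[A-Z]{2,})\)$")


def reduce_mention(m: Mention) -> Optional[str]:
    """Apply the first matching reduction rule; return None if none applies."""
    # Rules (i) and (iii): keep the pre-modifier, drop the name.
    if m.premodifier and m.entity_type in ("location", "person"):
        return m.premodifier
    # Rule (ii): keep only the parenthetical acronym.
    if m.entity_type == "organization":
        match = ACRONYM.match(m.name)
        if match:
            return match.group("acronym")
    # Rule (iv): keep only the first token of a multi-token person name.
    # Simplification: compound first names (e.g. "João Pedro") are not handled.
    if m.entity_type == "person":
        tokens = m.name.split()
        if len(tokens) > 1:
            return tokens[0]
    return None


print(reduce_mention(Mention("location", "Kingston", "the capital")))
print(reduce_mention(Mention("organization",
                             "National Institute of Social Security (INSS)")))
print(reduce_mention(Mention("person", "Vieri", "the goalkeeper")))
print(reduce_mention(Mention("person", "Renan Calheiros")))
```

The rule order is a design choice of this sketch: a pre-modifier, when present, takes precedence over surname deletion. Which rule a summary writer would actually prefer in context is exactly the decision that would have to be learned from aligned pairs.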
Our findings may also contribute to reference generation, since referring expressions in extracts can be problematic: sentences compiled from different documents might contain too little, too much, or repeated information about the referent. Our results show that 76.5% of the 51 generalizations of named entities (e.g., "the coach Bernardinho") are made solely with a common noun phrase, without the inclusion of the entity's name (e.g., "the coach"), and thus a task to be considered is the generation of common noun references to named entities. Such generation would allow the production of a more natural summary.
We are aware that full coreference resolution is a very difficult problem and that there are no systems that can reliably perform it on free text. But we believe that the availability of cross-document information can facilitate the resolution of common noun phrases. This assumption is built on the fact that most common nouns in summary phrases were contained in the input texts. For example, the head of the summary NP "the coach", which generalizes the name "Bernardinho", is contained in a different sentence of the same input, as part of the mention "the coach Bernardinho". This means that lexical overlap would indicate that these three NPs refer to the same entity. Common noun generation would increase the genericity level of summaries and avoid the repetition of forms produced by some rewriting methods (Siddharthan et al., 2011).
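The lexical overlap heuristic can be illustrated with a short Python sketch that groups mentions sharing at least one content word, so that "the coach" and "Bernardinho" end up linked through "the coach Bernardinho". This is a simplified illustration, not a coreference resolver: the function names and the tiny stopword list are assumptions, the mention "the team" is a toy distractor, and plain overlap would wrongly merge mentions of two different coaches.

```python
from typing import List, Set

# Tiny illustrative stopword list; a real system would need a fuller one.
STOPWORDS = {"the", "a", "an", "of"}


def content_tokens(phrase: str) -> Set[str]:
    """Lowercased tokens of a phrase, without stopwords."""
    return {tok.lower() for tok in phrase.split()} - STOPWORDS


def cluster_by_overlap(mentions: List[str]) -> List[Set[str]]:
    """Group mentions into clusters connected by shared content tokens."""
    clusters: List[Set[str]] = []
    for mention in mentions:
        tokens = content_tokens(mention)
        merged = {mention}
        untouched = []
        for cluster in clusters:
            # Merge every existing cluster that shares a token with the mention.
            if any(tokens & content_tokens(other) for other in cluster):
                merged |= cluster
            else:
                untouched.append(cluster)
        clusters = untouched + [merged]
    return clusters


mentions = ["the coach", "Bernardinho", "the coach Bernardinho", "the team"]
clusters = cluster_by_overlap(mentions)
for cluster in clusters:
    print(sorted(cluster))
```

Because each new mention merges all clusters it overlaps with, the result does not depend on whether the bridging mention ("the coach Bernardinho") is seen before or after the two shorter ones.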

Future work
Our research provides both a preliminary characterization of generalization in document-summary alignments and a discussion of some insights for Natural Language Processing. For future work, we plan to increase the sample of specific-generic pairs by aligning the five new abstracts recently added to each cluster of CSTNews in order to validate our results. We could repeat the manual alignment or use automatic methods (Agostini et al., 2014). To identify the categories, we intend to carry out a manual annotation with multiple judges.
Moreover, we have been performing a manual annotation of coreference chains that consist of all the mentions of an entity in abstracts with different lengths in two languages, Portuguese and English. Our goal is to explore human preferences in mention realization, and possible differences across languages. We also aim at exploring whether the abstract length has influence on the syntactic forms and sequences of mentions, and on the amount of information included in the mentions.