Content Selection through Paraphrase Detection: Capturing different Semantic Realisations of the Same Idea



Introduction
Summarisation can be seen as an instance of Natural Language Generation (NLG), where "what to say" corresponds to the identification of relevant information, and "how to say it" is associated with the final creation of the summary. When dealing with data coming from the Semantic Web (e.g., RDF triples), the challenge arises of how a good summary can be produced. For instance, given the RDF properties from an infobox of a Wikipedia page, how could a summary expressed in natural language text be generated? And how could this summary sound as natural as possible (i.e., be an abstractive summary), rather than merely being a collection of selected sentences output together (i.e., an extractive summary)? This would require successfully mapping the RDF information to a semantic representation of natural language sentences (e.g., predicate-argument (pred-arg) structures). Towards the long-term objective of generating abstractive summaries from Semantic Web data, the specific goal of this paper is to propose and validate an approach to map linguistic structures that can encode the same meaning with different words (e.g., sentence-to-sentence, pred-arg-to-pred-arg, RDF-to-text) using continuous semantic representations of text. The idea is to decide the level of document representation to work with; convert the text into that representation; and perform a pairwise comparison to decide to what extent two pairs can be mapped or not. To achieve this, different methods were analysed, including traditional WordNet-based ones, as well as more recent ones based on word embeddings. Our approach was tested and validated in the context of document-abstract sentence mapping to check whether it was appropriate for identifying important information. The results showed good performance, indicating that we can rely on the approach and apply it to further contexts (e.g., mapping RDF into natural language).
The remainder of this paper is organised as follows: Section 2 outlines related work. Section 3 explains the proposed approach for mapping linguistic units. Section 4 describes our dataset and experiments. Section 5 provides the results and discussion. Finally, Section 6 draws the main conclusions and highlights possible future directions.

Related Work
Abstractive summarisation is one of the most challenging issues to address automatically, since it requires both deep language understanding and generation with a strong semantic component. For tackling this task, approaches usually need to define an internal representation of the text, which can take the form of SVO triples (Genest and Lapalme, 2011), basic semantic units consisting of actor-action-receiver (Li, 2015), or pred-arg structures (Khan et al., 2015). In this latter work, pred-arg structures extracted from different related documents are compared, so that common or redundant information can be grouped into clusters. For computing a similarity matrix, WordNet-based (https://wordnet.princeton.edu/) similarity metrics are used, mainly relying on the semantic distance between concepts, given WordNet's hierarchy.
On the other hand, previous work on linguistic structure mapping can be related to paraphrase identification (Fernando and Stevenson, 2008; Xu et al., 2015), as well as to pred-arg alignment (Wolfe et al., 2015; Roth and Frank, 2015). However, these works only use semantic similarity metrics based on WordNet or other semantic resources, such as ConceptNet (http://conceptnet5.media.mit.edu/) or FrameNet (https://framenet.icsi.berkeley.edu/fndrupal/).
The use of continuous semantic representations, and in particular the learning or use of Word Embeddings (WE), has been shown to be a more appropriate and powerful approach for representing linguistic elements (words, sentences, paragraphs or documents) (Turian et al., 2010; Dai et al., 2015). Given their good performance, WEs have recently been applied to many natural language generation tasks (Collobert et al., 2011; Kågebäck et al., 2014). The work presented in (Perez-Beltrachini and Gardent, 2016) proposes a method to learn embeddings to lexicalise RDF properties, also showing the potential of this type of representation for the Semantic Web.

Our Mapping Approach
Our approach mainly consists of three stages: i) identification and extraction of text semantic structures; ii) representation of these semantic structures in a continuous vector space; and iii) definition and computation of the similarity between two representations.
For the first stage, depending on the level defined for the linguistic elements (e.g., a clause, a sentence, a paragraph), text processing is carried out using the appropriate tools to obtain the desired structures (e.g., sentence segmentation, semantic role labelling, syntactic parsing, etc.). Then, in the second stage, we represent each structure through its WEs. If the structure consists of more than one element, we compute the final vector as the composition of the WEs of each of the elements it contains. This is a common strategy that has been previously adopted, in which addition or product normally leads to the best results (Mitchell and Lapata, 2008; Blacoe and Lapata, 2012; Kågebäck et al., 2014). Finally, the aim of the third stage is to define a similarity metric between the vectors obtained in the second stage.
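The two composition operations mentioned above (element-wise addition and element-wise product of word vectors) can be sketched as follows. The embedding values are made up for illustration; in practice they would come from a pre-trained model such as GloVe.

```python
# Toy 4-dimensional word embeddings (illustrative values, not real GloVe vectors).
embeddings = {
    "dog":   [0.4, 0.1, 0.8, 0.3],
    "barks": [0.2, 0.9, 0.1, 0.5],
}

def compose_add(vectors):
    """Additive composition: element-wise sum of the word vectors."""
    return [sum(dims) for dims in zip(*vectors)]

def compose_mult(vectors):
    """Multiplicative composition: element-wise product of the word vectors."""
    out = list(vectors[0])
    for vec in vectors[1:]:
        out = [a * b for a, b in zip(out, vec)]
    return out

sentence = ["dog", "barks"]
vecs = [embeddings[w] for w in sentence]
added = compose_add(vecs)        # element-wise sum of the two vectors
multiplied = compose_mult(vecs)  # element-wise product of the two vectors
```

Either composed vector then serves as the continuous representation of the whole structure in the third stage.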

Dataset and Approach Configuration
The English training collection of documents and abstracts from the Single document Summarization (MSS) task of MultiLing 2015 was used as corpus. It consisted of 30 Wikipedia documents on heterogeneous topics (e.g., the history of the University of Texas, the fauna of Australia, or Magic Johnson) and their abstracts, which corresponded to the introductory paragraphs of the Wikipedia page. Documents were rather long, with 3,972 words on average (the longest document had 8,348 words and the shortest 2,091), whereas abstracts had 274 words on average (the maximum was 305 words and the minimum 243), thus resulting in a very low compression ratio of around 7%.
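As a quick sanity check, the reported compression ratio follows directly from the average lengths:

```python
# Average lengths reported for the MultiLing 2015 MSS training data.
avg_doc_words = 3972
avg_abstract_words = 274

compression_ratio = avg_abstract_words / avg_doc_words
print(f"{compression_ratio:.1%}")  # roughly 7%, as stated in the text
```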
For carrying out the experiments, our approach receives document-abstract pairs as input. These correspond to the source documents and the abstracts associated with those documents. Following the stages defined in Section 3, both were segmented into sentences, and the pred-arg structures were automatically identified using the SENNA semantic role labeller. Different configurations were tested for the WEs and the similarity metrics in the second and third stages. For representing either sentences or pred-arg structures, GloVe pre-trained WE vectors (Pennington et al., 2014) were used, specifically the ones derived from the Wikipedia 2014 + Gigaword 5 corpora, containing around 6 billion tokens, and the ones derived from a Common Crawl, with 840 billion tokens. Regarding the similarity metrics, the WordNet-based metrics included the shortest path between synsets, Leacock-Chodorow similarity, Wu-Palmer similarity, Resnik similarity, Jiang-Conrath similarity, and Lin similarity, all of them implemented in NLTK. For the WE settings, the similarity metrics were computed on the basis of the cosine similarity and the Euclidean distance. These latter metrics were applied to the two composition methods for sentence embedding representations: addition and product, as described in (Blacoe and Lapata, 2012). In the end, a total of 38 distinct configurations were obtained.
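The two vector-space metrics used in the third stage, cosine similarity and Euclidean distance, can be sketched over pre-composed sentence vectors as below. The sentence vectors are hypothetical values standing in for additive GloVe compositions; the ranking step shows how document sentences would be ordered against an abstract sentence.

```python
from math import sqrt

# Hypothetical pre-composed sentence vectors (e.g., additive GloVe composition);
# the values are made up for illustration.
doc_sents = {
    "s1": [0.6, 1.0, 0.9, 0.8],
    "s2": [0.1, 0.2, 0.1, 0.9],
}
abstract_sent = [0.5, 0.9, 1.0, 0.7]

def cosine(u, v):
    """Cosine similarity: dot product normalised by the vector magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

def euclidean(u, v):
    """Euclidean distance between two vectors (lower means more similar)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Rank document sentences by cosine similarity to the abstract sentence.
ranking = sorted(doc_sents, key=lambda s: cosine(doc_sents[s], abstract_sent),
                 reverse=True)
```

Each of the 38 configurations in the paper corresponds to one choice of representation, composition method, and metric along these lines.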

Evaluation and Discussion
We addressed the validation of the source document-abstract pair mapping as an extrinsic task using ROUGE (Lin, 2004). ROUGE is a well-known tool employed for summarisation evaluation, which computes the n-gram overlap between an automatic and a reference summary (unigrams: ROUGE-1; bigrams: ROUGE-2, etc.). Our assumption behind this type of evaluation was that, taking the source document snippets of the top-ranked mapping pairs and directly building a summary with them (i.e., an extractive summary), good ROUGE results should be obtained if the mapping was good enough.

Table 1: Results (in percentages) for the extrinsic validation of the mapping.

Table 1 reports the most relevant results obtained. As baselines, we considered the direct ROUGE comparison between the sentences (or pred-arg structures) of the source document and those in the abstract (the TEXT baseline and the PRED-ARG baseline, respectively). We report the results for ROUGE-1, ROUGE-2 and ROUGE-SU4. The results obtained show that representing the semantics of a sentence or pred-arg structure using WEs leads to the best results, improving on those from traditional WordNet-based similarity metrics. The best WE configuration corresponds to the addition composition method with cosine similarity, using the pre-trained WEs derived from Wikipedia+Gigaword. Compared to the state of the art in summarisation, the results with WEs are also encouraging, since previously published results on the same corpus (Alcón and Lloret, 2015) are close to 44% (F-measure for ROUGE-1).
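The core of the ROUGE-1 score used in this evaluation, unigram overlap with clipped counts, can be sketched as follows. This is a simplified illustration, not the official ROUGE-1.5.5 implementation (which additionally handles stemming, stopword removal, and multiple references).

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Simplified ROUGE-1: clipped unigram overlap, returning (recall, precision, F1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each candidate unigram counts at most as often as in the reference.
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

r, p, f = rouge_1("the cat sat on the mat", "the cat lay on a mat")
```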
Concerning the comparison between using the whole text and using only the pred-arg structures, the former obtains better results. This is logical, since the more text there is to compare, the higher the chances of obtaining matching n-grams when evaluating with ROUGE. However, this also limits the capability of abstractive summarisation systems, since we would end up selecting the sentences as they are, thus restricting the method to being purely extractive. Nevertheless, the results obtained using pred-arg structures are still reasonably acceptable, and this type of structure would allow generalising the key content to be selected, which should later be rephrased as a proper sentence, producing an abstractive summary. Next, we provide the top 3 best pair alignments (source document-abstract) of the highest-performing configuration using pred-arg structures as examples. The values in brackets indicate the similarity percentage obtained by our approach.
protected areas - protected areas (100%)

the insects comprising 75% of Australia's known species of animals - The fauna of Australia consists of a huge variety of strange and unique animals; some 83% of mammals, 89% of reptiles, 90% of fish and insects (99.94%)

European settlement, direct exploitation of native fauna, habitat destruction and the introduction of exotic predators and competitive herbivores led to the extinction of some 27 mammal, 23 bird and 4 frog species. - Hunting, the introduction of non-native species, and land-management practices involving the modification or destruction of habitats led to numerous extinctions (99.93%)

Finally, our intuition behind the results obtained (maximum values of 50%) is that not all the information in the abstract can be mapped to information in the source document, indicating that a proper abstract may contain extra information that stems from the world knowledge of its author.

Conclusion and Future Work
This paper presented an approach to automatically map linguistic structures using continuous semantic representations of sentences. The analysis conducted over a wide set of configurations showed that the use of WEs improves the results compared to traditional WordNet-based metrics, making it suitable for data-to-text NLG approaches that need to align content from the Semantic Web with text in natural language. As future work, we plan to evaluate the approach intrinsically and apply it to map non-linguistic information (e.g., RDF) to natural language. We would also like to use the proposed method to create positive and negative training instances to learn classification models for content selection.