Sentence Embedding Evaluation Using Pyramid Annotation

Word embedding vectors are used as input for a variety of tasks. Choosing the right model and features for producing such vectors is not trivial, and different embedding methods can greatly affect results. In this paper we repurpose the "Pyramid Method" annotations used for evaluating automatic summarization to create a benchmark for comparing embedding models when identifying paraphrases of text snippets containing a single clause. We present a method of converting pyramid annotation files into two distinct sentence embedding tests. We show that our method can produce a good amount of testing data, analyze the quality of that data, run the tests on several leading embedding methods, and finally explain the downstream uses of our task and its significance.


Introduction
Word vector embeddings [Mikolov et al. 2013] have become a standard building block for NLP applications. By representing words using continuous multi-dimensional vectors, applications take advantage of the natural associations among words to improve task performance. For example, POS tagging [Al Rfou et al. 2014], NER [Passos et al. 2014], parsing [Bansal et al. 2014], Semantic Role Labeling [Herman et al. 2014] and sentiment analysis [Socher et al. 2011] have all been shown to benefit from word embeddings, either as additional features in existing supervised machine learning architectures, or as exclusive word representation features. In deep learning applications, word embeddings are typically used as pre-trained initial layers in deep architectures, and have been shown to improve performance on a wide range of tasks as well (see, for example, [Cho et al., 2014; Karpathy and Fei-Fei, 2015; Erhan et al., 2010]).
One of the key benefits of word embeddings is that, because they are trained on massive datasets in an unsupervised manner, they bring to tasks with small annotated datasets and a small observed vocabulary the capacity to generalize to large vocabularies and to handle unseen words smoothly.
Training word embedding models is still an art with various embedding algorithms possible and many parameters that can greatly affect the results of each algorithm. It remains difficult to predict which word embeddings are most appropriate to a given task, whether fine tuning of the embeddings is required, and which parameters perform best for a given application.
We introduce a novel dataset for comparing embedding algorithms and their settings on the specific task of comparing short clauses. The current state-of-the-art paraphrase dataset [Dolan and Brockett, 2005] is quite small, with 4,076 sentence pairs (2,753 positive). The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) contains 570k sentence pairs, each labeled with one of the tags entailment, contradiction, or neutral. SNLI improves on previous paraphrase datasets by eliminating the indeterminacy of event and entity coreference, which makes human entailment judgment difficult. Such indeterminacies are avoided by eliciting descriptions of the same images from different annotators.
We repurpose manually created datasets from automatic summarization to create a new paraphrase dataset with 197,619 pairs (8,390 positive, with challenging distractors among the negative pairs). Like SNLI, our dataset avoids semantic indeterminacy because the texts are generated from the same news reports. We thus obtain definite entailment judgments, but in the richer domain of news reports as opposed to image descriptions. The propositions in our dataset are on average 12.1 words long (as opposed to about 8 words for the SNLI hypotheses).
In addition to paraphrase, our dataset captures a notion of centrality: the clause elements captured are Summary Content Units (SCUs), which are typically shorter than full sentences and are intended to capture proposition-level facts. As such, the new dataset is relevant for exercising the large family of "Sequence to Sequence" (seq2seq) tasks involving the generation of short text clauses [Sutskever et al. 2014].
The paper is structured as follows: §2 describes the pyramid method; §3 describes the process for generating a paraphrase dataset from a pyramid dataset; in §4, we evaluate a number of algorithms on the new benchmark and in §5, we explain the importance of the task.

The Pyramid Method
The Pyramid Method (Nenkova and Passonneau, 2004) is a summarization evaluation scheme designed to achieve consistent scores while taking into account human variation in content selection and formulation. This evaluation method is manual and can be applied to both manual and automatic summarization. It has been included as a main evaluation technique in all DUC datasets since 2005 (Passonneau et al., 2006).
In order to use the method, a pyramid file must first be created manually (Fig. 1):
• Create a set of model (gold) summaries
• Divide each summary into Summary Content Units (SCUs). SCUs are key facts extracted from the manual summarizations; they are no longer than a single clause
• A pyramid file is created where each SCU is given a score by the number of summaries in which it is mentioned (i.e., SCUs mentioned in 3 summaries will obtain a score of 3)
After the pyramid is created, it can be used to evaluate a new summary:
• Find all the SCUs in the summary
• Sum the score of all the found SCUs and divide it by the maximum score that the same amount of SCUs can achieve
SCUs are extracted from different source summaries, written by different authors. When counting the number of occurrences of an SCU, annotators effectively create clusters of text snippets that are judged semantically equivalent in the context of the source summaries. SCUs actually refer to clusters of text fragments from the summaries and a label written by the pyramid annotator describing the meaning of the SCU.
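The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the official DUC scoring tool: the function name `pyramid_score` and the toy weights are assumptions made here, and SCU matching (which is done manually) is taken as given.

```python
def pyramid_score(found_weights, pyramid_weights):
    """Observed SCU weight divided by the best weight achievable
    with the same number of SCUs (hypothetical helper)."""
    n = len(found_weights)
    if n == 0:
        return 0.0
    # An ideal summary with n SCUs would contain the n highest-weighted SCUs.
    best = sum(sorted(pyramid_weights, reverse=True)[:n])
    return sum(found_weights) / best if best else 0.0


# Toy pyramid: each number is how many model summaries mention that SCU.
pyramid_weights = [4, 3, 3, 2, 1, 1]
# Weights of the SCUs an annotator found in the summary being evaluated.
found_weights = [4, 2, 1]
score = pyramid_score(found_weights, pyramid_weights)  # 7 / (4 + 3 + 3)
```

The denominator uses the top-weighted SCUs of the pyramid, so a summary is only penalized for choosing less central content, not for being short.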
In our evaluation, we divert the pyramid file from its original intention of summarization evaluation, and propose to use it as a proposition paraphrase dataset.

Repurposing Pyramid Annotations
We define two types of tests that can be produced from a pyramid file: a binary decision test and a ranking test. For the binary decision test, we collect pairs of different SCUs from manual summaries and the label given to the SCU by annotators. The binary decision consists of deciding whether the pair is taken from the same SCU. In order to make the test challenging, pairs must satisfy the following constraints:
• For non-paraphrase pairs, both items must match on more than 3 words;
• Both items must not include any pronouns;
• The pair must be lexically varied (at least one content word must be different across the items).
For the ranking test, we generate a set of multiple choice questions by taking as a question an SCU appearance in the text, with the correct answer being another appearance of the same SCU in the test.
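The pair constraints for the binary decision test can be sketched as a filter. This is only an illustration under stated assumptions: the pronoun list, the stopword list used to approximate "content words", the tokenizer, and the name `keep_pair` are all hypothetical, and we assume "match on more than 3 words" counts distinct shared tokens.

```python
import re

# Assumed word lists; the paper does not specify which ones were used.
PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your", "he", "him",
            "his", "she", "her", "hers", "it", "its", "they", "them", "their"}
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "for", "by",
             "with", "is", "was", "are", "were"}


def tokens(text):
    """Lowercase alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keep_pair(a, b, is_paraphrase):
    """Return True if the pair passes the three constraints."""
    words_a, words_b = set(tokens(a)), set(tokens(b))
    # Neither item may contain a pronoun.
    if PRONOUNS & (words_a | words_b):
        return False
    # At least one content word must differ across the items.
    if words_a - STOPWORDS == words_b - STOPWORDS:
        return False
    # Negative pairs must share more than 3 words to be hard distractors.
    if not is_paraphrase and len(words_a & words_b) <= 3:
        return False
    return True


hard_negative = keep_pair(
    "Dees and the SPLC seek to destroy hate groups",
    "Dees and the SPLC have fought to break the organizations",
    is_paraphrase=False,
)
```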
To create synthetic distractors, we use the 3 most lexically similar text segments from distinct SCUs:
Morris Dees co-founded the SPLC:
1.
2. Dees and the SPLC seek to destroy hate groups through multimillion dollar civil suits that go after assets of groups and their leaders
3. Dees and the SPLC have fought to break the organizations by legal action resulting in severe financial penalties
4. The SPLC participates in tracking down hate groups and publicizing their activities in its Intelligence Report
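Distractor selection for the ranking test might look like the following sketch. The text does not say which lexical similarity measure is used, so Jaccard overlap of word sets is an assumption here, as are the function names and the toy candidate snippets (which are not taken from the dataset).

```python
import re


def word_set(text):
    """Lowercase alphanumeric tokens as a set (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity; a stand-in for 'lexical similarity'."""
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def pick_distractors(question, question_scu, candidates, k=3):
    """candidates: list of (scu_id, text). Returns the k most similar
    texts that belong to a different SCU than the question."""
    scored = [(text, jaccard(question, text))
              for scu_id, text in candidates if scu_id != question_scu]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in scored[:k]]


# Toy data, invented for illustration only.
candidates = [
    ("scu1", "the company was fined by the court"),  # same SCU: excluded
    ("scu2", "the court cleared the company"),
    ("scu3", "the company appealed to the court"),
    ("scu4", "the company hired new staff"),
    ("scu5", "rain fell all night"),
]
distractors = pick_distractors("the court fined the company", "scu1", candidates)
```

Picking the most similar segments from other SCUs keeps the question from being solvable by word overlap alone.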

Baseline Embeddings Evaluation
In order to verify that this task is indeed sensitive to differences in word embeddings, we evaluated 8 different word embeddings on the task as a baseline: Random, None (one-hot embedding), word2vec (Mikolov et al., 2013) trained on Google News, two word2vec models trained on Wikipedia with different window sizes, word2vec trained with Wikipedia dependencies, GloVe (Pennington et al., 2014) and Open IE based embeddings (Stanovsky et al., 2015). For all of the embeddings, we measured sentence similarity as the cosine similarity of the normalized sums of all the word vectors in the sentences.
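The sentence similarity measure can be sketched as follows. This is a pure-Python illustration with a toy two-dimensional embedding table; whitespace tokenization, skipping out-of-vocabulary words, and all function names are assumptions made for the example.

```python
import math


def sentence_vector(sentence, embeddings, dim):
    """Sum the word vectors of a sentence and normalize to unit length."""
    total = [0.0] * dim
    for word in sentence.lower().split():
        vec = embeddings.get(word)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are skipped
        total = [t + v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm else total


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sentence_similarity(s1, s2, embeddings, dim):
    return cosine(sentence_vector(s1, embeddings, dim),
                  sentence_vector(s2, embeddings, dim))


# Toy embedding table standing in for word2vec or GloVe vectors.
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
same = sentence_similarity("a b", "b a", toy, 2)
orthogonal = sentence_similarity("a", "b", toy, 2)
partial = sentence_similarity("a", "a b", toy, 2)
```

Because the representation is a sum, word order is ignored: "a b" and "b a" receive identical vectors.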
For the binary decision test, we evaluated each embedding by finding the similarity threshold for deciding whether a pair is a paraphrase that maximizes the F-measure of the embedding's decisions (the threshold was trained on 10% of the dataset and tested on the rest). For the rank test, we computed the percentage of questions where the correct answer achieved the highest similarity score and the MRR measure (Craswell, 2009).
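Both evaluation procedures can be sketched as below. This is a simplified illustration, not the authors' evaluation script: the function names and toy scores are assumptions, candidate thresholds are taken from the observed scores, and the 10%/90% train/test split is left to the caller.

```python
def f_measure(scores, labels, threshold):
    """F1 when every pair scoring at or above threshold is called a paraphrase."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Pick the observed score that maximizes F1 on the training portion."""
    return max(sorted(set(scores)), key=lambda t: f_measure(scores, labels, t))


def rank_metrics(questions):
    """questions: list of (correct_score, [distractor_scores]).
    Returns (top-1 accuracy, mean reciprocal rank)."""
    hits, reciprocal_sum = 0, 0.0
    for correct, distractors in questions:
        # Rank of the correct answer among all options (1 = best).
        rank = 1 + sum(1 for d in distractors if d > correct)
        hits += rank == 1
        reciprocal_sum += 1.0 / rank
    return hits / len(questions), reciprocal_sum / len(questions)


# Toy similarity scores and gold labels, invented for illustration.
train_scores = [0.9, 0.8, 0.6, 0.4, 0.3]
train_labels = [True, True, False, True, False]
threshold = best_threshold(train_scores, train_labels)

accuracy, mrr = rank_metrics([(0.9, [0.5, 0.4, 0.3]), (0.5, [0.7, 0.6, 0.2])])
```

In a full run, `best_threshold` would be fitted on the 10% training portion and `f_measure` then reported on the remaining pairs with that fixed threshold.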

Task Significance
The task of identifying paraphrases specifically extracted from pyramids can aid NLP sub-fields such as:
• Automatic Summarization: Identifying paraphrases can help both in identifying salient information in multi-document summarization and in evaluation, by recreating pyramid files and applying them to automatic summaries;
• Textual Entailment: Paraphrases are bidirectional entailments;
• Sentence Simplification: SCUs capture the central elements of meaning in observable long sentences;
• Expansion of Annotated Datasets: Given an annotated dataset (e.g., aligned translations), unannotated sentences could be annotated the same as their paraphrases.

Conclusion
We presented a method of using pyramid files to generate paraphrase detection tasks. The suggested task has proven challenging for the tested methods, as indicated by the relatively low F-measures reported in Table 1 for most models. Our method can be applied to any pyramid-annotated dataset, so the reported numbers could increase by using other datasets such as TAC 2008, 2009, 2010, 2011 and 2014. We believe that the improvement that this task can provide to downstream applications is a good incentive for further research.