Apples to Apples: Learning Semantics of Common Entities Through a Novel Comprehension Task

Understanding common entities and their attributes is a primary requirement for any system that comprehends natural language. In order to enable learning about common entities, we introduce a novel machine comprehension task, GuessTwo: given a short paragraph comparing different aspects of two real-world semantically-similar entities, a system should guess what those entities are. Accomplishing this task requires deep language understanding which enables inference, connecting each comparison paragraph to different levels of knowledge about world entities and their attributes. So far we have crowdsourced a dataset of more than 14K comparison paragraphs comparing entities from a variety of categories such as fruits and animals. We have designed two schemes for evaluation: open-ended, and binary-choice prediction. For benchmarking further progress in the task, we have collected a set of paragraphs as the test set on which humans can accomplish the task with an accuracy of 94.2% on open-ended prediction. We have implemented various models for tackling the task, ranging from semantic-driven to neural models. The semantic-driven approach outperforms the neural models; however, the results indicate that the task is very challenging across the models.


Introduction
In the past few years, there has been great progress on core NLP tasks (e.g., parsing and part of speech tagging) which has renewed interest in primary language learning tasks which require text understanding and reasoning, such as machine comprehension (Schoenick et al., 2016; Hermann et al., 2015; Rajpurkar et al., 2016; Mostafazadeh et al., 2016). Our question is how far we have come in learning basic concepts of the world through language comprehension. If we look at the large body of work on extracting knowledge from unstructured corpora, we will see that they often lack some very basic pieces of information. For example, let us focus on the basic concept of apple, the fruit. What do the state-of-the-art systems and resources know about an apple? None of the state-of-the-art knowledge bases (Speer and Havasi, 2012; Carlson et al., 2010) include much precise information about the fact that apples have an edible skin, vary from sweet to sour, are round, and are roughly the size of a fist. Moreover, there is no clear approach on how to extract such information, if any, from trained word embeddings. This paper focuses on how we can automatically learn about various attributes of such generic entities in the world. A key observation motivating this work is that we can learn more detail about objects when they are compared to other similar objects. When we compare things we often contrast, that is, we count their similarities along with their dissimilarities. This results in covering the primary attributes and aspects of objects. As humans, we tend to recall and mention the difference between things (say green skin vs. red skin in apples) as opposed to absolute measures (say the existence of skin).
Interestingly, there is evidence that human knowledge is structured by semantic similarity and the relations among objects are defined by their relative perceptual and conceptual properties, such as their form, function, behavior, and environment (Collins and Loftus, 1975; Tversky and Gati, 1978; Cree and Mcrae, 2003). Our idea is to leverage comparison as a way of naturally learning about common world concepts and their specific attributes.
Comparison, where we name the similarities and differences between things, is a unique cognitive ability in humans 1 which requires memorizing facts, experiencing things and integration of concepts of the world (Hazlitt, 1933). It is clear that developing AI systems that are capable of comprehending comparison is crucial. In this paper, in order to enable learning through comparison, we introduce a new language comprehension task which requires understanding different attributes of basic entities that are being compared.
The contributions of this paper are as follows: (1) To equip learning about common entities through comparison comprehension, we have crowdsourced a dataset of more than 14K comparison paragraphs comparing entities from nine broad categories (Section 2). This resource will be expanded over time and will be released to the public. (2) We introduce a novel task called GuessTwo, in which given a short paragraph comparing two entities, a system should guess what the two things are (Section 3). To make systematic benchmarking on the task possible, we vet a collection of comparison paragraphs to obtain a test set on which humans perform with an accuracy of 94.2%. (3) We present a host of neural approaches and a novel semantic-driven model for tackling the GuessTwo task (Sections 4, 5). Our experiments show that the semantic approach outperforms the neural models. The results strongly suggest that closing the gap between system and human performance requires richer semantic processing (Section 6). We hope that this work will establish a new base for a machine comprehension test that requires systems to go beyond information extraction and towards performing basic reasoning.

Data Collection
To enable learning about common entities, we aimed to create a dataset which meets the following goals:
1. The dataset should be a collection of high-quality documents which are rich in comparing and contrasting entities using their various attributes and aspects.
2. The comparisons in the dataset should involve everyday non-technical concepts, making their comprehension easy and commonsense for a human.
1 It has been suggested (Hazlitt, 1933) that children under seven years old cannot name differences between simple things such as peach and apple. This further shows that the ability for comparison develops at a later age and is cognitively complex.
After many experiments with scraping existing Web resources, we decided to crowdsource the comparison paragraphs using Amazon Mechanical Turk (Mturk). We prompt the crowd workers as follows: "Your task is to compare two given items in one simple language paragraph so that a knowledgeable person who reads it can guess what the two things are". The workers were instructed to compare only the major and well-known aspects of the two entities. We also asked them to use X and Y for anonymously referring to the two entities. Table 1 shows three examples of our crowdsourced comparison paragraphs. As these examples show, the paragraphs are very contentful and rich in comparison which meets our initial goals in the dataset creation.
Entity Pair Selection. The choice of the two entities which should be compared against each other plays a key role in the quality of the collected dataset. It is evident that naturally, we compare two things which are semantically similar, yet have some dissimilarities, such as jam and jelly. Given the goals of our task, we experimented with concrete nouns which share a common taxonomy class. We choose semantic classes which have at least five well-known entities. So far, we have covered nine broad categories as shown in Figure 2, with 21 subcategories shown in Figure 3. We use Wikipedia item categories and the WordNet (Miller, 1995) ontology for identifying entities from each subcategory. Then, we choose the most common entities by looking up their frequency on Google Web 1T N-grams. We manually inspected the frequency-filtered list to make sure that the entities are rather easy to describe without getting technical. Given the list of entities, we paired each entity with at most five and at least three other entities from the same subcategory. We also include inter-subcategory comparison for a handful of entities at the boundaries. Figure 1 illustrates our entity pair matching process with an example on subcategories 'apple' and 'citrus'.

Table 1: Examples from the GuessTwo comprehension dataset. Also provided with the dataset is the subcategory and the broad category of the entities which are listed after the entity names in this Table.

Comparison Paragraph: Both X and Y are fruits and a variety of apples. X and Y are generally similar in size. X are dark red in color when ripe, while Y are a bright green color. X is sweeter and softer than Y in taste and texture, sometimes starchy. Y are tart and somewhat stringy. Y is often used in cooking, whereas X is not.
Entity X: Red Delicious / Apple / Fruit
Entity Y: Granny Smith / Apple / Fruit

Comparison Paragraph: The X and Y are two types of vehicles. X is a smaller vehicle than Y. The X has two wheels while Y has none. The X travels on roadways and smooth surfaces, whereas Y is capable of flying. Only one or two people are able to ride on X at once, while Y can carry more people.
Entity X: Motor Vehicle / Vehicle
Entity Y: Helicopter / Aircraft / Vehicle

Comparison Paragraph: X and Y are both types of world cuisines. X incorporates a lot of pasta dishes and sauces, with basil, tomato, and cheese being major ingredients. Y consists of many curries and stir fried dishes, with coconut and lemongrass being used often. Y is generally spicier and more aromatic than X. X is a European cuisine, while Y is an Asian cuisine.
Entity X: Italian Cuisine / Cuisine / Cuisine
Entity Y: Thai Cuisine / Cuisine / Cuisine

Figure 1: An example illustrating the entity pair matching process.
Data Quality Control. Our task of free-form writing is trickier than many other tasks such as tagging on Mturk. To instruct the non-expert workers, we designed a qualification test on Mturk in which the workers had to judge whether or not a given paragraph is acceptable according to our criteria. We used three carefully selected paragraphs to be a part of the qualification test. Moreover, to further ensure the quality of the submissions, one of our team members qualitatively browsed through the submissions and gave the workers detailed feedback before approving their paragraphs.
For each pair of entities, we collected eight comparison paragraphs from different workers. Given that different workers have different perspectives on what the major aspects to be compared are, collecting multiple paragraphs helps further enrich our dataset. We constrained the paragraphs to be at least 250 characters and at most 850 characters. Table 2 shows the basic statistics of our dataset. In this Table, we also included the median number of adjectives (including comparatives) per paragraph as a measure of descriptiveness of the comparison paragraphs. As a point of reference, the median number of adjectives in a random Wikipedia paragraph of the same length is 5. Given the quality control we have in place, our data collection is going slowly. So far we have collected 14,142 paragraphs; however, we are aiming to expand the resource over time.
Test Set Creation. In order to enable benchmarking on the task, we assessed the quality of a random sample of GuessTwo paragraphs as follows: we show the paragraph to three human workers on Mturk and ask them to guess what the two things are. Then, we choose 520 paragraphs for which all three workers have made exactly correct guesses for both entities. The test set will also be expanded along with the further data collection.
We divided the rest of the GuessTwo dataset into training and validation sets, with a 90%/10% split. To ensure that the test set requires some level of basic reasoning, our training set does not share any exact entity pairs with the validation or test set. This further forces systems to learn about entities indirectly by processing across paragraphs. For instance, as shown in Figure 4, at test time, a system should be able to guess a comparison involving the entities blood orange vs. lemon by having seen comparisons of blood orange vs. tangerine and tangerine vs. lemon.
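The pair-disjoint split described above can be illustrated with a minimal sketch. The example schema (dicts with keys "x", "y", "text"), the function names, and the choice to drop non-test paragraphs whose entity pair is reserved for the test set are all assumptions made for illustration; the paper does not specify how such paragraphs are handled.

```python
import random
from collections import defaultdict


def pair_key(example):
    """Order-independent key for the two compared entities."""
    return frozenset((example["x"], example["y"]))


def split_by_entity_pair(examples, test_pairs, val_fraction=0.1, seed=0):
    """Split examples so that no training pair appears in validation or test.

    examples: list of dicts with keys "x", "y", "text" (hypothetical schema).
    test_pairs: set of frozensets naming the entity pairs used by the test set.
    """
    groups = defaultdict(list)
    for ex in examples:
        key = pair_key(ex)
        if key not in test_pairs:  # assumption: pairs reserved for test are dropped
            groups[key].append(ex)

    # Sort before shuffling so the result is reproducible for a given seed.
    keys = sorted(groups, key=lambda k: sorted(k))
    random.Random(seed).shuffle(keys)

    total = sum(len(g) for g in groups.values())
    train, val = [], []
    for key in keys:
        # Assign whole pair groups to validation until it reaches its share.
        target = val if len(val) < val_fraction * total else train
        target.extend(groups[key])
    return train, val
```

Because whole pair groups are assigned to one side, the validation share is only approximately 10%, but an exact entity pair can never appear on both sides of the split.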

The GuessTwo Task Definition
We define the following two different schemes for the GuessTwo task:
• Open-ended GuessTwo. Given a short paragraph P which compares two entities X and Y, guess what the two entities are. The scope of this prediction is the set of all entities appearing in the training dataset.
• Binary Choice GuessTwo. Given a short paragraph P which compares two entities X and Y, and two nominals $n_1$ and $n_2$, choose 0 if $n_1 = X$ and $n_2 = Y$, choose 1 otherwise.
We speculate that a system which can successfully tackle the GuessTwo task has achieved two major objectives: (1) it has successfully learned the knowledge about entities, stored in any form (e.g., continuous-space representation or symbolic), and (2) it has a basic natural language understanding capability, using which it can comprehend a paragraph and access its knowledge. We predict that our training dataset has enough detailed information about entities for learning the required knowledge for tackling the task. Given the design of our dataset, at test time, a system should perform some level of reasoning to go beyond understanding only one paragraph.

Neural Models
In this Section we present various end-to-end neural models for tackling the task of GuessTwo.
Continuous Bag-of-words Language Model. This model computes the probability of a sequence of consecutive words in context. The premise is that the probability of a paragraph with the correct realization of X and Y should be higher than that of a paragraph with incorrect realizations. In order to compute the probability of a word given a context we use Continuous Bag-of-words (CBOW) (Mikolov et al., 2013a) which models the following conditional probability:
$$p(w \mid C(w); \theta) \quad (1)$$
Here, $C(w)$ is the context of the word $w$ and $\theta$ is the model parameters. Then, the probability of a sequence of words (in a paragraph) is computed as follows:
$$p(w_1, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid C(w_i); \theta) \quad (2)$$
We define context to be a window of five words. We train this model on two datasets: (1) a Wikipedia corpus 5, and (2) the GuessTwo training dataset. We call these models CBOW-Wikipedia and CBOW-GuessTwo respectively. At test time, for open-ended prediction we find the two nominals which maximize the following probability:
$$\operatorname*{arg\,max}_{x, y} \prod_{i=1}^{n} p(w_i \mid C(w_i)_{x,y}; \theta) \quad (3)$$
where $C(w_i)_{x,y}$ indicates the context in which any occurrences of X have been replaced with $x$ and Y's have been replaced with $y$. For binary choice classification, we use the same modeling except that we only consider $x = n_1, y = n_2$ and $x = n_2, y = n_1$.
5 http://mattmahoney.net/dc/text8.zip
Encoder-Decoder Recurrent Neural Net (RNN). This model is a sequence-to-sequence generation model (Sutskever et al., 2014) that maps an input sequence to an output sequence using an encoder-decoder RNN with attention. The encoder RNN processes the comparison paragraph and the decoder generates the first item followed by the second item (Figure 6). The paragraph is encoded into a state vector of size 512. This vector is then set as the initial recurrent state of the decoder. We tune the model parameters on the validation set, where we set the number of layers to 2. The model is trained end-to-end, using Stochastic Gradient Descent with early stopping.
For open-ended prediction, we use beam search with beam-width = 25 and then output the two tokens with the highest probability. For binary choice classification, we use the same model where we set the encoder RNN inputs to the input paragraph tokens, then, we set the input of the decoder RNN once to $[n_1, n_2]$ and next to $[n_2, n_1]$. After running the network forward, we take the probability of the decoder logits and choose the ordering which has the highest probability.
Skip-gram model (Mikolov et al., 2013b) on 100 billion words of Google News. For open-ended prediction, the output of CNN is fed forward and transformed into a 300-dimensional vector. Then, we use a softmax layer to get the probability of each of the possible nominals for X and Y. For binary choice classification, we use the same architecture and settings as above. Additionally, we encode each nominal into a 300-dimensional vector, which then gets concatenated with the paragraph vector. Figure 5c shows this model.
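The substitution-based scoring used by the CBOW language model in this section can be sketched as follows. This is only an illustration under stated assumptions: `toy_word_logprob` is a hypothetical stand-in for a trained CBOW model, the window is taken to mean five words on each side, and all function names are invented for this example.

```python
import math
from itertools import permutations


def substitute(tokens, x, y):
    """Replace the placeholder tokens X and Y with candidate nominals."""
    return [x if t == "X" else y if t == "Y" else t for t in tokens]


def paragraph_logprob(tokens, word_logprob, window=5):
    """Sum log p(w_i | C(w_i)) over the paragraph.

    Assumption: the context is `window` words on each side of w_i.
    """
    total = 0.0
    for i, w in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        total += word_logprob(w, context)
    return total


def guess_open_ended(tokens, candidates, word_logprob):
    """Return the ordered pair (x, y) whose substitution scores highest."""
    return max(
        permutations(candidates, 2),
        key=lambda p: paragraph_logprob(substitute(tokens, *p), word_logprob),
    )


def guess_binary(tokens, n1, n2, word_logprob):
    """Return 0 if (X, Y) = (n1, n2) scores at least as high as the swap, else 1."""
    s01 = paragraph_logprob(substitute(tokens, n1, n2), word_logprob)
    s10 = paragraph_logprob(substitute(tokens, n2, n1), word_logprob)
    return 0 if s01 >= s10 else 1


def toy_word_logprob(word, context):
    """Hypothetical scorer: rewards hand-picked word associations."""
    assoc = {("lemon", "sour"), ("apple", "sweet")}
    hits = sum((word, c) in assoc or (c, word) in assoc for c in context)
    return math.log((1 + hits) / 10.0)


tokens = "X is sweet while Y is sour".split()
print(guess_open_ended(tokens, ["apple", "lemon"], toy_word_logprob))
```

Summing log-probabilities instead of multiplying probabilities gives the same ranking of candidate pairs while avoiding numerical underflow on longer paragraphs.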

Semantic-driven Model
In this Section we present a semantic-driven approach which models the comparison paragraph using semantic features and is capable of performing basic reasoning across paragraphs.

Representing Paragraphs
The question is, given a comparison paragraph, what is the best representation which can enable further reasoning? The comparison paragraphs often have complex syntactic and semantic structures, which might be challenging for many off-the-shelf NLP tools to process. For instance, consider the sentence X is much sweeter in taste than Y. Although a dependency parser provides a lot of information regarding how the individual words relate grammatically, it does not give us any information regarding how Y's sweetness (which is elided from the sentence and is implicit) relates to X's. As another processing technique, if we use the standard information extraction methods for extracting and representing syntactic triplets (argument1, relation, argument2) (Fader et al., 2014), we will extract a triplet such as X is sweeter which shares the same shortcomings. Our approach for better representation of comparison paragraphs starts with a broad-coverage semantic parser (Banarescu et al., 2013; Bos, 2008; Allen et al., 2008). A semantic parser maps an input sentence to its formal meaning representation, operating at the generic natural language level. Here we use the TRIPS 7 (Allen et al., 2008) broad-coverage semantic parser. TRIPS provides a very rich semantic structure; mainly it provides sense disambiguated deep structures augmented with semantic ontology types. Figure 7 shows an example TRIPS semantic parse. In this graph representation, each node specifies a word in bold along with its corresponding ontology type on its left. The edges in the graph are semantic roles 8. As you can see, this semantic parse represents the sentence by decoupling the token 'both' and attributing the property of 'be apple' to both X and Y.
In our comparison paragraphs there are two major types of sentences:
• Sentences with Absolute Information. These sentences contain direct information about the entities, such as X is red or Both X and Y are very sweet. From each absolute sentence, we extract frames which describe the absolute attributes of the corresponding entity. We define a frame to be a subgraph of a semantic parse which involves exactly one entity and all of its semantic roles. Relying on the deep semantic features offered by the semantic parser, we perform negation propagation 9 and sequence decoupling, among other features. For example, given a sentence which has a sequence, as the one depicted in Figure 7, we perform sequence decoupling and extract the two frames [X Be Apple] and [Y Be Apple].
• Sentences with Relative Information. These sentences contain relative information about the two entities, for instance, X is somewhat sweeter than Y. As opposed to the sentences with absolute information, we cannot extract frames from sentences with comparisons directly. Various properties of entities can be associated with an abstract scale, such as 'size' or 'sweetness', on which different entities can be compared. In order to extract such scales and the relative standing of items on them we use the structured prediction model presented in Bakhshandeh et al. (2016), which given a sentence predicts its comparison structures. Figure 8 shows an example predicate-argument structure that is predicted by this model. We use the model pretrained on the annotated corpus of comparison structures (Bakhshandeh et al., 2016). Given a comparison structure such as the one presented in Figure 8, we can extract the information that on the scale of 'sweetness' X is higher than Y.
It is clear that one can build a large knowledge base of such relations by reading large collections of comparison paragraphs. We populate our knowledge base of relative information about entities as follows: First, we predict the comparison structure of each sentence and then extract a binary relation $\prec_s$ which shows the relation on the scale of $s$. Second, for any scale $s$, we apply transitivity on its entities. As shown in equation 4, the binary relation $\prec_s$ is transitive over the set of all entities, $A$:
$$\forall a, b, c \in A:\ (a \prec_s b \wedge b \prec_s c) \Rightarrow a \prec_s c \quad (4)$$
This process, called closure, enables us to do basic reasoning and derives implicit relations on scales from explicit relations.
7 http://trips.ihmc.us/parser/cgi/parse
8 Refer to http://trips.ihmc.us/parser/LFDocumentation.pdf for the full list of semantic roles in TRIPS parser.
9 A common construction which needs negation propagation is Neither X nor Y are ...
The product of this step is a structured knowledge base on entity ordering which we call the ordering lattice. Figure 9 shows an example partial ordering lattice inferred by our model, where the sweetness of Golden Delicious can be compared to Granny Smith through their direct link with Red Delicious.
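The closure step that produces the ordering lattice can be sketched as a plain transitive closure over the relations of one scale. The representation (a set of (lower, higher) pairs) and the function name are assumptions for illustration, and the example orderings below are hypothetical, not read off Figure 9.

```python
from collections import defaultdict


def close_scale(relations):
    """Transitive closure of one scale.

    relations: iterable of (lower, higher) pairs, meaning lower precedes
    higher on the scale. Returns the explicit pairs plus all implied pairs.
    """
    closed = set(relations)
    changed = True
    while changed:
        changed = False
        # Index the current relation by its lower element.
        higher = defaultdict(set)
        for a, b in closed:
            higher[a].add(b)
        # If a precedes b and b precedes c, then a precedes c.
        for a, b in list(closed):
            for c in higher.get(b, ()):
                if (a, c) not in closed:
                    closed.add((a, c))
                    changed = True
    return closed


# Hypothetical explicit sweetness relations extracted from two paragraphs.
sweetness = {
    ("Granny Smith", "Red Delicious"),
    ("Red Delicious", "Golden Delicious"),
}
lattice = close_scale(sweetness)
print(("Granny Smith", "Golden Delicious") in lattice)
```

The loop repeats until no new pair is added, so chains of any length are resolved; for the small per-scale graphs considered here this simple fixed-point iteration is sufficient.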

Modeling
Given a paragraph $P$, we first extract the set of all the absolute information frames for X and Y (as described above), called $F_X(P)$ and $F_Y(P)$. Second, for the sentences with relative information, we extract all the binary relations $\prec_s \in R(P)$ that should hold between X and Y. Then, our objective is to find two realizations for X and Y that maximize the following:
[Equation 5]
In order to compute the $p(x \mid F_X(P))$ and $p(y \mid F_Y(P))$ scores we used a Regularized Gradient Boosting (XGBoost) classifier (Friedman, 2000), which uses a regularized model formulation to limit overfitting. We directly use each frame in the $F_X(P)$ and $F_Y(P)$ sets as the classifier features.
Figure 9: The inferred partial ordering lattice comparing the sweetness of different apples.
We use Integer Linear Programming (ILP) for formulating the constraints as follows: for each relation $r \in R$ on the scale $s$, we look up the scale $s$ in the ordering lattice and make the blacklist $B(P)$ containing each pair of entities which do not satisfy the relation $r$. Our ordering lattice does not have complete information; hence, we make the Open World Assumption and only prune our search space not to include the already observed pairs which violate the relation. Our ILP objective function will be the following:
[Equation 6]
where $N$ is the set of all possible realizations and $b_x$ and $b_y$ are the binary indicator variables, so $b_x = 1$ indicates the realization of $x$ for X.
In the case of open-ended prediction, the maximization presented in Equation 6 is carried out on the set N . In the case of binary choice classification, however, only the two choices of n 1 and n 2 are considered in the maximization.
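The effect of the blacklist on the search can be illustrated with a small sketch. It is not the ILP formulation itself: it enumerates candidate pairs exhaustively instead of calling an ILP solver, and it assumes that the quantity being maximized is the product of the two classifier scores (equivalently, the sum of their logarithms). The function names and the toy probabilities are hypothetical.

```python
import math
from itertools import permutations


def blacklist_for(lattice, x_is_higher=True):
    """Pairs (x, y) already observed to contradict the paragraph's relation.

    lattice: set of (lower, higher) pairs on one scale.
    x_is_higher: True if the paragraph states that X is higher than Y.
    """
    if x_is_higher:
        # Violations: the lattice already says x is lower than y.
        return set(lattice)
    # Violations: the lattice already says x is higher than y.
    return {(hi, lo) for lo, hi in lattice}


def best_pair(p_x, p_y, blacklist, candidates=None):
    """Pick the non-blacklisted (x, y) maximizing log p_x[x] + log p_y[y].

    p_x, p_y: dicts mapping entity name to a positive classifier probability.
    candidates: optional restriction, e.g. [n1, n2] for binary choice.
    """
    names = candidates if candidates is not None else sorted(set(p_x) & set(p_y))
    best, best_score = None, -math.inf
    for x, y in permutations(names, 2):
        if (x, y) in blacklist:  # open world: prune only observed violations
            continue
        score = math.log(p_x[x]) + math.log(p_y[y])
        if score > best_score:
            best, best_score = (x, y), score
    return best


# Toy scores; the paragraph is assumed to state that X is sweeter than Y.
p_x = {"Granny Smith": 0.5, "Red Delicious": 0.4, "Golden Delicious": 0.1}
p_y = {"Granny Smith": 0.3, "Red Delicious": 0.6, "Golden Delicious": 0.1}
lattice = {("Granny Smith", "Red Delicious")}  # Granny Smith is less sweet
print(best_pair(p_x, p_y, blacklist_for(lattice)))
```

Pairs that are absent from the lattice are never pruned, which mirrors the Open World Assumption: missing information is treated as unknown, not as a violation.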

Results
We evaluate all the models presented in Sections 4 and 5 using the following accuracy measure:
$$\text{accuracy} = \frac{\#\,\text{correct predictions of both entities}}{\#\,\text{test cases}} \quad (7)$$
For the open-ended prediction we compute the numerator of the accuracy measure using three different matching methods on both entities: (1) exact match, (2) subcategory match, (3) broad category match.
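The accuracy measure and its three matching methods can be sketched as below. The function name, the data layout (lists of (x, y) tuples), and the `category_of` mapping are assumptions made for this example; passing a subcategory or broad-category mapping switches from exact match to the corresponding coarser match.

```python
def accuracy(predictions, gold, category_of=None):
    """Fraction of test cases where both predicted entities match the gold pair.

    predictions, gold: equal-length lists of (x, y) tuples.
    category_of: optional dict mapping an entity to its subcategory or broad
    category; if given, entities are compared by category, not by name.
    """
    if category_of is None:
        key = lambda e: e
    else:
        key = lambda e: category_of[e]
    correct = sum(
        key(px) == key(gx) and key(py) == key(gy)
        for (px, py), (gx, gy) in zip(predictions, gold)
    )
    return correct / len(gold)


# Hypothetical subcategory mapping and predictions.
subcategory = {
    "Red Delicious": "Apple",
    "Granny Smith": "Apple",
    "Golden Delicious": "Apple",
    "Lemon": "Citrus",
}
gold = [("Red Delicious", "Granny Smith"), ("Lemon", "Granny Smith")]
pred = [("Golden Delicious", "Granny Smith"), ("Lemon", "Granny Smith")]
print(accuracy(pred, gold), accuracy(pred, gold, subcategory))
```

A test case counts as correct only when both entities match, so a prediction that gets one entity right and the other wrong contributes nothing to the numerator.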
As Table 3 shows, the semantic model outperforms all the neural models. Moreover, the ILP constraints have been very effective in directing the system in the correct search space. Among the neural models, the Encoder-Decoder RNN model performs noticeably better than other models when matching the subcategory and broad category. According to the exact-matching, neither of the CBOW models could guess both test entities correctly for any test case. Overall, it is evident that the end-to-end neural models have not been able to generalize well and learn about the attributes of entities across various training paragraphs. This can be partly due to not being trained on a large enough comparison training dataset. The semantic model, however, could outperform the neural models using the same amount of data. To a degree, this is because the semantic model leverages the basic language understanding capabilities offered by the semantic parser.
It is also important to note that our semantic approach is not only capable of binary and open-ended prediction, but it also offers two byproducts that can be used as knowledge in a variety of other tasks: (1) a set of the most important absolute information frames which can be chosen based on feature importance in the classification, (2) the partial ordering lattice of entities. Overall, the results strongly suggest that the GuessTwo task is challenging, with the open-ended scheme being the most challenging. There is a wide gap between human and system performance on this task, which makes it a very promising task for the community to pursue.

Related Work
The task of Machine Comprehension (MC) has gained significant attention over the past few years. The major driver for MC has been the publicly available benchmarking datasets. A variety of MC tasks have been introduced in the community (Richardson et al.; Hermann et al., 2015; Rajpurkar et al., 2016; Hill et al., 2015), in which the system reads a short text and answers a few multiple-choice questions. The reading comprehension involved in these tests ranges from reading a short fictional story (Richardson et al.) to reading a short news article (Hermann et al., 2015). In comparison, in the GuessTwo task the reading comprehension involves reading a short comparison paragraph and one can say the multiple-choice question is the constant What are X and Y?
The CNN/DailyMail dataset consists of more than 100K short news articles with the questions automatically created from the bullet-point summaries of the original article. This dataset uses fill-in-the-blank-style questions such as 'Producer X will not press charges against Jeremy Clarkson' where the system should choose among all the anonymized entities in the corresponding paragraph to fill in X. The Stanford Question Answering (SQuAD) dataset is another recent machine comprehension test with over 500 Wikipedia articles and more than 100,000 crowdsourced questions. The answer to every question in this dataset is a span of text from the corresponding reading passage.
Human accuracy on CNN/DailyMail is estimated to be around 75% (Chen et al., 2016) with the current state-of-the-art at 76.1 on CNN (Sordoni et al., 2016), and 75.8 on DailyMail (Chen et al., 2016). The human F1 score on the SQuAD dataset is reported to be at 86.8%, with the current state-of-the-art achieving 82.9%. Given these statistics, neither of these datasets leaves enough room for further research. Given that in both these tasks the answer to the question is directly found in the provided passage, we argue that the community requires a more challenging MC task which goes beyond matching and needs some level of inference across passages. The GuessTwo task requires basic reasoning and inference across paragraphs for comprehending various aspects of entities relative to one another.
Another interesting task is MCTest (Richardson et al.), which is a reading comprehension test with 660 fictional stories as the passage and four questions per story. The human-level performance on MCTest is estimated to be around 90%, with the state-of-the-art achieving an accuracy of 70% (Wang et al., 2015). MCTest has also proven to be challenging; however, given its very limited training data, further progress on the task has been hindered. Yet another relevant QA task is the Allen AI Science Challenge (Clarke et al., 2010; Schoenick et al., 2016), which is a dataset of multiple-choice questions and answers from a standardized 8th grade science exam. The questions can range from simple fact lookup to complex ones which require extensive world knowledge and commonsense reasoning. This task requires machine reading of a variety of resources such as textbooks and goes beyond reading a couple of passages.

Conclusion
We introduced the novel task of GuessTwo, in which given a short paragraph comparing two common entities, a system should guess what the two entities are. The comparison paragraphs often have complex semantic structures which make this comprehension task demanding. Furthermore, guessing the two entities requires a system to go beyond only understanding one given passage and requires reasoning across paragraphs, which is one of the most under-explored, yet crucial, capabilities of an intelligent agent.
So far, we have crowdsourced a dataset of more than 14K comparison paragraphs comparing entities from nine major categories. For benchmarking the progress, we filter a collection of these paragraphs to create a test set, on which humans perform with an accuracy of 94.2%. For continuing our data collection, we would like to have a targeted entity pair selection where we particularly collect the missing relations in our partial ordering lattice. We believe that this process can help develop more effective systems. For the most recent statistics of the dataset and the best performing systems please check this website.
We presented a host of neural models and a novel semantic-driven approach for tackling the task of GuessTwo. Our experiments show that the semantic approach outperforms the neural models by a large margin. The poor performance of the neural models we experimented with can motivate designing new architectures which are capable of performing basic reasoning across paragraphs. The results strongly suggest that bridging the gap between system and human performance on this task requires models with richer language representation and reasoning capabilities. As future work, we would like to explore the feasibility of marrying our semantic and neural models to exploit the benefits that each of them has to offer.

Acknowledgments
This work was supported in part by Grant W911NF-15-1-0542 with the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO). We would like to thank Linxiuzhi Yang for her help in the data collection and anonymous reviewers for their insightful comments on this work. We especially thank William de Beaumont for his invaluable feedback on this paper. We also thank Steven Piantadosi, Brad Mahon, and Gregory Carlson for their input on cognitive aspects of comparison.