Assessing the Difficulty of Classifying ConceptNet Relations in a Multi-Label Classification Setting

Commonsense knowledge relations are crucial for advanced NLU tasks. We examine the learnability of such relations as represented in ConceptNet, taking into account their specific properties, which can make relation classification difficult: a given concept pair can be linked by multiple relation types, and relations can have multi-word arguments of diverse semantic types. We explore a neural open world multi-label classification approach that focuses on the evaluation of classification accuracy for individual relations. Based on an in-depth study of the specific properties of the ConceptNet resource, we investigate the impact of different relation representations and model variations. Our analysis reveals that the complexity of argument types and relation ambiguity are the most important challenges to address. We design a customized evaluation method to address the incompleteness of the resource that can be expanded in future work.


Introduction
Commonsense knowledge can be seen as a large amount of diverse but simple facts about the world, people and everyday life, e.g., Cars are used to travel or Birds can fly (Liebermann, 2008). Commonsense knowledge obtained from CONCEPTNET is increasingly used in advanced NLU tasks, such as textual entailment (Weissenborn et al., 2018), reading comprehension (Mihaylov and Frank, 2018), machine comprehension (Wang and Li, 2018; González et al., 2018), question answering (Ostermann et al., 2018) or dialogue modeling (Young et al., 2018), as well as in vision applications (Le et al., 2013). Some of these approaches exploit embeddings learned from CONCEPTNET, others select specific relations from it, depending on the application. This paper proposes a multi-label neural approach for classifying CONCEPTNET relations, where the task is to predict one (or several) commonsense relations from a given set of relation types that hold between two given concepts from CONCEPTNET. In future work, the predicted relations can then be used for enriching CONCEPTNET by adding relations between concepts which are not yet linked in the network.
We design the task of multi-label neural relation classification to account for specific properties of CONCEPTNET: (i) CONCEPTNET's relation inventory is not designed to be disjoint: a given pair of relation arguments (in CONCEPTNET: concepts) may be connected by more than one relation type, e.g. ⟨people, DESIRES/CAPABLEOF, eating in groups⟩ or ⟨reading, USEDFOR/CAUSES, education⟩. This places relations in close vicinity in semantic space, making relation prediction a hard task.
(ii) Concepts are often multi-word expressions of different phrase types (e.g., noun or verb phrases), posing a challenge for argument representation. Relation slots may also be filled by different semantic types: e.g., the 2nd argument of DESIRES can be an entity or an event. Such heterogeneous signatures increase classification difficulty.
(iii) Like any knowledge resource, CONCEPTNET is incomplete, which means that relations between concepts are missing. The incompleteness of the resource poses serious evaluation problems, since assumed negative instances may in fact be positive.
To tackle these issues we perform a thorough experimental examination of the learnability of CONCEPTNET relations in a controlled multi-label classification setting. Our contributions are: (i) a cleaned and balanced data subset covering the 14 most frequent relation types from the core part of CONCEPTNET that serves as a basis for assessing relation-specific classification performance, which we extend to an open-world classification setup; (ii) a neural multi-label classification approach with various model options for the representation of relations and their (multi-word) arguments, including relation-specific label prediction thresholds; (iii) an in-depth analysis of specific properties of the CONCEPTNET relation inventory, from which we derive hypotheses that we evaluate in classification experiments; (iv) a detailed analysis of results that confirms a great number of our hypotheses regarding specific classification challenges; (v) finally, an assessment of the amount of potential evaluation discrepancies due to the incompleteness of the resource in a small-scale annotation experiment.

Semantic Relation Classification
Semantic relation classification covers a wide range of methods and learning paradigms for representing relation instances (see Nastase et al. 2013 for an overview). Typically, the data is presented to the learner as independent instances, with or without a sentential context. Relation classification models represent the meaning of the arguments (attributional features) and, if context is available, also the relation (relational features).
Recently, deep learning has strongly influenced semantic relation learning. Word embeddings can provide attributional features for a variety of learning frameworks (Attia et al., 2016; Vylomova et al., 2016), and the sentential context, either in its entirety or restricted to the phrase expressing the relation (structured through grammatical relations, or unstructured), can be modeled through a variety of neural architectures, such as CNNs (Tan et al., 2018; Ren et al., 2018) or RNN variations (Zhang et al., 2018).
Speer et al. (2008) introduce AnalogySpace, a representation of concepts and relations in CONCEPTNET built by factorizing a matrix with concepts on one axis and their features or properties (according to CONCEPTNET) on the other. This low-dimensional representation allows for finding analogous facts, generalizations, new categories and justifications for classifications based on known properties. While this representation allows for recomputing the confidence of existing facts, the focus was not on classifying or trying to learn specific relations represented in the resource. Li et al. (2016) apply matrix factorization to CONCEPTNET with the aim of resource extension and report 91% accuracy in a binary evaluation (i.e., verifying the correctness of an (unlabeled) link between concepts). Saito et al. (2018) expand this work by combining the knowledge base completion task (distinguishing true relation triples consisting of arbitrary phrases from false ones) with the task of knowledge generation (finding the second entity for a given first entity and a given relation). They enhance the link prediction model of Li et al. with a model that learns the two tasks (knowledge base completion and knowledge generation) jointly, and outperform the completion accuracy results of Li et al. by up to 3pp.
Many NLU tasks rely on specific relations from CONCEPTNET (Le et al., 2013; Shudo et al., 2016). It is thus important to assess classification accuracy for individual relation types.

CONCEPTNET Dataset
The Open Mind Common Sense (OMCS) project (Speer et al., 2008) started the acquisition of common sense knowledge from contributions over the web, leading to CONCEPTNET, which now also includes expert-created resources (such as WordNet), automatically extracted knowledge, and knowledge obtained through games with a purpose (Speer et al., 2008). The current version, CONCEPTNET 5.6, comprises 37 relations, some of which are commonly used in other resources like WordNet (e.g. ISA, PARTOF), while most others are more specific to capturing commonsense information and as such are particular to CONCEPTNET (e.g. HASPREREQUISITE, MOTIVATEDBYGOAL). With very few exceptions (e.g., SYNONYM or ANTONYM), CONCEPTNET relations are asymmetric. The English version consists of 1.9 million concepts and 1.1 million links to other databases, such as DBpedia. In our work we focus on the English OMCS subpart (CN-OMCS).

Task Definition
Given a pair of concepts $c_i, c_j$, where $c_i$ and $c_j$ may be multi-word expressions, the task is to automatically predict one (or several, see §3.3.2 for the multi-label aspect of the task) commonsense relations $r_t$ from a given set of CONCEPTNET relation types $R_{CN}$ that hold between $c_i$ and $c_j$. Relations are presented to the classifier without textual context, and thus a crucial aspect is using a representation that properly captures the semantics of the arguments.

Designing a Relation Classification System for CONCEPTNET
CONCEPTNET has very specific properties in terms of the relations included, the type of the arguments, coverage and completeness. A successful relation classification system should take these into account. Given the heterogeneity of sources of CONCEPTNET, we focus on its core part, in particular CN-OMCS-CLN, a subset selected from CN-OMCS that includes ca. 180K triples from 36 relation types, restricted to known vocabulary from the GoogleNews Corpus (see §4.1 for further details).

Representing the Inputs
Word embeddings have been shown to provide useful semantic representations, capturing lexical properties of words and relative positioning in semantic space (Mikolov et al., 2013b), which has been exploited for semantic relation classification (Vylomova et al., 2016;Attia et al., 2016).
Following this work, we represent a pair of concepts $c_i, c_j$ whose relation we want to classify through their embeddings $v_{c_i}$ and $v_{c_j}$. These argument representations can be combined in several ways, e.g., by subtraction or by concatenation (the ConcatVec representation used in our experiments).
One of the issues in using such representations for CONCEPTNET is the fact that most CONCEPTNET concepts are multi-word expressions (1.93 words on average, cf. Table 2). We experiment with two ways of producing a representation for a multi-word concept: (i) computing a centroid vector, as the normalized sum over the embedding vectors of all words in the expression (as the baseline); (ii) encoding the expression using an RNN, e.g. a (Bi)LSTM, which encodes sequences of various lengths into one fixed-length vector. We hypothesize that using an RNN yields better concept representations than centroid vectors.
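The centroid baseline and the combination of the two argument vectors can be illustrated with a minimal sketch. It assumes a toy embedding table `emb` and takes "normalized" to mean scaling the summed vector to unit length (dividing by the number of words would be the other natural reading). The function names `centroid`, `concat_vec` and `diff_vec` are hypothetical and chosen only for this illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def centroid(phrase: str, emb: Dict[str, Vector]) -> Vector:
    """Sum the word vectors of a (multi-word) concept and scale to unit length.

    Unit-length scaling is an assumption about what "normalized sum" means.
    Words missing from the embedding table are skipped.
    """
    words = [w for w in phrase.split() if w in emb]
    if not words:
        raise KeyError(f"no known word in concept: {phrase!r}")
    dim = len(emb[words[0]])
    total = [sum(emb[w][k] for w in words) for k in range(dim)]
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm > 0 else total


def concat_vec(v_i: Sequence[float], v_j: Sequence[float]) -> Vector:
    """Combine two argument vectors by concatenation."""
    return list(v_i) + list(v_j)


def diff_vec(v_i: Sequence[float], v_j: Sequence[float]) -> Vector:
    """Combine two argument vectors by element-wise subtraction."""
    return [a - b for a, b in zip(v_i, v_j)]
```

With 300-dimensional embeddings, concatenation yields a 600-dimensional classifier input, while subtraction keeps the input at 300 dimensions.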

Constructing a Multi-Label Classifier
An important characteristic of CONCEPTNET is that more than one relation can hold for a given pair of concepts. On average this applies to 5.37% of instances per relation (cf. Table 2). Consequently, we cast our classification task as a multi-label classification problem.
Model architecture. Fig. 1 illustrates the model architecture. Input concept pairs are encoded (as centroids or using RNNs), and the representations are combined and presented to a feed-forward neural network (FFNN) with one hidden layer to non-linearly separate the relation classes.
In single-label classification, the probability for a class is not independent of the other class probabilities. Hence, softmax is typically used at the output layer. By contrast, in multi-label classification, we want to model class predictions individually. The sigmoid models the probability of each label as an independent Bernoulli distribution, computed from the input vector $x$ via the weight matrices $W_h$ and $W_o$. This translates to an independent binary neural network for each label, resulting in a set of isolated binary classification tasks (cf. Sterbak 2017; He and Xia 2018).
We use binary cross entropy as our loss function. The architecture allows us to tune pre-trained embeddings for our relation learning task. The hypotheses arising from the multi-label setting of CONCEPTNET are: (i) discrimination of overlapping classes is more difficult than in the usual relation classification task with disjoint relations (e.g. Hendrickx et al. 2010); (ii) given the incompleteness of CONCEPTNET, the classification performance may be erroneously assessed due to missing relations in the data. We will estimate the effect of this phenomenon in a small-scale annotation experiment.
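The sigmoid output layer and the binary cross-entropy loss described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the exact model: the hidden-layer activation (ReLU here) and the omission of bias terms are assumptions, since the text only names the input vector x and the weight matrices W_h and W_o. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def matvec(W: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product, one output value per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(x: List[float], W_h: Matrix, W_o: Matrix) -> List[float]:
    """One hidden layer followed by an independent sigmoid per relation label."""
    hidden = [max(0.0, h) for h in matvec(W_h, x)]  # ReLU is an assumption
    return [sigmoid(z) for z in matvec(W_o, hidden)]


def binary_cross_entropy(probs: List[float], targets: List[int]) -> float:
    """Mean binary cross entropy over all labels of one instance."""
    eps = 1e-12  # guards against log(0)
    losses = [
        -(t * math.log(p + eps) + (1 - t) * math.log(1.0 - p + eps))
        for p, t in zip(probs, targets)
    ]
    return sum(losses) / len(losses)
```

Because each label has its own sigmoid, the predicted probabilities need not sum to one, which is what allows several relations to be predicted for the same concept pair.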

Relation Classification Difficulty and Relation-specific Thresholding
CONCEPTNET relation types show great divergence with respect to their arguments' semantic and phrase types, as shown in Table 2. About half of the relation arguments are nominal, entity-denoting concepts, with location as a specific entity type; the other half are event-type arguments. Several relations take different semantic types in a single argument position (e.g., HASSUBEVENT, CAUSES). Diversity of semantic types and phrase types, especially within a single argument position, is a challenge for relation learning. We expect classification to be more difficult on relations with a mixture of argument types. Because of this, different thresholds may be needed for predicting different relation types. We adopt a customized multi-label prediction setup where we tune thresholds separately for each relation type. We expect that individually tuned, relation-specific thresholds improve overall classification performance.
The OTHER class comprises all relations from CN-OMCS-CLN with fewer than 2000 instances.

CN-OMCS-14
Based on CN-OMCS-CLN we construct our experimental dataset CN-OMCS-14, a balanced dataset still large enough for applying neural methods. We include all relations from CN-OMCS-CLN with more than 2000 instances, and downsample to the size of the least frequent class (2586 instances per relation). To select the "best" instances for testing and tuning, we sort the relation triples by their confidence score, as provided by CONCEPTNET. Inspired by Li et al. (2016), we select the 10% (258) most confident tuples per relation for testing, the next 10% for development, and the remaining 80% (2068) for training, cf. Table 3.
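The confidence-based split can be sketched as follows for the triples of a single relation. The tuple layout and the function name `split_by_confidence` are assumptions made for illustration; integer rounding here may differ by a few instances from the counts reported in Table 3.

```python
from typing import List, Tuple

# (concept_i, relation, concept_j, confidence score); the layout is an assumption
ScoredTriple = Tuple[str, str, str, float]


def split_by_confidence(
    triples: List[ScoredTriple], test_pct: int = 10, dev_pct: int = 10
) -> Tuple[List[ScoredTriple], List[ScoredTriple], List[ScoredTriple]]:
    """Return (train, dev, test) for one relation.

    The most confident triples go to test, the next most confident to dev,
    and the remainder to train.
    """
    ranked = sorted(triples, key=lambda t: t[3], reverse=True)
    n_test = len(ranked) * test_pct // 100
    n_dev = len(ranked) * dev_pct // 100
    test = ranked[:n_test]
    dev = ranked[n_test:n_test + n_dev]
    train = ranked[n_test + n_dev:]
    return train, dev, test
```

Applying the function separately to each relation keeps the three splits balanced across relation types.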
Closed vs. Open World Setting. Learning to classify relations in a closed world setting is limited to the relation types present in the data. We want to design a system that is also able to detect whether a relation other than the provided ones holds between two concepts, or whether no relation holds at all. We thus extend the data set with two classes: OTHER, containing concept pairs that do stand in a relation, yet not any of those present in the target relation set; and RANDOM, containing concept pairs that are not related.
Instances for the OTHER class consist of a sample of triples from the 22 low-frequency relations that were not included in CN-OMCS-14. These include MADEOF, DBPEDIA, RELATEDTO, HASLASTSUBEVENT, EXTERNALURL, INSTANCEOF, NOTCAPABLEOF, and NOTHASPROPERTY. Instances for the RANDOM class are generated similarly to Vylomova et al. (2016): 50% of instances are opposite pairs, obtained by switching the order of concept pairs within the same relation; 50% are corrupt pairs, obtained by replacing one concept in a connected pair with a random concept from the same relation. Using corrupt pairs ensures that our model does not simply learn properties of the word classes, but instead is forced to encode relation instances. RANDOM and OTHER are the same size as the individual target relations.
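The construction of the RANDOM class can be sketched as below. Several details are assumptions made for illustration: replacement concepts are drawn from both argument positions of the same relation, the replaced position is chosen at random, and candidates that coincide with a known positive pair are rejected. The function name `make_random_class` is hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (concept_i, relation, concept_j)
Pair = Tuple[str, str]


def make_random_class(triples: List[Triple], n: int, seed: int = 0) -> List[Pair]:
    """Build n unrelated pairs: half opposite pairs, half corrupt pairs."""
    rng = random.Random(seed)
    linked: Set[Pair] = {(a, b) for a, _, b in triples}
    pool: Dict[str, List[str]] = {}
    for a, rel, b in triples:
        pool.setdefault(rel, []).extend([a, b])

    def opposite() -> Pair:
        # switch the order of the two concepts of a connected pair
        a, _, b = rng.choice(triples)
        return (b, a)

    def corrupt() -> Pair:
        # replace one concept with a random concept from the same relation
        a, rel, b = rng.choice(triples)
        c = rng.choice(pool[rel])
        return (c, b) if rng.random() < 0.5 else (a, c)

    negatives: List[Pair] = []
    for make, quota in ((opposite, n // 2), (corrupt, n - n // 2)):
        made, attempts = 0, 0
        # bounded retries so the loop always terminates on tiny inputs
        while made < quota and attempts < 100 * max(quota, 1):
            attempts += 1
            pair = make()
            if pair not in linked and pair[0] != pair[1]:
                negatives.append(pair)
                made += 1
    return negatives
```

Rejecting known positive pairs does not remove all noise: since the resource is incomplete, a generated pair may still be related without being linked.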

Experiment Setup
Experiments and Datasets. We experiment with two open world settings: in OW-1 we add only the RANDOM class to CN-OMCS-14, to investigate whether the classifier is able to differentiate related from non-related concept pairs. In OW-2 we add both OTHER and RANDOM to CN-OMCS-14, to investigate whether the classifier can also learn to predict that an unknown relation exists or that no relation holds. We also report results of the closed world setting, where we exclude OTHER and RANDOM. Each dataset is split into training (80%), dev (10%) and test (10%) (cf. Table 3).
Evaluation. We evaluate model performance in terms of F1 score for each relation. We report weighted F1 scores averaged over 5 runs.
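Per-relation F1 in a multi-label setting, and its weighted average, can be computed as in the following sketch. It assumes that gold and predicted labels are given as one set of relation names per instance, and that "weighted" means weighting each relation's F1 by its number of gold instances; both are assumptions, as are the function names.

```python
from typing import Dict, Iterable, List, Set


def per_relation_f1(
    gold: List[Set[str]], pred: List[Set[str]], relations: Iterable[str]
) -> Dict[str, float]:
    """F1 per relation, treating each relation as its own binary task."""
    scores: Dict[str, float] = {}
    for rel in relations:
        tp = sum(1 for g, p in zip(gold, pred) if rel in g and rel in p)
        fp = sum(1 for g, p in zip(gold, pred) if rel not in g and rel in p)
        fn = sum(1 for g, p in zip(gold, pred) if rel in g and rel not in p)
        denom = 2 * tp + fp + fn
        scores[rel] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(gold: List[Set[str]], scores: Dict[str, float]) -> float:
    """Average of per-relation F1, weighted by gold support (an assumption)."""
    support = {rel: sum(1 for g in gold if rel in g) for rel in scores}
    total = sum(support.values())
    return sum(scores[r] * support[r] for r in scores) / total if total else 0.0
```

On a balanced dataset such as CN-OMCS-14 the supports are nearly equal, so the weighted average stays close to the plain macro average.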

Model Parameters
Embeddings. Based on preliminary experiments, we use 300-dim. skip-gram word2vec embeddings trained on part of the Google News dataset (100 billion words, Mikolov et al. 2013a). Embeddings are tuned during training.
Concept representation. Concepts are encoded using centroid vectors or an RNN (cf. §3.3.1).
Relation representation. We use the ConcatVec representation (§3.3.1), which we determined to be the most useful in preliminary experiments. Label prediction thresholds are tuned in two ways: (i) a global threshold for all relations and (ii) separately tuned thresholds for each relation. Hyperparameter settings were determined on the dev set. For the encoding of multi-word terms we use bi-LSTMs with one hidden layer and a cell size of 350 (these perform better than GRUs and LSTMs).
Table 4 summarizes the results in open (OW) and closed world (CW) settings.

Results
The overall best performing model across all settings is FFNN+RNN (as opposed to FFNN with centroid argument representations) with relation-specific label prediction thresholds (as opposed to one global threshold value). In the OW setting we achieve overall F1-scores of 0.68 (OW-1) and 0.65 (OW-2). The CW setting leads to best results with 0.71 F1. The models improve by 4pp (OW-1), 7pp (OW-2) and

Analysis
In this section we discuss the hypotheses derived from our analysis of CONCEPTNET properties (§3.3) and, based on that, determine which approaches and representations are best suited for CONCEPTNET-based commonsense relation classification. To aid the discussion we produced Figures 2, 3, 4, and Table 5. Fig. 2 plots differences in performance for each relation for the settings we wish to compare: concept encoding using centroids (FFNN) vs. RNNs (FFNN+RNN) (blue), global vs. relation-specific prediction threshold (orange), and OW-1 vs. CW setting (grey). Fig. 3 visualizes ambiguous (that is, co-occurring) relations in our dataset in a symmetric heatmap. Fig. 4 displays interrelations between concept characteristics and model performance, based on our best performing system (FFNN+RNN+ind. tuned thresholds, OW-1). To observe correlations between classification performance and different measurable characteristics of the data in Fig. 4, we scaled the following values for each relation to a common range of 0 to 15: the percentage of multi-word terms (cf. Table 2) (grey), the average number of words per concept (cf. Table 2) (yellow), the percentage of relation instances with multiple labels (cf. Table 2) (blue), best model performance on OW-1 (FFNN+RNN with individually tuned thresholds, cf. Table 4) (red) and the corresponding relation-specific thresholds (green). Table 5 gives relation statistics on CN-OMCS-14 (as opposed to Table 2, which gives statistics for the complete version CN-OMCS-CLN).

Representing Multi-word Concepts
We hypothesized that there is a correlation between the length of the arguments and model performance when encoding arguments with an RNN. We find no such correlation: the relations that benefit the most from using an RNN (Fig. 2: blue and Fig. 4: yellow, red) are not those with the longest arguments (cf. Table 5). Instead we find that the relations HASPROPERTY, HASA, ATLOCATION, and RECEIVESACTION benefit most from concept encoding with an RNN, followed by CAPABLEOF, ISA, DESIRES, CAUSESDESIRE and HAS(FIRST)SUBEVENT with lower margins. The missing correlation is confirmed by a very low Pearson coefficient of only 0.05 between (1) the improvements we get from enhancing FFNN with RNN (i.e., the delta of F1 scores for FFNN vs. FFNN+RNN, both with individually tuned thresholds) and (2) the average number of words per concept (cf. Table 2).
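The Pearson coefficient used in this comparison is computed over one value pair per relation (F1 delta and average argument length). A minimal, self-contained sketch of the coefficient follows; the function name `pearson` is chosen for illustration, and the inputs in the usage note are toy values, not the figures from Tables 2 and 4.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

For example, `pearson([1, 2, 3], [2, 4, 6])` is 1 up to floating-point rounding, whereas a value near 0, such as the 0.05 reported above, indicates no linear relationship.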

Threshold Tuning & Model Performance
We hypothesized that relations would benefit from having individually tuned thresholds. Overall, the models with RNN encoding of concepts benefit more from threshold tuning than the basic FFNN. Regarding single relations (Fig. 2, orange bars), HASSUBEVENT and the open world class RANDOM benefit the most from individual threshold tuning (both with relatively low F1 scores). The individual thresholds vary considerably across relations (Fig. 4). To test whether relations that are harder to classify benefit the most from tuning the threshold (as the performance of HASSUBEVENT and RANDOM seems to indicate), we compute the correlation between (1) the difference of model performance with and without individually tuned thresholds (as described above) and (2) general model performance (F1 scores of FFNN+RNN with global threshold, OW-1). The Pearson correlation of -0.67 indicates that relations with lower general performance indeed tend to show higher improvements. This is also reflected in Fig. 4 (green and red), which also shows that for relations with higher F1 scores, higher thresholds tend to work better. Relation classification models applied to CONCEPTNET should therefore use higher thresholds for relations with high classification confidence (high F1 scores), while lower thresholds are recommended for relations with low performance.
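Relation-specific threshold tuning can be sketched as a per-relation grid search on the dev set. The grid (0.05 to 0.95 in steps of 0.05), the selection criterion (dev F1, ties resolved in favor of the lowest threshold) and all names are assumptions for illustration; the text does not specify how the thresholds were searched.

```python
from typing import Dict, List, Sequence


def f1_at(scores: Sequence[float], gold: Sequence[int], threshold: float) -> float:
    """Binary F1 when predicting a label whenever score >= threshold."""
    tp = sum(1 for s, g in zip(scores, gold) if s >= threshold and g == 1)
    fp = sum(1 for s, g in zip(scores, gold) if s >= threshold and g == 0)
    fn = sum(1 for s, g in zip(scores, gold) if s < threshold and g == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(
    dev_scores: Dict[str, List[float]], dev_gold: Dict[str, List[int]]
) -> Dict[str, float]:
    """Pick, for each relation, the grid threshold with the best dev F1."""
    grid = [i / 100 for i in range(5, 100, 5)]  # assumed search grid
    tuned: Dict[str, float] = {}
    for rel, scores in dev_scores.items():
        best_thr, best_f1 = grid[0], -1.0
        for thr in grid:
            f1 = f1_at(scores, dev_gold[rel], thr)
            if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
                best_thr, best_f1 = thr, f1
        tuned[rel] = best_thr
    return tuned
```

A global threshold corresponds to running the same search once over the pooled scores of all relations instead of once per relation.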

Closed vs. Open World Setting
Most relations perform better in the CW setting (cf. grey bars in Fig. 2), especially MOTIVATEDBYGOAL, HASFIRSTSUBEVENT, ATLOCATION, and ISA. In contrast, DESIRES and HASSUBEVENT perform better in an open world setting (Fig. 2). Comparing the two settings OW-1 and OW-2 (Table 4, not displayed in Fig. 2), we find that only the relations MOTIVATEDBYGOAL, HASFIRSTSUBEVENT and CAUSES perform better in OW-2 than in OW-1. All other relations benefit from the OW-1 setting, especially ATLOCATION and the open world class RANDOM.

Relation Heterogeneity
We hypothesized that relations that are more heterogeneous with respect to the type of their arguments (whether semantic or phrasal) will be harder to learn. Comparing the degree of diversity of semantic or phrase types (Table 2) with model performance confirms this hypothesis. The relations that perform best have semantically or "phrasally" consistent arguments, whereas (apart from DESIRES) relation types that feature different types of entities or phrases in the same argument position tend to achieve low F1 scores.

Relation Ambiguity
We hypothesized that relations that have multi-labeled instances (instances to which more than one label, i.e., relation, applies) will be more difficult to learn.

Favorable vs. Unfavorable Properties of CONCEPTNET
We have investigated several variations of a relation classification model, each variation designed to mitigate some particular feature of CONCEPTNET relations. Analysis of these models has shown what impact each has on model performance, and which issues we could address and which we could not. One of the issues was the length of the arguments. Using an RNN that can encode such sequences of various lengths did not lead to consistent improvements for relations with long arguments. The classifier still performs best on relations with short arguments. However, we do obtain overall better results with RNN encoding of arguments.
Another issue was the heterogeneity of relations in terms of the semantic or phrasal type of their arguments. The analysis has shown that indeed such relations suffer during classification, but individual tuning of the threshold partly helps.
One of the most striking challenges posed by the CONCEPTNET relation inventory remains the observed relation ambiguity. Here, our analysis matches our hypothesis that relations with many multi-relation instances are harder to classify than relations whose instances rarely co-occur with other relation labels. We further find that individual threshold tuning helps improve classification performance, especially for relations which are harder to classify and are characterized by low F1 scores. These are again exactly the relations which usually show other challenging properties, including relation ambiguity, long arguments, and inner-relation diversity regarding concept and phrasal types.

Impact of Missing Edges
The ambiguity of CONCEPTNET relations combined with the incompleteness of the resource poses challenges for evaluating the performance of a model. A classification decision marked as a false positive could in fact be valid. This issue penalizes single-label and multi-label classifiers differently: a single-label classifier is not allowed to predict multiple labels, while a multi-label classifier will learn from potentially false negatives and, depending on the distribution of the data, could learn to over-predict. To investigate to what degree this issue impacts the results of our model, we manually annotate a small sample of the test data and compare it to the gold standard.
Annotation Experiment. We performed a small annotation experiment in which we manually check a subset of 200 instances from our test set for missing edges. Our sample consists of concept pairs which are related by one of the 14 relations in CN-OMCS-14, and we want to investigate whether another, additional relation holds between the two concepts. We therefore present the concept pair and a randomly sampled relation from our relation set (excluding the gold label) to two annotators without showing the gold label. We ask them whether the relation applies or not, and they are also allowed to assign Not Sure as a third option. The annotators agreed in 178 of 200 instances (91%). The annotations are merged by a third expert annotator. In the final gold version 18 (9%) of the instances are labelled as applicable (e.g. ⟨cook dinner, HASPREREQUISITE, turn on stove⟩), while 176 (88%) do not apply according to the annotators (e.g. ⟨coffee, HASSUBEVENT, popular drink⟩). According to this small annotated subset, we conclude that a lower bound of almost 9% of the predictions could be penalized due to incompleteness of the CONCEPTNET resource or our extracted subset, respectively. Of course this has to be verified in an annotation experiment of a larger scale.

Conclusion
In this paper we investigated several variations of a multi-label neural relation classifier for CONCEPTNET relations. Each variation was designed to account for specific properties of CONCEPTNET. An in-depth study revealed specific characteristics that can make CONCEPTNET relation classification difficult: several distinct relation types may hold for a given concept pair; some relations have heterogeneous arguments; and many concepts are expressed through multi-word terms. In light of these challenges posed by the specific properties of CONCEPTNET, we design a multi-label classification model which uses RNNs for representing multi-word arguments and individually tuned thresholds for improving model performance, especially for relations with unfavorable properties such as long arguments, relation ambiguity and inner-relation diversity. Our best performing model achieves F1 scores of 0.68 in an open world and 0.71 in a closed world setting. The analysis of the results in different configurations shows that the design decisions driven by multi-word representations and threshold tuning improved the overall classification performance, and that our model is able to tackle specific properties of CONCEPTNET. Yet, some challenges could not be resolved and need to be addressed in future work. In particular this concerns relation ambiguity and heterogeneity of relation arguments. The observed co-occurrences of relations could be exploited to target relation ambiguity by building a meta classifier which learns which relations can or cannot occur together.
In future work, we plan to use the multi-label classification system proposed in this paper for enriching CONCEPTNET by predicting relations between concepts which are not yet linked in the network. Our investigation can further inform and caution the community on both the usefulness and the flaws of this resource and guide future work on using CONCEPTNET.