Content selection as semantic-based ontology exploration


Introduction
Natural Language (NL) based access to information contained in Knowledge Bases (KBs) has been tackled by approaches following different paradigms. One strand of research deals with the task of ontology-based data access and data exploration (Franconi et al., 2010; Franconi et al., 2011). This type of approach relies on two pillar components. The first is an ontology describing the underlying domain, together with a set of reasoning-based query construction operations. This component guides the lay user in the formulation of a KB query by proposing alternatives for query expansion. The second is a Natural Language Generation (NLG) system that hides the details of the formal query language from the user. Our ultimate goal is the automatic creation of a corpus of KB queries for the development and evaluation of NLG systems.
The task we address is the following. Given an ontology K, automatically select from K descriptions q which yield sensible user queries. The difficulty lies in the fact that ontologies often omit important disjointness axioms and adequate domain or range restrictions (Rector et al., 2004; Poveda-Villalón et al., 2012). For instance, the toy ontology shown in Figure 1 licenses the meaningless query in (1). This happens because there is no disjointness axiom between the Song and Rectangular concepts and/or because the domain of the marriedTo relation is not restricted to persons.
(1) Who are the rectangular songs married to a person?

[Figure 1: toy ontology with the concepts Song and Rectangular and the description ∃marriedTo.Person]
In this work, we explore to what extent vector space models can help to improve the coherence of automatically formulated KB queries. These models are learnt from large corpora and provide general, shared common-sense semantic knowledge. Such models have been proposed for related tasks. For example, Freitas et al. (2014) propose a distributional semantic approach for the exploration of paths in a knowledge graph, and Corman et al. (2015) use distributional semantics for spotting common sense inconsistencies in large KBs.
Our approach draws on the fact that natural language is used to name elements, i.e. concepts and relations, in ontologies (Mellish and Sun, 2006). Hence, the idea is to exploit lexical semantics to detect incoherent query expansions during the automatic query formulation process. Following ideas from Kruszewski and Baroni (2015) and Van de Cruys (2014), our approach uses word vector representations as lexical semantic resources. We train two semantic "compatibility" models, namely DISCOMP and DRCOMP. The first models incompatibility between concepts in a candidate query expansion and the second models incompatibility between concepts and candidate properties.

Query language and operations
Following Guagliardo (2009), a KB query is a labelled tree where edges are labelled with a relation name and nodes are labelled with a variable and a non-empty set of concept names from the ontology.
The query construction process starts from the initial KB query with a single node. Four operations are available for iteratively refining the KB query (see Guagliardo (2009) for their formal definition): add, for the addition of new concepts and relations; substitution, for replacing a portion of the query with a more general, more specific or compatible concept; deletion, for removing a selected part of the query; and weaken, for making the query as general as possible. A sequence of query formulation steps illustrating these operations is shown in Figure 2.
I am looking for something. (Initial request)
... for a new car. (Substitution)
... for a new car sold by a car dealer. (Add relation)
... for a new car, a coupé sold by a car dealer. (Add concept)
... for a new car sold by a car dealer. (Deletion)
... for a car sold by a car dealer. (Weaken)
Figure 2: Query formulation sequence.
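The labelled query tree and the refinement steps of Figure 2 can be sketched as a small data structure. This is only an illustration under our own assumptions: the class QueryNode, its method names, and the identifiers Thing, NewCar, Coupe and soldBy are hypothetical and not taken from Guagliardo (2009). The weaken step is approximated by a substitution with a hand-picked more general concept, whereas the actual operation relies on a reasoner.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    """A query tree node: a variable and a non-empty set of concept names."""
    variable: str
    concepts: set
    # Outgoing edges stored as (relation name, child node) pairs.
    edges: list = field(default_factory=list)

    def __post_init__(self):
        if not self.concepts:
            raise ValueError("a node needs at least one concept name")

    def add_concept(self, concept):
        """Add one more concept label to this node."""
        self.concepts.add(concept)

    def add_relation(self, relation, child):
        """Attach a new edge labelled with a relation name; return the child."""
        self.edges.append((relation, child))
        return child

    def substitute(self, old, new):
        """Replace one concept label by another one."""
        self.concepts.remove(old)
        self.concepts.add(new)

    def delete_concept(self, concept):
        """Remove a concept label, keeping the label set non-empty."""
        if self.concepts == {concept}:
            raise ValueError("cannot delete the only concept of a node")
        self.concepts.remove(concept)

    def render(self):
        """Return a compact textual view of the subtree rooted here."""
        label = self.variable + ":{" + ", ".join(sorted(self.concepts)) + "}"
        parts = [rel + " -> " + child.render() for rel, child in self.edges]
        return label if not parts else label + " [" + "; ".join(parts) + "]"


# Replaying the sequence of Figure 2 with hypothetical identifiers.
query = QueryNode("x0", {"Thing"})             # I am looking for something.
query.substitute("Thing", "NewCar")            # ... for a new car.
query.add_relation("soldBy", QueryNode("x1", {"CarDealer"}))
query.add_concept("Coupe")                     # ... a coupé sold by a car dealer.
snapshot = query.render()
query.delete_concept("Coupe")                  # Deletion
query.substitute("NewCar", "Car")              # Weaken, approximated by hand
final = query.render()
```

Keeping the concept labels in a set mirrors the requirement that each node carries a non-empty set of concept names, and the guard in delete_concept preserves that invariant.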

Extracting KB queries
To automatically select queries from a KB, we randomise the application of the add operation. That is, starting from a query tree with one node, the operation is iteratively applied at a randomly selected node up to a maximum number of steps. The add operation divides into add compatible concepts and add compatible relations (cf. Guagliardo, 2009). Given a node n labelled with concept s, the first adds another concept label s′ (e.g., Car and New in the example query in Section 2), and the second attaches a relation and its range (p, o) to the node (e.g., forming the triple (CarDealer, locatedIn, City)).
The add operation picks a concept (relation) from a list of candidate concepts (relations) to expand the current query. These candidates are computed using reasoning operations on the query built so far and the underlying ontology. As discussed in Section 1, the lack of axioms in the ontology enables the inference and selection of incoherent candidate content such as (Song, Rectangular) and (Song, marriedTo, Person).
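The randomised extraction procedure described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function extract_query, the dictionary-based node representation and the two lookup tables are hypothetical, and the candidate functions stand in for the reasoning operations that compute candidates from the query and the ontology.

```python
import random


def extract_query(root_concept, candidate_concepts, candidate_relations,
                  max_steps=4, rng=None):
    """Grow a query tree by applying randomly chosen add operations."""
    rng = rng or random.Random()
    # A node holds its concept labels and outgoing (relation, child index) edges.
    nodes = [{"concepts": [root_concept], "edges": []}]
    for _ in range(max_steps):
        node = nodes[rng.randrange(len(nodes))]     # randomly selected node
        options = [("concept", c) for c in candidate_concepts(node["concepts"])]
        options += [("relation", po) for po in candidate_relations(node["concepts"])]
        if not options:
            continue                                # nothing to add at this node
        kind, choice = rng.choice(options)
        if kind == "concept":                       # add compatible concept
            node["concepts"].append(choice)
        else:                                       # add compatible relation
            relation, range_concept = choice
            nodes.append({"concepts": [range_concept], "edges": []})
            node["edges"].append((relation, len(nodes) - 1))
    return nodes


# Hypothetical candidate tables standing in for the reasoner.
TOY_CONCEPTS = {"Car": ["New", "Coupe"]}
TOY_RELATIONS = {"Car": [("soldBy", "CarDealer")],
                 "CarDealer": [("locatedIn", "City")]}


def toy_concepts(labels):
    return [c for c in TOY_CONCEPTS.get(labels[0], []) if c not in labels]


def toy_relations(labels):
    return TOY_RELATIONS.get(labels[0], [])


nodes = extract_query("Car", toy_concepts, toy_relations,
                      max_steps=5, rng=random.Random(0))
```

Passing the candidate functions as parameters keeps the random walk independent of how candidates are computed, so a compatibility filter could be applied inside them without changing the loop.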
To filter out incoherent suggestions made by the add operations, we propose the following models.
Concept compatibility model (DISCOMP). As explained by Kruszewski and Baroni (2015), distributional semantic representations provide models of semantic relatedness and have shown good performance in many lexical semantic tasks (Baroni et al., 2014). While they model semantic relatedness (for instance, car and tyre are related concepts), they fail to capture the notion of semantic compatibility: there is nothing that can be both a car and a tyre at the same time. Kruszewski and Baroni therefore propose a Neural Network (NN) model that learns semantic characteristics of concepts and classifies them as (in)compatible. We adapt their best performing model, namely 2L-interaction, to our task of detecting whether two ontology concepts (s, s′) are incompatible.
Selectional compatibility model (DRCOMP). Selectional constraints concern the semantic type imposed by predicates on the arguments they take. For instance, the predicate sell constrains its subjects to be of a type such as Organisation or Person. Thus, it would be acceptable to say A car dealer sells new cars while it would be rare to say A tyre sells new cars.
Our idea is to apply the notion of selectional preferences to ontology relations and the concepts they can be combined with. That is, we assess whether a relation p to be attached to a node labelled with concept s, i.e. forming the triple (s, p, o), is a plausible candidate. Along the lines of Van de Cruys (2014), we train a NN model to predict (in)compatible subject concept-relation (s, p) pairs.

Experimental setup
Both models use the best performing word vectors available at http://clic.cimec.unitn.it/composes/semantic-vectors.html (Baroni et al., 2014).
DISCOMP dataset. This dataset consists of compatible and incompatible example pairs. We extract them in the following way. We combine a set of manually annotated pairs with a set of automatically extracted ones.
As manually annotated examples, we use the dataset of Kruszewski and Baroni (2015) plus additional examples extracted from the results of different runs of the add operation, which were annotated manually. These provide 7764 examples.
In addition, we automatically extracted compatible and incompatible pairs of concepts from existing ontologies. For incompatible pairs (5273 examples), we extracted disjointness axioms from 52 ontologies crawled from the web and from YAGO (Suchanek et al., 2008). The compatible pairs (57968 examples) were extracted from YAGO using the class membership of individuals. We assume that if an instance a is defined as a member of the class A and of the class B at the same time, then both classes are compatible.
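The class membership heuristic can be illustrated with a short sketch. The input format (a mapping from each instance to the classes it belongs to), the function name and the toy instances are assumptions made for this example; they do not reflect how YAGO is actually stored or queried.

```python
from itertools import combinations


def compatible_pairs(instance_classes):
    """Return class pairs that share at least one instance.

    instance_classes maps an instance name to the classes it is a member of.
    Pairs are returned in alphabetical order so that (A, B) and (B, A) coincide.
    """
    pairs = set()
    for classes in instance_classes.values():
        for a, b in combinations(sorted(set(classes)), 2):
            pairs.add((a, b))
    return pairs


# Toy membership data, invented for illustration only.
toy_membership = {
    "instance_1": ["Person", "Artist"],
    "instance_2": ["Artist", "Person", "Agent"],
    "instance_3": ["City"],
}
pairs = compatible_pairs(toy_membership)
```

An instance with a single class, such as instance_3, contributes no pair, since compatibility is only asserted between two classes that are observed together.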
The final dataset contains 71918 instances. We take 80% for training and the rest for testing.
DRCOMP dataset. We automatically extract subject-predicate pairs (s, p) from two different sources, namely nsubj dependencies from parsed sentences and domain restrictions in ontologies.
For the extraction of pairs from text, we use the ukWaCKy corpus (Baroni et al., 2009) and the Matoll corpus (Walter et al., 2013); we call the resulting subsets of pairs ukWaCKy.SP and WikiDBP.SP, respectively. Both corpora contain dependency parsed sentences. In addition, the Matoll corpus provides annotations linking entities mentioned in the text with DBPedia entities. For the first SP dataset, we take the head and dependent participating in nsubj dependency relations as training pairs (s, p). For the second SP dataset, we use the DBPedia annotations associated with nsubj dependents. That is, we create (s, p) pairs where the s component, rather than being the entity mention itself, is the DBPedia concept to which this entity belongs. For instance, given the dependency nsubj(Stan Kenton, winning), because Stan Kenton is annotated with the DBPedia entity http://dbpedia.org/resource/Stan_Kenton and this entity is defined to be of type Person and Artist, among others, we can create (s, p) pairs such as (person, winning) and (artist, winning).
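The construction of the two text-based subsets can be sketched as below. The representation of a dependency as a (subject mention, predicate) tuple, the entity_types lookup and both function names are assumptions of this example; the actual corpora use their own annotation formats.

```python
def sp_pairs_from_words(nsubj_deps):
    """First SP dataset: pair the subject word with the predicate word."""
    return [(subj.lower(), pred.lower()) for subj, pred in nsubj_deps]


def sp_pairs_from_entities(nsubj_deps, entity_types):
    """Second SP dataset: replace the subject mention by its concept types.

    entity_types maps an annotated entity mention to the concepts it belongs to.
    Subjects without an annotation yield no pair.
    """
    pairs = []
    for subj, pred in nsubj_deps:
        for concept in entity_types.get(subj, []):
            pairs.append((concept.lower(), pred.lower()))
    return pairs


# The example from the text: nsubj(Stan Kenton, winning).
deps = [("Stan Kenton", "winning")]
types = {"Stan Kenton": ["Person", "Artist"]}
entity_pairs = sp_pairs_from_entities(deps, types)
```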
For the pairs based on ontology definitions, we use the 52 ontologies crawled from the web. We call this subset of pairs KB.SP.
For training the model, we generate negative instances by corrupting the extracted data. For each (s, p) pair in the dataset, we generate an (s′, p) pair where s′ is not seen occurring with p in the training corpus. The final dataset contains 610522 training instances (30796 from ukWaCKy.SP, 571564 from WikiDBP.SP and 8162 from KB.SP). We take out 600 cases, 300 from ukWaCKy.SP and 300 from KB.SP, for testing the model on specific text and KB pairs.
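The corruption step can be sketched as follows. This is a minimal illustration under our own assumptions: the function name is hypothetical, replacement subjects are drawn uniformly from the subjects observed in the data, and pairs for which no unseen subject exists are skipped. The sampling distribution actually used is not specified here.

```python
import random


def corrupt_pairs(pairs, rng=None):
    """For each (s, p), build (s', p) such that s' never occurs with p."""
    rng = rng or random.Random()
    seen = set(pairs)
    subjects = sorted({s for s, _ in pairs})
    negatives = []
    for _, p in pairs:
        # Subjects that were never observed together with predicate p.
        unseen = [c for c in subjects if (c, p) not in seen]
        if unseen:
            negatives.append((rng.choice(unseen), p))
    return negatives


# Toy positive pairs, invented for illustration only.
positives = [("person", "sell"), ("organisation", "sell"), ("tyre", "burst")]
negatives = corrupt_pairs(positives, rng=random.Random(0))
```

Here both sell pairs can only be corrupted with tyre, while the burst pair is corrupted with either person or organisation.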

Evaluation
We evaluate each model separately on a held-out test set. Table 1 shows the results for the DISCOMP model. Table 2 shows the results obtained when evaluating the DRCOMP model. Both models perform well in the intrinsic evaluation.
Test dataset                     Accuracy
(Kruszewski and Baroni, 2015)    0.72
DISCOMP                          0.98
Table 1: Results reported by Kruszewski and Baroni (2015) and results obtained with the DISCOMP model.
We also assess the performance of the models on the task of meaningful query generation. We run the random query generation process over 5 ontologies of different domains, namely cars, travel, wines, conferences and human disabilities. At each query expansion operation, we apply the models to the sets of candidate concepts or relations. We compare the DISCOMP and DRCOMP models with a baseline cosine similarity (COS) score. For this score we use GloVe (Pennington et al., 2014) word embeddings and simple addition for composing multiword concept and relation names. We use a threshold of 0.3 that was determined empirically⁶. During the query generation process, we registered the candidate sets as well as the predictions of the models. In total, we collected 67 candidate sets corresponding to the add compatible relation query extension and 39 to the add compatible concept operation. The candidate sets were manually annotated with human (in)compatibility judgements. We use these sets as a gold standard to compute precision, recall, F-measure and specificity on the task of detecting incompatible candidates, as well as the accuracy of the models. Figure 3 shows one example for each of the query expansion operations, the annotated candidates and the predictions made by each of the models (only incompatibles are shown). Table 3 shows the results. Unsurprisingly, given the quite strong similarity threshold used for the COS baseline, we observe that it has good precision at spotting incompatible candidates though quite low recall. In contrast, as shown by the F-measure values, the compatibility models seem to achieve a better compromise between these measures. We include the specificity measure as an indication of the ability of the models to avoid false alarms, that is, to avoid predicting a candidate as incompatible when it is not.
Table 3: Precision (P), recall (R), F-measure (F), specificity (S) and accuracy (A) results for the DISCOMP, DRCOMP and COS models on the add compatible relation (addRelation) and add compatible concept (addCompatible) query expansion operations.
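The COS baseline and the evaluation measures can be made concrete with a small sketch. Several points are assumptions of this example rather than details given above: a candidate is flagged as incompatible when its cosine similarity falls below the threshold, the two-dimensional vectors are made up (real GloVe vectors have many more dimensions), and all function names are hypothetical. The positive class is "incompatible", in line with the task of detecting incompatible candidates.

```python
import math


def compose(name, vectors):
    """Additive composition: sum the vectors of the words in a name."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in name.lower().split():
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return total


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cos_incompatible(a, b, vectors, threshold=0.3):
    """Assumed decision rule: similarity below the threshold means incompatible."""
    return cosine(compose(a, vectors), compose(b, vectors)) < threshold


def ratio(num, den):
    return num / den if den else 0.0


def detection_scores(gold, predicted):
    """Scores for boolean lists in which True means 'incompatible'."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)
    precision, recall = ratio(tp, tp + fp), ratio(tp, tp + fn)
    return {
        "P": precision,
        "R": recall,
        "F": ratio(2 * precision * recall, precision + recall),
        "S": ratio(tn, tn + fp),          # specificity: avoided false alarms
        "A": ratio(tp + tn, len(gold)),
    }


# Made-up toy vectors, for illustration only.
toy_vectors = {"new": [1.0, 0.1], "car": [1.0, 0.0], "dealer": [0.9, 0.2],
               "song": [0.0, 1.0], "rectangular": [1.0, 0.0]}
flag_song = cos_incompatible("Song", "Rectangular", toy_vectors)
flag_car = cos_incompatible("new car", "car dealer", toy_vectors)
scores = detection_scores([True, True, False, False, False],
                          [True, False, False, False, True])
```

Under this decision rule, raising the threshold flags more candidates as incompatible, which trades precision for recall.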

Conclusions and future work
We applied two compatibility models to get around the lack of disjointness and domain restrictions in ontologies and to facilitate the (semi-)automatic generation of a large set of sensible user KB queries. These compatibility models were previously proposed for two semantic tasks: one for term compatibility (Kruszewski and Baroni, 2015) and the other for selectional preference modelling (Van de Cruys, 2014). We automatically created training datasets from several text and knowledge base resources with the intention of providing a more adequate training signal for our specific task.
⁶ [...] and the COS baseline with 0.5. Setting this threshold is really a trade-off between precision and recall. The use of the 0.5 threshold resulted in the rejection of most of the candidates, including compatible ones.
As future work, we aim to run a larger task-based extrinsic evaluation of these models. We plan to generate a set of KB queries, verbalise them using techniques proposed in  and ask for human judgements about the meaningfulness of the generated queries. In this larger evaluation, we plan to test the models on larger general purpose KBs such as DBPedia.
Further work to improve on the current results could explore the adaptation of the models to specific domain vocabularies and the use of better composition models for multiword concepts and relations.