Distributional semantics for ontology verification

As they grow in size, OWL ontologies tend to comprise intuitively incompatible statements, even when they remain logically consistent. This is true in particular of lightweight ontologies, especially the ones which aggregate knowledge from different sources. The article investigates how distributional semantics can help detect and repair violations of common sense in consistent ontologies, based on the identification of consequences which are unlikely to hold if the rest of the ontology does. A score evaluating the plausibility for a consequence to hold with regard to distributional evidence is defined, as well as several methods to decide which statements should preferably be amended or discarded. A conclusive evaluation is also provided, which consists in extending an input ontology with randomly generated statements, before trying to discard them automatically.


Introduction
Ontology learning from texts deals with the automated extraction of knowledge from linguistic evidence. This article investigates a slightly different problem, which is how Natural Language Processing may provide hints for the identification of statements of an input ontology which are unlikely to hold if the rest of it does. As a minimal example, consider the following set ∆ of statements, from DBpedia (Mendes et al., 2012), and assume that ∆ is a subset of a larger set of statements K (for instance DBpedia itself, or some subset of it):

Ex 1. ∆ = {
(1) keyPerson(Caixa Bank, CEO),
(2) keyPerson(BrookField Office Properties, Peter Munk),
(3) occupation(Peter Munk, CEO) }

There is a clear violation of common sense in ∆: the individual CEO must be both a key person of Caixa Bank, and the occupation of another individual (Peter Munk), who is himself a key person of some company. Detecting such cases within (larger) sets of logical statements is of particular interest in OWL, which facilitates the aggregation of knowledge from multiple sources with overlapping signatures, yielding datasets in which several incompatible understandings of the same individual or predicate may coexist. This easily leads to undesired inferences, even when the dataset is logically consistent. But as the example illustrates, the problem may also occur within a single knowledge base, especially if it has been built semi-automatically, and/or results from a collaborative effort.
Another problem of interest consists in deciding which statement(s) should be preferably discarded or amended in order to get rid of the nonsense. In example 1, without further information, it would be intuitively relevant to discard or modify either (1) or (2). Unfortunately though, ∆ alone does not give any indication of which of the two should be preferably discarded. But the whole input ontology K ⊃ ∆ may. To keep the example simple, let us assume that Peter Munk, CEO and occupation do not appear in K \ ∆. Then a reasonable assumption is that the overall understanding of keyPerson within K should be the decisive factor. If it generally ranges over person functions (i.e. if in most instances of the relation according to K, the second argument is a person function), then it is to be understood as "has as a key person someone whose function is", and (2) should be preferably discarded. Alternatively, if keyPerson generally ranges over human beings, then (1) should be preferably discarded.
The article investigates the use of linguistic evidence to solve both of these problems: identifying violations of common sense, and selecting the statement(s) to be preferably amended or discarded. This may be viewed as a small paradigm shift, in that it questions an assumption commonly made in the knowledge extraction literature, namely that manually crafted knowledge strictly prevails over knowledge obtained from linguistic sources. By default, the case of a consistent input ontology K will be studied, but section 6 discusses the application of the approach to an inconsistent K as well.
As a concrete contribution, section 5 evaluates the adaptation of relatively simple techniques borrowed from named entity classification and ontology population, and based on distributional semantics. To illustrate how this works, let us assume that the only other appearance of keyPerson within K is the following OWL statement:

(4) hasRange(keyPerson, Person)

i.e. in FOL:

(4) ∀xy(keyPerson(x, y) → Person(y))

Then K ⊨ ψ_1, with ψ_1 = Person(CEO), and K ⊨ ψ_2, with ψ_2 = Person(Peter Munk). Assume also that there are other instances of Person according to K, and that most of them are actually human beings (like Peter Munk). Then ψ_1 is an undesirable consequence of K, whereas ψ_2 on the other hand reinforces it.
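As a minimal sketch of how ψ_1 and ψ_2 follow from statements (1), (2) and (4), the range axiom can be applied by hand to the role assertions of example 1. The tuple encoding and the function name `class_consequences` are illustrative assumptions; an actual implementation would rely on an OWL reasoner.

```python
# Toy encoding of example 1 (an assumption for illustration, not an OWL API).
role_assertions = [
    ("keyPerson", "Caixa Bank", "CEO"),                           # (1)
    ("keyPerson", "BrookField Office Properties", "Peter Munk"),  # (2)
    ("occupation", "Peter Munk", "CEO"),                          # (3)
]
range_axioms = {"keyPerson": "Person"}                            # (4)


def class_consequences(assertions, ranges):
    """Return the (class, individual) pairs entailed by the range axioms."""
    derived = set()
    for role, _subject, obj in assertions:
        if role in ranges:
            # r(x, y) together with hasRange(r, A) entails A(y).
            derived.add((ranges[role], obj))
    return derived


consequences = class_consequences(role_assertions, range_axioms)
print(sorted(consequences))
```

Both Person(CEO) and Person(Peter Munk) are derived, whereas statement (3) contributes nothing, since no range is declared for occupation in this toy setting.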
Distributional semantics characterizes a word (or possibly a multi-word unit) by some algebraic representation of the linguistic contexts with which it is observed. These representations have already been used for ontology population, for instance by (Tanev and Magnini, 2008), the main intuition being that individuals denoted by linguistic terms with similar contexts tend to instantiate the same classes. The underlying linguistic phenomenon is known as selectional preference, i.e. the fact that some contexts tend to select or rule out certain categories of individuals: e.g. the context "X was born in" tends to select a human being, whereas "X was launched" tends to rule it out. Back to the example, one can expect the similarity between the distributional representation of the term "C.E.O" and those of other terms denoting instances of Person according to K to be relatively low, hindering the plausibility of ψ_1 with regard to K. In other words, ψ_1 should stand as an outlier among consequences of K, and therefore is probably undesirable. Conversely, the similarity between "Peter Munk" and terms denoting other instances of Person should be relatively high. For simplicity, suppose that (1), (2), (3) and (4) are the only 4 statements of K which are candidates for removal. Then in order to give up the belief in ψ_1 while preserving ψ_2, it is necessary to discard (1), and retain (2) and (4). It is also sufficient to discard (1), i.e. discarding (3) as well would result in an unnecessary information loss. So in this case, the evidence provided by distributional semantics should suggest the removal of (1), or at least its modification, which is also intuitively the correct solution.
Section 4 formalizes this approach, by defining a score which estimates the plausibility of some consequences of a subbase Γ of K, given distributional evidence. Section 5 then provides an original evaluation of this strategy, based on the prior extension of a small OWL ontology with randomly generated statements. The approach is evaluated for both problems, i.e. the identification of undesired consequences and of undesired statements. The performance of several forms of distributional representations is also compared. Section 6 discusses immediate applications, in particular for (consistent and inconsistent) ontology debugging. Finally, section 7 considers possible extensions of this framework, as well as their limitations. Section 2 is a brief overview of related works in the fields of ontology learning and debugging, whereas section 3 introduces notational conventions, and lists some preliminary requirements to be met by the input K.

Related works
Ontology learning from texts (Cimiano, 2006; Buitelaar et al., 2005) aims to automatically build or enrich a set of logical statements out of linguistic evidence, and is closely related to the field of information extraction. The work presented here borrows from a subtask called ontology population (which itself borrows from named entity classification), but only when the individuals and concepts of interest are already known (Cimiano and Völker, 2005; Tanev and Magnini, 2008; Giuliano and Gliozzo, 2008), which is not standard. A comparison may also be drawn with the use of linguistic evidence by (Suchanek et al., 2009) for information extraction in the presence of conflicting data.
But the objective of the present work is different, pertaining to ontology debugging, which covers a wide range of techniques, from syntactic verifications (Poveda-Villalón et al., 2012) to anti-patterns detection (Roussey and Zamazal, 2013), both based on common modeling mistakes, or the submission of models (Ferré and Rudolph, 2012;Benevides et al., 2010) or consequences (Pammer, 2010) of the input ontology to the user. As discussed in section 6, the framework depicted here presents an interesting complementarity with debugging techniques developed in the Description Logics community, prototypically based on diagnosis (Friedrich and Shchekotykhin, 2005;Kalyanpur et al., 2006;Qi et al., 2008;Ribeiro and Wassermann, 2009), because they require the prior identification of some undesired consequence of K (be it ⊥). But distributional evidence may also provide a principled way of selecting most relevant diagnoses among a potentially large number of candidates, as well as an alternative to their exhaustive computation, which has been shown costly by (Schlobach, 2005).

Conventions and presuppositions
The prototypical input is a set of statements in OWL DL or OWL 2, although the approach may be generalized to other representation languages. OWL DL and OWL 2 are based on Description Logics (DL), which are themselves decidable fragments of first-order logic (FOL). The OWL notation is preferred to the DL one for readability, and FOL translations are given when not obvious.
An ontology is just understood here as a (finite) set of logical statements. A class will designate a named class in OWL, i.e. a FOL unary predicate, like Person, whereas a named individual, or just individual, designates a constant, like Peter Munk.
The input ontology K must provide English terms denoting some of its named individuals (e.g. the term "Peter Munk"). These terms are prototypically named entities, but may also occasionally be common nouns (or common noun phrases), as shown in example 1 with "C.E.O". There may be multiple terms for the same individual. The approach cannot handle polysemy though, in particular the fact that some individuals of K may have homonyms (within K or not), for instance that the term "JFK" can stand for a politician, an airport or a movie. Ideally, no distributional representation should be built for individuals of K with potential homonyms. Some of them may be identified with simple strategies, like checking the existence of a Wikipedia disambiguation page. By contrast, labels for classes of K (prototypically common nouns or common noun phrases, which are arguably more ambiguous) are never used during the process.

Proposition
Given a subbase Γ of the input ontology K (possibly K itself), the ontology verification strategy presented in the introduction relies on the evaluation of a set Ψ_Γ of consequences of Γ. This section first defines a score sc_Γ(ψ) for each ψ ∈ Ψ_Γ, which intuitively evaluates the plausibility of ψ wrt Γ, provided some distributional representation for each named individual appearing in Ψ_Γ. Then it discusses how this score can be used to select statements of the input ontology K which, according to distributional evidence, should be preferably discarded, or at least amended.

Plausibility of a consequence ψ ∈ Ψ Γ
For the experiments described in section 5, Ψ_Γ is the set of consequences of Γ of the form A(e) or ¬A(e), with e a constant (like CEO) and A a unary predicate (like Person), and for which linguistic occurrences of a term denoting e could be retrieved. Possible extensions of Ψ_Γ with other types of formulas are discussed in section 7.
Let ψ be a formula of Ψ_Γ of the form A(e), e.g. ψ = Person(CEO) or ψ = Person(Peter Munk). Then inst_Γ(A) will designate all instances of A according to Γ for which linguistic occurrences could be retrieved, i.e. inst_Γ(A) = {e' | A(e') ∈ Ψ_Γ}, and inst_Γ(A) \ {e} will be called the support set for A(e). Similarly, inst_Γ(⊤) will designate all named individuals appearing in Ψ_Γ.
Let sim(e_1, e_2) be a measure of similarity between the distributional representations of individuals e_1 and e_2 (prototypically the cosine similarity between some vector representations of the linguistic contexts of e_1 and e_2). Then for each e' ∈ inst_Γ(A) \ {e}, if sim(e, e') is lower than what could be expected if e' was a random individual of inst_Γ(⊤) \ {e} (i.e. not necessarily an instance of A), the hypothesis that A(e) is an outlier within Ψ_Γ will be reinforced.
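For illustration, here is a minimal sketch of the prototypical choice for sim, i.e. the cosine similarity between sparse context vectors. The dictionary representation and the context counts below are invented for the example and are not taken from the evaluation data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {context: weight} dicts."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # no usable context for at least one individual
    return dot / (norm_u * norm_v)


# Hypothetical context weights, for illustration only.
peter_munk = {"X was born in": 3.0, "X founded": 2.0}
thelonious_monk = {"X was born in": 4.0, "X recorded": 1.0}
ceo = {"the X of": 5.0, "X founded": 1.0}

print(cosine(peter_munk, thelonious_monk))  # shares a heavily weighted context
print(cosine(peter_munk, ceo))              # shares only a weakly weighted context
```

With non-negative weights, the value lies in [0, 1], which is convenient for the beta-distributed model used later in this section.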
For instance, in example 1, let ψ = Person(CEO) and Γ = K. Then the support set inst_Γ(A) \ {e} is composed of all other instances of Person according to Γ. For each individual e' of this support set, if sim(CEO, e') is lower than what can be expected for a random individual of K with linguistic occurrences (and different from CEO), then the confidence in Person(CEO) should decline. Conversely, if sim(e, e') is higher than expected, the hypothesis that ψ is in line with Ψ_Γ will be reinforced.
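Here is a cost-efficient and relatively simple method to compute a plausibility score sc_Γ(A(e)). Let S = inst_Γ(A) \ {e} be the support set for A(e), and let X^Γ_{e,|S|} be a random variable which models what the average similarity between e and these individuals can be expected to be. Then the plausibility sc_Γ(A(e)) of A(e) can be defined by:

Definition 4.1.
$$ sc_\Gamma(A(e)) = P\left( X^\Gamma_{e,|S|} \le \frac{\sum_{e' \in S} sim(e, e')}{|S|} \right) $$

In other words, sc_Γ(A(e)) estimates how surprisingly high the similarity between e and the individuals of the support set S is, considering the overall similarity between e and the individuals of Γ.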
For the evaluation described in section 5, the random variable X^Γ_{e,|S|} was assumed to follow a beta distribution Beta(α, β), which intuitively allows taking the size |S| of the support set into account. For instance, if S = {e'}, i.e. |S| = 1, then ceteris paribus a high similarity between e and e' will be less informative than an equally high average similarity between e and all elements of a large S. Stated another way, the lower |S| is, the more uniform the distribution of X^Γ_{e,|S|} should be. This can be obtained by setting X^Γ_{e,|S|} ∼ Beta(m|S| + 1, (1 − m)|S| + 1), where m is the average similarity between e and all other individuals of the signature of Γ, i.e.

$$ m = \frac{\sum_{e' \in \Gamma \setminus \{e\}} sim(e, e')}{|\Gamma| - 1} $$
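The following sketch shows one way of computing such a score with the standard library only. It assumes that the plausibility is read as the probability that X^Γ_{e,|S|} falls at or below the observed average similarity with the support set, and it approximates the beta cumulative distribution by Simpson's rule; the function names and the toy similarity values are assumptions made for the example.

```python
import math


def beta_cdf(x, a, b, steps=2000):
    """Approximate P(X <= x) for X ~ Beta(a, b) with a, b >= 1 (Simpson's rule)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(t):
        if t == 0.0:
            # The density at 0 is finite because a >= 1 (non-zero only if a == 1).
            return math.exp(log_norm) if a == 1.0 else 0.0
        return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1.0 - t))

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return min(1.0, max(0.0, total * h / 3.0))


def plausibility(e, support, individuals, sim):
    """Score A(e) from its support set (assumed reading: a beta CDF value)."""
    others = [x for x in individuals if x != e]
    m = sum(sim(e, x) for x in others) / len(others)        # baseline similarity
    observed = sum(sim(e, x) for x in support) / len(support)
    n = len(support)
    return beta_cdf(observed, m * n + 1, (1 - m) * n + 1)


# Hypothetical similarities, for illustration only.
toy_sim = {
    frozenset({"CEO", "Peter Munk"}): 0.1,
    frozenset({"CEO", "Thelonious Monk"}): 0.1,
    frozenset({"CEO", "Beijing"}): 0.4,
    frozenset({"Peter Munk", "Thelonious Monk"}): 0.8,
    frozenset({"Peter Munk", "Beijing"}): 0.2,
    frozenset({"Thelonious Monk", "Beijing"}): 0.2,
}
individuals = ["CEO", "Peter Munk", "Thelonious Monk", "Beijing"]
persons = ["CEO", "Peter Munk", "Thelonious Monk"]


def sim(a, b):
    return toy_sim[frozenset({a, b})]


score_ceo = plausibility("CEO", [p for p in persons if p != "CEO"], individuals, sim)
score_munk = plausibility("Peter Munk", [p for p in persons if p != "Peter Munk"], individuals, sim)
print(score_ceo, score_munk)
```

In this toy setting, Person(CEO) receives a lower score than Person(Peter Munk), because CEO is less similar to the other instances of Person than to a random individual. A statistics library providing a beta distribution could replace the numerical integration.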
A possible question here is the choice of inst_Γ(A) \ {e} as the support set for A(e). For instance, if ψ = Person(Peter Munk), a case could be made for using inst_Γ(¬A) as well, i.e. for exploiting the (dis)similarity between Peter Munk and individuals which, according to K, are instances of ¬Person. This is quite unrealistic though from a linguistic point of view, which can be intuitively seen in this example by replacing Peter Munk with CEO. Assume for instance that Thelonious Monk and Beijing are (reliable) instances of Person and ¬Person respectively according to Γ. There is no reason to expect that sim(CEO, Beijing) > sim(CEO, Thelonious Monk). In other words, it is implausible to assume that elements of inst_Γ(¬A) should a priori share similar contexts.
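Interestingly enough, and for the same reason, the support set for a consequence of Γ of the form ¬A(e) is not inst_Γ(¬A), but S = inst_Γ(A), which yields:

Definition 4.2.
$$ sc_\Gamma(\neg A(e)) = 1 - P\left( X^\Gamma_{e,|S|} \le \frac{\sum_{e' \in S} sim(e, e')}{|S|} \right) $$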

Linguistic compliance of Γ
This does not directly address the second problem mentioned in the introduction though. For practical ontology verification, it is also desirable to identify the cause of this nonsense, i.e. statements (axioms in the DL terminology) which are intuitively problematic. For instance, in example 1, computing sc_K(ψ) for each ψ ∈ Ψ_K may signal that the consequence ψ_1 is unlikely to hold wrt the larger ontology K. And discarding either (1) or (4) is sufficient to get rid of the belief in ψ_1. But given the additional assumptions made about K, discarding the former is preferable, in that discarding the latter would also result in the loss of ψ_2. In other words, some subbases of K (like K \ (1) here) are more relevant than others (e.g. K \ (4)), which can be simply captured as follows.
Let comp(Γ) be an estimation of the compliance of a subbase Γ of K with the gathered linguistic evidence. A straightforward option consists in setting comp(Γ) to be the mean of the scores of evaluated consequences for Γ, i.e.:

Definition 4.3.
$$ comp(\Gamma) = \frac{\sum_{\psi \in \Psi_\Gamma} sc_\Gamma(\psi)}{|\Psi_\Gamma|} $$

Then a strict partial order ≺ over 2^K can simply be defined by Γ_1 ≺ Γ_2 iff either comp(Γ_1) < comp(Γ_2), or (comp(Γ_1) = comp(Γ_2) and Γ_1 ⊂ Γ_2), 4 and a subbase Γ of K can be viewed as optimal if it is maximal wrt ≺. 5

In practice though, identifying optimal subbases is a non-trivial task. To see this, note that the function to be maximized is not directly a function of the statements in Γ, but of Ψ_Γ, i.e. some of the consequences of Γ. So even if one could identify a subset Ψ' of Ψ_K which maximizes this function, there may not exist a subbase Γ of K such that Ψ_Γ = Ψ'. Another difficulty comes from the fact that for two subbases Γ_1 and Γ_2 of K, and a consequence ψ ∈ Ψ_Γ1 ∩ Ψ_Γ2, it does not hold in general that sc_Γ1(ψ) = sc_Γ2(ψ), because the support set for ψ in Γ_1 may differ from its support set in Γ_2. In particular, it may be the case that Γ_1 ⊆ Γ_2 but sc_Γ1(ψ) > sc_Γ2(ψ), which greatly reduces the possible uses of monotonicity (if Γ_1 ⊆ Γ_2, then Cn(Γ_1) ⊆ Cn(Γ_2)) to optimize the exploration of 2^K. More generally, if the optimal subbases of K are small (say half the size of K), it can be rightfully argued that dropping so many statements for the sake of linguistic evidence is not a viable debugging strategy.

4 The assumption is made that a minimum of syntactic information should be lost whenever possible, i.e. Γ_1 and Γ_2 are primarily viewed as bases, not as theories. In particular, if Cn(Γ_1) = Cn(Γ_2), but Γ_1 ⊈ Γ_2 and Γ_2 ⊈ Γ_1, then Γ_1 and Γ_2 are not comparable wrt ≺. Redundancies in this view should also be preserved when possible, i.e. if Cn(Γ_1) = Cn(Γ_2) and Γ_1 ⊂ Γ_2, then Γ_1 ≺ Γ_2 still holds.
5 There may be several optimal subbases.
Therefore a more plausible application scenario is one in which the search space has been previously circumscribed, either by setting a maximal (small) number of statements to discard, or by identifying a set of potentially erroneous statements, through axiom pinpointing, as explained in section 6. This is also why the evaluation presented in section 5 focuses on the simplest possible case, i.e. the removal from K of one statement only, whereas the integration of distributional evidence to more complex debugging strategies is discussed in section 6.
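The compliance score comp and the order ≺ introduced earlier in this section can be sketched as follows. Subbases are represented as frozensets of statement labels and the plausibility scores are passed as dictionaries; both the representation and the numeric values are assumptions made for the example.

```python
from statistics import mean


def comp(scores):
    """Compliance of a subbase: mean plausibility of its evaluated consequences."""
    return mean(scores.values())


def strictly_below(gamma1, scores1, gamma2, scores2):
    """Gamma1 ≺ Gamma2: lower compliance, or equal compliance and proper subset."""
    c1, c2 = comp(scores1), comp(scores2)
    # On frozensets, `<` tests proper inclusion.
    return c1 < c2 or (c1 == c2 and gamma1 < gamma2)


# Hypothetical scores for example 1, for illustration only.
K = frozenset({"(1)", "(2)", "(3)", "(4)"})
scores_K = {"Person(CEO)": 0.05, "Person(Peter Munk)": 0.6}
K_minus_1 = K - {"(1)"}
scores_K_minus_1 = {"Person(Peter Munk)": 0.6}

print(strictly_below(K, scores_K, K_minus_1, scores_K_minus_1))
```

With these invented values, K ≺ K \ (1), i.e. discarding (1) improves compliance. Exact float equality is used for the tie case, which a practical implementation may want to relax with a tolerance.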
As an alternative to the function comp, and in order to avoid the fact that the same consequence may have different plausibility scores wrt two subbases of K, one may choose to discard unlikely consequences based on their respective scores in K, i.e. to use the score comp_K(Γ), defined by:

Definition 4.4.
$$ comp_K(\Gamma) = \frac{\sum_{\psi \in \Psi_\Gamma} sc_K(\psi)}{|\Psi_\Gamma|} $$

This solution is arguably less satisfying, but more amenable to optimizations. A trivial example is that of a subbase Γ_1 with max_{ψ ∈ Ψ_Γ1} sc_K(ψ) < comp_K(Γ_2) for some already evaluated subbase Γ_2, in which case no subbase of Γ_1 can be optimal wrt ≺.
Additionally, instead of taking the mean of the scores of evaluated consequences of Γ, one may want to penalize the subbases of K with the most unlikely consequences, which gives a standard (total) lexicographic ordering ⪯_lex on 2^K, defined as follows. Let ω_Γ = ⟨ω^1_Γ, .., ω^{|Ψ_Γ|}_Γ⟩ be the vector of formulas of Ψ_Γ ordered by increasing score sc_Γ, and let sc_Γ(ω_Γ) = ⟨sc_Γ(ω^1_Γ), .., sc_Γ(ω^{|Ψ_Γ|}_Γ)⟩ be the corresponding vector of scores; subbases are compared through the lexicographic comparison of these vectors. Then as previously, a strict partial order ≺ over 2^K can be defined by Γ_1 ≺ Γ_2 iff either Γ_1 ≺_lex Γ_2, or (Γ_1 =_lex Γ_2 and Γ_1 ⊂ Γ_2). Again, sc_K(ψ) may be used instead of sc_Γ(ψ), yielding the lexicographic ordering ⪯_lex^K. This last possibility corresponds to a relatively intuitive operation, which consists in giving up in priority the most implausible consequences of K. All four possibilities are evaluated in what follows.
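The difference between the mean-based and the lexicographic comparison can be illustrated with a short sketch. It relies on Python's built-in tuple comparison, which treats a strict prefix as lower than the longer vector; this handling of vectors of unequal length is an assumption of the sketch, and the scores are invented.

```python
from statistics import mean


def lex_key(scores):
    """Scores of the evaluated consequences, sorted by increasing plausibility."""
    return tuple(sorted(scores.values()))


def lex_below(gamma1, scores1, gamma2, scores2):
    """Strict order: lexicographically lower, or equal vectors and proper subset."""
    k1, k2 = lex_key(scores1), lex_key(scores2)
    return k1 < k2 or (k1 == k2 and gamma1 < gamma2)


# Hypothetical scores: subbase a has one very implausible consequence.
scores_a = {"psi1": 0.05, "psi2": 0.9, "psi3": 0.9}
scores_b = {"psi1": 0.3, "psi2": 0.4, "psi3": 0.5}
gamma_a = frozenset({"s1", "s2"})
gamma_b = frozenset({"s1", "s3"})

print(mean(scores_a.values()) > mean(scores_b.values()))   # the mean prefers a
print(lex_below(gamma_a, scores_a, gamma_b, scores_b))      # the lexicographic order prefers b
```

Here the mean favours the first subbase, whereas the lexicographic comparison penalizes it for its single very implausible consequence.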

Evaluation
The dataset used for this evaluation is a fragment of the fisheries ontology from the NEON project. 7 It has been automatically built out of 10 randomly selected named individuals, applying a module extraction procedure, followed by a trimming algorithm. The fragment, noted F in what follows, contains 1038 (logical) statements, and involves 71 named individuals (mostly geographical or administrative entities), the least expressive underlying DL being SI.
The linguistic input is a small corpus of approximately 6300 web pages, retrieved with a search engine, using the labels of named individuals of F as queries. The HTML documents were cleaned with the BootCat library (Baroni and Bernardini, 2004).
The construction of the distributional representations of the named individuals of F was basic, the use of more elaborate methods (SVD, etc.) being left for future work. The approach presented in this article remains generic enough to be applied to most existing distributional frameworks, the only requirement being a real-valued similarity measure.
7 http://www.neon-project.org/nw/Ontologies

Two different forms of linguistic contexts were alternatively tested. The first option considers as a context any n-gram (2 ≤ n ≤ 5) without punctuation mark which immediately precedes or follows a term t denoting an individual of F. The other option is a more customized one, extracting sequences of lemmatized words (lemmaPOS in what follows) surrounding t, in a shifting window of 3 to 5 tokens + the size of t, ignoring certain categories of words. Part-of-speech tagging was performed with the Stanford Parser (Toutanova et al., 2003), with a pre-trained model for English. If Cont designates the set of contexts observed with at least 2 individuals, then an individual was represented by the vector of its respective frequencies with each context c ∈ Cont. Different possibilities were compared to weight these frequencies. The pointwise mutual information (PMI) was used in a standard way for n-grams and lemmaPOS contexts (with possible negative resulting values set to 0). Following (Giuliano and Gliozzo, 2008), the self-information self(c) was also used for n-grams, defined by self(c) = − log p(c), the probability p(c) being estimated with the Microsoft Web N-gram Services. A combined weighting by PMI and self-information was also tested for n-grams. These alternative settings are represented by capital letters in tables 1 and 2: LP for lemmaPOS with PMI, and NP, NS and NPS for n-grams with PMI, self-information and both respectively.
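As an illustration of the PMI weighting, the sketch below filters contexts shared by at least 2 individuals and replaces raw frequencies by PMI values, with negative values set to 0. The probability estimates are simple relative frequencies over the filtered counts, which is an assumption of this sketch; the counts themselves are invented.

```python
import math
from collections import Counter


def ppmi_vectors(counts):
    """Map {individual: {context: frequency}} to PMI-weighted vectors (negatives set to 0)."""
    # Keep only contexts observed with at least 2 individuals (the set Cont).
    seen_with = Counter(c for ctxs in counts.values() for c in ctxs)
    cont = {c for c, n in seen_with.items() if n >= 2}
    filtered = {e: {c: f for c, f in ctxs.items() if c in cont}
                for e, ctxs in counts.items()}
    total = sum(f for ctxs in filtered.values() for f in ctxs.values())
    ind_tot = {e: sum(ctxs.values()) for e, ctxs in filtered.items()}
    ctx_tot = Counter()
    for ctxs in filtered.values():
        ctx_tot.update(ctxs)
    vectors = {}
    for e, ctxs in filtered.items():
        vec = {}
        for c, f in ctxs.items():
            # PMI(e, c) = log( p(e, c) / (p(e) * p(c)) ), with relative frequencies.
            pmi = math.log((f * total) / (ind_tot[e] * ctx_tot[c]))
            vec[c] = max(pmi, 0.0)
        vectors[e] = vec
    return vectors


# Hypothetical context counts, for illustration only.
counts = {
    "Peter Munk": Counter({"X was born in": 3, "X founded": 1}),
    "Thelonious Monk": Counter({"X was born in": 4, "X recorded": 2}),
    "Beijing": Counter({"X founded": 1, "flights to X": 5}),
}
vectors = ppmi_vectors(counts)
print(vectors)
```

Contexts observed with a single individual ("X recorded", "flights to X") are dropped, since they cannot contribute to any similarity.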
The ontology F has been extended for the sake of the evaluation, with statements randomly generated out of its signature. The underlying assumption is that adding such statements to F is very likely to generate violations of common sense (although nothing prevents in theory the generation of plausible statements too). The goal for the evaluation was then to automatically retrieve proper consequences of each extension of F on the one hand, and the random statements themselves on the other hand.
To prevent any misunderstanding, it should be emphasized that this is not a realistic application case. The input ontology was selected for its quality, and degraded through random statement generation, allowing an arguably artificial, but also very objective evaluation procedure (the only bias may come from randomly generated statements which are actually plausible). By contrast, using a non modified input dataset, and evaluating whether or not the axioms/consequences spotted by the algorithm are actually erroneous is a complex and subjective task, with a possibly low inter-annotator agreement.
The generation procedure randomly selects a statement φ ∈ F, and yields a statement φ' with the same syntactic structure as φ, but in which individuals and predicates have been replaced by random individuals and predicates appearing in F. For instance, if φ = ∀xy(A(x) ∧ r(x, y) → ¬B(y)), then φ' = ∀xy(C(x) ∧ s(x, y) → ¬D(y)), with C and D (resp. s) randomly chosen among classes (resp. binary predicates) of the signature of F. 100 randomly generated statements φ_1, .., φ_100 were added independently to F, yielding 100 input ontologies K_1, .., K_100, such that each K_i was consistent, and that there was at least one consequence of the form A(e) or ¬A(e) entailed by K_i but not by F, with e sharing at least one linguistic context with some other individual of F. All 100 input ontologies are available online.

Table 1: Average ranking among Ψ_Ki of the lowest-ranked formula of Ψ^rand_Ki, and p-value for the rankings of all formulas of all Ψ^rand_Ki

The first part of the evaluation was performed as follows. For each K_i and each ψ ∈ Ψ_Ki, the plausibility sc_Ki(ψ) was computed as in definitions 4.1/4.2, and Ψ_Ki was ordered by increasing plausibility. Within Ψ_Ki are consequences which were not initially entailed by F, but have been obtained after the extension of F with the random statement φ_i. So in a sense, these consequences are randomly generated too, and therefore one may expect many of them to convey absurd information (for instance Architect(Belgium)), or at least to be outliers (like Person(CEO) in ex 1) within Ψ_Ki. Let Ψ^rand_Ki designate these additional consequences, i.e. Ψ^rand_Ki = Ψ_Ki \ Ψ_F. If ψ ∈ Ψ^rand_Ki, and if sc_Ki(ψ) is actually lower than for most other formulas of Ψ_Ki, this would indicate that the plausibility score, as formulated in definitions 4.1/4.2, is actually a good estimator.
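The generation procedure described at the beginning of this passage can be sketched as follows. Formulas are encoded as nested tuples, and the function name `randomize` as well as the toy signature are assumptions made for the example; the actual procedure operates on OWL statements.

```python
import random


def randomize(formula, signature, rng):
    """Return a formula with the same syntactic skeleton and randomly redrawn symbols."""
    head = formula[0]
    if head in signature:
        # Leaf such as ("class", name, var) or ("role", name, var1, var2):
        # redraw the symbol, keep the variables untouched.
        return (head, rng.choice(signature[head])) + formula[2:]
    # Connective or quantifier node: recurse on every subformula.
    return (head,) + tuple(randomize(sub, signature, rng) for sub in formula[1:])


# phi = forall x, y (A(x) and r(x, y) -> not B(y))
phi = ("forall", ("implies",
                  ("and", ("class", "A", "x"), ("role", "r", "x", "y")),
                  ("not", ("class", "B", "y"))))

# Hypothetical signature, for illustration only.
signature = {"class": ["A", "B", "C", "D"], "role": ["r", "s"]}
phi_prime = randomize(phi, signature, random.Random(0))
print(phi_prime)
```

Only the class and role names change; connectives, quantifiers and variables are preserved, so the generated statement has the same syntactic structure as the selected one.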
In order to evaluate this, column "rank" in table 1 gives the average ranking (for all 100 ontologies) within Ψ_Ki of the formula ψ_i ∈ Ψ^rand_Ki with lowest score. The lower this ranking, the more efficient the plausibility score is at detecting outlier consequences. Column "pVal" gives the probability (t-test) for the cumulated rankings of all formulas in all Ψ^rand_Ki to be as low as the observed ones, if all consequences in all Ψ_Ki had been randomly ordered.
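For a single extended ontology, the quantity averaged in the "rank" column can be computed as in the sketch below. The function name and the scores are hypothetical; ties are broken by the order in which formulas appear in the dictionary.

```python
def rank_of_lowest_random(scores_Ki, psi_F):
    """Rank (starting at 1) of the least plausible randomly induced consequence."""
    ordered = sorted(scores_Ki, key=scores_Ki.get)   # increasing plausibility
    psi_rand = set(scores_Ki) - set(psi_F)           # consequences not entailed by F
    return min(ordered.index(psi) + 1 for psi in psi_rand)


# Hypothetical scores, for illustration only.
scores_Ki = {
    "Architect(Belgium)": 0.02,
    "Country(Belgium)": 0.7,
    "Country(France)": 0.8,
}
psi_F = {"Country(Belgium)", "Country(France)"}
print(rank_of_lowest_random(scores_Ki, psi_F))
```

A value of 1 means that the randomly induced consequence is also the least plausible one according to the score.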
Results are convincing, with a significant p-value for all four settings. For most ontologies (75/100), there was only one formula in Ψ^rand_Ki. A closer look at the data revealed that, for the best setting (LP), in most of these cases (57/75), the only formula in Ψ^rand_Ki was also the one with lowest plausibility in Ψ_Ki, out of 216.1 formulas on average, i.e. the only randomly generated consequence was also the least plausible one according to linguistic evidence. This is very encouraging, especially considering the relatively small number of named individuals (71) in F, i.e. the fact that the support to evaluate the plausibility of a consequence ψ ∈ Ψ_Ki was limited. On the other hand, performances were generally poor when the cardinality of Ψ^rand_Ki was large (> 0.25 * |Ψ_Ki|), which may be explained by the fact that support sets for some classes of F were significantly modified after the extension of F with φ_i.
As for the settings, unsurprisingly, the two most beneficial (but unfortunately incompatible) factors were the use of lemmatized contexts on the one hand (LP), and the queries over the Web N-gram corpus on the other hand (NS and NPS).

The second part of the evaluation focused on the retrieval of the random statements φ_1, .., φ_100, for the LP setting only, because it gave the best results in the previous experiment. For each extended base K_i, all immediate subbases Γ_{i,1}, .., Γ_{i,|F|+1} of K_i were generated, i.e. each Γ_{i,j} was such that K_i = Γ_{i,j} ∪ {φ_j} for some statement φ_j of K_i. The different Γ_{i,j} were ordered by decreasing compliance score comp(Γ_{i,j}) (resp. comp_Ki(Γ_{i,j})), or by decreasing lexicographic ordering ⪯_lex (resp. ⪯_lex^Ki). Intuitively, this yields a ranking on K_i where the least reliable statements wrt linguistic evidence should appear first: if φ_j ∈ K_i, and if the subbase of K_i obtained by discarding φ_j (i.e. Γ_{i,j}) has a higher linguistic compliance score than K_i, then discarding φ_j can be viewed as an improvement over K_i. And if Γ_{i,j} is among the best ranked subbases of K_i, then φ_j is among the least reliable statements of K_i wrt distributional evidence. For instance, in example 1, one may expect the subbase K \ (1) to have a maximal linguistic compliance score among immediate subbases of K (or to be maximal wrt the lexicographic ordering), such that (1) is the best candidate for removal. So back to the test data, if K_i = F ∪ {φ_i}, i.e. if φ_i is, among the |F| + 1 statements of K_i, the one which has been randomly generated, and if Γ_{i,i} = K_i \ {φ_i} is among the best ranked immediate subbases of K_i, this would indicate that the linguistic compliance score in definitions 4.3 (resp. 4.4), or the corresponding lexicographic ordering ⪯_lex (resp. ⪯_lex^Ki) is actually a good estimator of faulty statements.

                rank            p-val
comp(Γ)         7.86 / 80.03    < 0.001
comp_Ki(Γ)      8.05 / 80.03    < 0.001
lex             6.51 / 80.03    < 0.001
lex_Ki          2.47 / 80.03    < 0.001
Table 2: Average ranking of the randomly generated statement φ_i for each K_i, and p-value for the rankings of all φ_i
An additional precaution was taken in order to avoid artificially good results. For most statements φ_j ∈ K_i, discarding φ_j did not have any impact on the set Ψ_Γi,j of consequences to be evaluated, i.e. Ψ_Γi,j = Ψ_Ki, and therefore comp(Γ_{i,j}) = comp(K_i). Let ∆_i ⊆ K_i be the set of statements whose removal did have an impact instead (on average, there were 79.03 statements in ∆_i). Then the compliance of a subbase Γ_{i,j} of K_i was evaluated only if φ_j ∈ ∆_i, i.e. only if the removal of φ_j made a difference. K_i was also added to this set of evaluated subbases, yielding a ranking of 79.03 + 1 = 80.03 bases on average.
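The ranking of immediate subbases, together with the precaution just described, can be sketched as follows. The callable `evaluate` is a hypothetical stand-in for the reasoner and the scoring step: it maps a subbase to the plausibility scores of its evaluated consequences. Subbases left without any evaluated consequence are skipped, which is a choice made for this sketch only.

```python
from statistics import mean


def rank_removal_candidates(K, evaluate):
    """Rank statements of K by the compliance of the subbase obtained by removing them."""
    baseline = evaluate(K)
    ranked = []
    for phi in K:
        scores = evaluate(K - {phi})
        if scores.keys() == baseline.keys():
            continue  # removing phi has no impact on the evaluated consequences
        if not scores:
            continue  # nothing left to evaluate (assumption: such subbases are ignored)
        ranked.append((mean(scores.values()), phi))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [phi for _score, phi in ranked]


def toy_evaluate(gamma):
    """Hypothetical scores for example 1, standing in for reasoning and scoring."""
    scores = {}
    if "(4)" in gamma and "(1)" in gamma:
        scores["Person(CEO)"] = 0.05
    if "(4)" in gamma and "(2)" in gamma:
        scores["Person(Peter Munk)"] = 0.6
    return scores


K = frozenset({"(1)", "(2)", "(3)", "(4)"})
print(rank_removal_candidates(K, toy_evaluate))
```

In this toy setting, statement (1) is ranked first, (3) is ignored because its removal changes no evaluated consequence, and (4) is ignored because its removal leaves nothing to evaluate.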
Results are again positive. Column "rank" in table 2 gives the average ranking of Γ_{i,i}, i.e. the base obtained after the removal of the randomly generated statement φ_i. Both lexicographic orderings outperformed the compliance scores (i.e. the mean of plausibility scores), and the best configuration was the fourth presented in section 4.2, using sc_Ki(ψ) as a plausibility score instead of sc_Γi,j(ψ).

Applications
This section describes a few concrete use cases of the propositions made in section 4. A first basic but useful application is the identification of undesired consequences of a consistent input ontology K. As illustrated by example 1, violations of common sense often go unnoticed in publicly available OWL datasets, even though effective procedures can detect inconsistency in most DLs. This is correlated with the overall sparse usage of negation in OWL, yielding ontologies which are consistent by default rather than by design. The identification of such cases can be very simply performed, by returning to the user the formulas of Ψ_K with the lowest plausibility scores, like Person(CEO) in example 1. Axiom pinpointing algorithms (Schlobach and Cornet, 2003; Kalyanpur et al., 2007; Horridge, 2011) may then be used to compute all justifications for each returned consequence ψ, i.e. all (set-inclusion) minimal subsets of K which have ψ as a consequence.
In a more automated fashion, the greedy trimming approach described in (Corman et al., 2015) returns n statements of K which are candidate for removal, n being given as a parameter, by incrementally selecting the immediate subbase of Γ with maximal linguistic compliance score, starting with Γ = K.
But inconsistent ontology debugging may also benefit from distributional evidence. As discussed in section 2, state-of-the-art approaches to ontology debugging suffer from the number of candidate outputs, i.e. of (set-inclusion) maximal consistent subsets of K, as well as from the cost of their computation. If the set J of justifications for the inconsistency of K is known though, and if some (discriminating enough) preference relation a over J can be obtained, then prioritized base revision, as it is defined in (Nebel, 1992), provides a principled and computationally attractive solution to these problems. Even if the whole process cannot be depicted here, a may actually be obtained through distributional evidence, by evaluating, for each statement φ ∈ J, the plausibility of some consequences of candidate subbases in which φ does or does not appear. The support set in this case is reduced to consequences of the "safe" part of K, i.e. K \ J.

Extensions
A first straightforward extension of this framework consists in taking more complex classes into account. OWL (and most Description Logics) favor the recursive construction of arbitrarily complex classes out of the signature of Γ, and this mechanism could naturally be used to extend Ψ_Γ with more consequences of the form C(e), where C is one of these complex classes. For instance, in example 1, if C_1 and C_2 are respectively defined by ∀x(C_1(x) ⇔ ∃y(occupation(y, x))) and ∀x(C_2(x) ⇔ ∃y(occupation(x, y))), then Ψ_K can be extended "for free" with C_1(CEO) and C_2(Peter Munk). Unfortunately, if Ψ^+_Γ is the set of all consequences of Γ which can be built this way, there is in general no finite subset Ψ'_Γ of Ψ^+_Γ such that Ψ'_Γ ⊨ ψ for all ψ ∈ Ψ^+_Γ. Therefore the complex classes to be used must be selected, which is not trivial. Intuitively, some complex classes are more relevant than others (e.g. the class of "physical objects owned by someone" may be linguistically relevant, but probably not "Moldavian or Muslim lawyers whose father lives in an apartment").
Another simple variation of the framework presented here consists in setting Ψ_Γ to be all consequences of Γ of the form e_1 ≠ e_2, i.e. the fact that e_1 and e_2 are not the same individual according to Γ. The unique name assumption is not made in OWL, which means that two distinct named individuals can be interpreted identically, and therefore these consequences do not hold by default. They may be explicitly stated in Γ (owl:differentIndividuals(e_1, e_2)), but are in most cases entailed by Γ, provided it contains some form of negation (e.g. instances of two disjoint classes cannot be the same individual). If Γ_1 and Γ_2 are two subbases of K such that Γ_1 ⊨ e_1 ≠ e_2, but Γ_2 ⊭ e_1 ≠ e_2, and if the similarity between e_1 and e_2 is lower than expected, then ceteris paribus, Γ_1 will be preferred to Γ_2.

Conclusion
This article is centered on the use of distributional representations of (labels of) named individuals of an input ontology K, in order to identify and repair violations of common sense within K. For a set of statements Γ ⊆ K, and Ψ_Γ a specific set of consequences of Γ, a score sc_Γ(ψ) is attributed to each ψ ∈ Ψ_Γ, which evaluates the plausibility of ψ wrt Γ according to distributional evidence. Several methods based on this plausibility score are then proposed in order to compare two subbases Γ_1 and Γ_2 of K, leading to the identification of potentially erroneous statements. An evaluation is provided, which consists in extending a test ontology with randomly generated statements before trying to spot them automatically, with significant results. A more thorough evaluation is still required though, testing in particular the impact of a higher number of named individuals and/or classes. Scalability of the approach may also be limited by its heavy reliance on a reasoner. Finally, potential improvements may come from using more elaborate distributional representations, like the one described in (Mikolov et al., 2013).