Learning Embeddings to lexicalise RDF Properties

A difficult task when generating text from knowledge bases (KB) consists in finding appropriate lexicalisations for KB symbols. We present an approach for lexicalising knowledge base relations and apply it to DBPedia data. Our model learns low-dimensional embeddings of words and RDF resources and uses these representations to score RDF properties against candidate lexicalisations. Training our model using (i) pairs of RDF triples and automatically generated verbalisations of these triples and (ii) pairs of paraphrases extracted from various resources yields competitive results on DBPedia data.


Introduction
In recent years, work on the Semantic Web has led to the publication of large-scale datasets in the so-called Linked Data framework, such as DBPedia or Yago. However, as shown in (Rector et al., 2004), the basic standards (e.g., RDF, OWL) established by the Semantic Web community for representing data and ontologies are difficult for human beings to use and understand. With the development of the Semantic Web and the rapid increase of Linked Data, there is consequently a growing need in the Semantic Web community for technologies that give humans easy access to the machine-oriented Web of data.
Because it maps data to text, Natural Language Generation (NLG) provides a natural means for presenting this data in an organized, coherent and accessible way. It can be used to display the content of linked data or of knowledge bases to lay users; to generate explanations, descriptions and summaries from DBPedia or from knowledge bases; to guide the user in formulating knowledge base queries; and to provide ways for cultural heritage institutions such as museums and libraries to present information about their holdings in multiple textual forms.
In this paper, we focus on an important subtask of generation from RDF data, namely the lexicalisation of RDF properties. Given a property, our goal is to map this property to a set of possible lexicalisations. For instance, given the property HASWONPRIZE, our goal is to automatically infer lexicalisations such as "was honored with" and "received".
Our approach is based on learning low-dimensional vector embeddings of words and of KB triples so that representations of triples and their corresponding lexicalisations end up being similar in the embedding space. Using these embeddings, we can then assess the similarity between a property and a set of candidate lexicalisations by simply applying the dot product to their vector embeddings.
One difficulty when lexicalising RDF properties is that, while in some cases, there is a direct and simple relation between the name of a property and its verbalisation (e.g., BIRTHDATE / "was born on"), in other cases, the relation is either indirect (e.g., ROUTEEND / "finishes at") or opaque (e.g., CREW1UP / "is the commander of").
To account for these possibilities, we explore two main ways of creating candidate lexicalisations, based on either lexical or extensional relatedness. Given some input property p, lexically-related candidate lexicalisations for p are phrases containing synonyms or derivationally related words of the tokens making up the name of the input property. In contrast, extensionally-related candidate lexicalisations are phrases containing named entities which are in the extension of p. For instance, given the property CREW1UP, if the pair of entities (STS-130, GEORGE D. ZAMK) is in its extension (i.e., there exists an RDF triple of the form ⟨STS-130, CREW1UP, GEORGE D. ZAMK⟩), all sentences mentioning STS-130, GEORGE D. ZAMK or both will be retrieved and exploited to build the set of candidate lexicalisations for CREW1UP. Figure 1 shows some example L- and E-candidate lexicalisation phrases.
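The two candidate types can be illustrated with a minimal Python sketch. The related-word table is a hand-written stand-in for a WordNet lookup, the function names `l_related` and `e_related` are hypothetical, and matching is done by plain word overlap and substring tests; none of this is claimed to be the procedure actually used to build the candidate sets.

```python
# Toy stand-in for WordNet: token -> synonyms and derivationally related words.
# These entries are illustrative assumptions, not WordNet output.
RELATED_WORDS = {
    "route": {"route", "road", "path"},
    "end": {"end", "ends", "finish", "finishes", "terminates"},
}


def l_related(property_tokens, phrases):
    """Keep phrases sharing at least one word with the related words of the property tokens."""
    vocabulary = set()
    for token in property_tokens:
        vocabulary |= RELATED_WORDS.get(token, {token})
    return [p for p in phrases if vocabulary & set(p.lower().split())]


def e_related(extension, sentences):
    """Keep sentences mentioning the subject, the object, or both, of a pair in the extension."""
    return [
        sentence
        for sentence in sentences
        if any(subj in sentence or obj in sentence for subj, obj in extension)
    ]
```

For the property ROUTEEND, tokenised by hand as `["route", "end"]`, `l_related` keeps "finishes at" and discards "was born on"; for CREW1UP, `e_related` keeps any sentence that mentions STS-130 or GEORGE D. ZAMK.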
In summary, the key contribution made in this paper is a novel method for lexicalising RDF properties which differs from previous work in two ways. First, while lexical and extensional relatedness have been used before for lexicalising RDF properties (Walter et al., 2013), ours is the first lexicalisation approach which jointly considers both sources of information. Second, while previous approaches have used discrete representations and similarity metrics based on WordNet, our method exploits continuous representations of both words and KB symbols that are learned and optimised for the lexicalisation task.

Related Work
We situate our work with respect to previous work on ontology lexicons but also to research on relation extraction (extracting verbalisations of knowledge base relations) and to embeddings-based approaches.
Ontology Lexicons. (Trevisan, 2010) proposes a simple lexicalisation approach which exploits the tokens included in a property name to build candidate lexicalisations. In brief, this approach consists in tokenizing and part-of-speech tagging relation names with a customized tokenizer and part-of-speech (PoS) tagger. A set of hand-defined mappings is then used to map PoS sequences to lexicalisations. For instance, given the property name HASADRESS, this approach will produce the candidate lexicalisation "the address of S is O" where S and O are place-holders for the lexicalisations of the subject and object entity in the input RDF triple.
(Walter et al., 2013; Walter et al., 2014a; Walter et al., 2014b) describe an approach for inducing a lexicon mapping DBPedia properties to possible lexicalisations. The approach combines a label-based and a pattern-based method. The label-based method extracts lexicalisations from property names using additional information (e.g., synonyms) from external resources. The pattern-based method extracts lexicalisations from a text corpus by retrieving sentences containing entities that are related by a DBPedia property and generalising over the dependency paths that connect them using hand-written patterns and frequency counts.
While these approaches can be effective, (Trevisan, 2010)'s approach fails to account for "opaque" property names (i.e., properties such as CREW1UP whose lexicalisation is not directly deducible from the tokens making up the property name), and the pattern-based approach of (Walter et al., 2013), because it relies on frequency counts rather than lexical relatedness, allows for lexicalisations that may be semantically unrelated to the input property. In contrast, we learn continuous representations of both KB properties and words and exploit these to rank candidate lexicalisations which are either lexically or extensionally related to the properties to be lexicalised. In this way, we consider both types of property names while systematically checking for semantic relatedness.
Relation Extraction. Earlier Information Extraction (IE) systems learned an extractor for each target relation from labelled training examples (Riloff, 1996; Soderland, 1999). For instance, (Riloff, 1996) first extracts relation mention patterns from the corpus, then ranks these based on the number of times a relation pattern occurs in a text labelled with the target relation.
More recent work on Open IE has focused on building large scale knowledge bases such as Reverb by extracting arbitrary relations from text (Wu and Weld, 2010;Mohamed et al., 2011;Nakashole et al., 2012).
While relation extraction can be viewed as the mirror task of relation lexicalisation, there are important differences. Our lexicalisation task differs from domain-specific IE in that it is unsupervised (we do not have access to annotated data). It also differs from open IE in that the set of properties to be lexicalised is predefined whereas, by definition, in open IE, the set of relations to be extracted is unrestricted. That is, while we aim to find the possible lexicalisations of a given set of relations (here DBPedia properties), open IE seeks to extract an unrestricted set of relations from text. Nevertheless, (Nakashole et al., 2012) includes a clustering phase which permits grouping relation clusters with a predefined set of properties such as, in particular, DBPedia properties. In Section 6, we therefore compare our results with the lexicalisations output by (Nakashole et al., 2012)'s approach.
Embedding-based Approaches. The model we propose is inspired by (Bordes et al., 2014). In (Bordes et al., 2014), low-dimensional embeddings of words and KB symbols are learned so that representations of questions and their corresponding answers end up being similar in the embedding space. The embeddings are learned using questions automatically generated from KB triples and a dataset of questions marked as paraphrases (WikiAnswers). We adapt this model to the lexicalisation task by generating noisy lexicalisations of KB triples using a simple generation approach and by exploiting different paraphrase resources (cf. Section 3). Our approach further differs from (Bordes et al., 2014) in that we combine this embedding-based framework with a pre-selection of candidate lexicalisations which reflects knowledge about the property extension and the property name. As mentioned in Section 1, E-related candidate lexicalisation phrases are sentences mentioning the subject and/or the object of the property being considered for lexicalisation, while L-related candidate lexicalisation phrases are phrases containing synonyms or derivationally related words of the tokens making up the name of that property. In this way, we provide a joint modelling of the impact of lexical and extensional similarity on lexicalisation.

Approach
Given a KB property p, our task is to find a set of possible lexicalisations $L_p$ for p. For instance, given the property HASWONPRIZE, our goal is to automatically infer lexicalisations such as "was honoured with" and "received".

Lexicalisation Algorithm
Our lexicalisation algorithm is composed of the following steps:
Embeddings. Using distant supervision, we learn embeddings of words and KB symbols such that the representations of KB triples, of sentences artificially generated from these triples and of their paraphrases are similar in the embedding space.
Candidate Lexicalisations. Using WordNet and the extension of RDF properties (i.e., the set of pairs of entities related by that property), we build sets of candidate lexicalisation phrases. "Subject Relation Object" phrases are extracted from the set of candidate sentences using Reverb. Reverb is a tool for Open IE which extracts relation mentions from text based on frequency counts and regular expression filters.
Ranking. Using the dot product on embedding-based representations of triples and candidate lexicalisation phrases, we rank candidate lexicalisations of properties.
Extractions. We apply some normalisation rules to the relation mentions of the ranked lexicalisations to eliminate "duplicates". These rules consist of a small set of basic patterns to detect and remove adverbs, adjectives, determiners, etc. For instance, given the relation mentions "always led by", "is also led by" and "is currently led by", only one version will be extracted, that is, "led by". From the top-ranked lexicalisation phrases according to some threshold (e.g., top 10), we extract the lexicalisation set $L_p$ for property p. Lexicalisations in $L_p$ are relation mentions from the ranked lexicalisation phrases.
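The normalisation in the extraction step can be sketched as follows. The word lists are hypothetical stand-ins for the basic patterns mentioned above, and stripping a leading auxiliary is an assumption made here so that the three example mentions collapse to a single form; the text does not specify the exact rules.

```python
# Hypothetical word lists standing in for the pattern-based rules.
ADVERBS = {"always", "also", "currently"}
AUXILIARIES = {"is", "was", "are", "were"}


def normalise(mention):
    """Drop listed adverbs anywhere and auxiliaries at the start of a relation mention."""
    words = [w for w in mention.lower().split() if w not in ADVERBS]
    while words and words[0] in AUXILIARIES:
        words.pop(0)
    return " ".join(words)


def extract_lexicalisations(ranked_mentions, k=10):
    """Normalise the k best-ranked mentions and keep one copy of each, in rank order."""
    seen = set()
    unique = []
    for mention in ranked_mentions[:k]:
        key = normalise(mention)
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Applied to the three mentions of the example, `extract_lexicalisations` returns the single entry "led by".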

Learning Words and KB symbols Embeddings
Similar to the work of (Bordes et al., 2014), we use distant supervision and multitask training to learn embeddings of words and KB symbols.
Training Set Generation. We train on two datasets, one aligning KB triples with automatically generated verbalisations of these triples and the other aligning paraphrases. The first dataset (T) is used to learn a similarity function between KB symbols and words, the second (P) to account for the many ways in which a given property may be verbalised.
Triples and Sentences (T). We build a training corpus of KB triples and natural language sentences by combining the pattern-based lexicalisation approach of (Trevisan, 2010) (cf. Section 2) with a simple grammar-based generation step. We apply this approach to map KB property names to syntactic constructions and then use a simple grammar to generate sentences from KB triples. For instance, the triple in (1a) will yield the sentences in (1b-g). On average, each property is associated with 5.9 sentences. Given a training pair $(t, s)$ such that $t = (s_k, p_k, o_k)$, we generate negative examples by corrupting the triple, i.e., by producing pairs of the form $(t', s)$ such that $t' = (s_k, p'_k, o_k)$ and $p'_k \neq p_k$.
Paraphrases (P). To learn embeddings and a similarity function that takes into account the various ways in which a property can be lexicalised, we supplement our training data with pairs of paraphrases contained in the PPDB paraphrase database, in the WikiAnswers dataset and in DBPedia (DBPP). Positive examples $(p_i, p_j)$ are taken from these datasets and negative examples are produced by creating corrupted pairs $(p_i, p_l)$ such that $p_i$ is not in the paraphrase dataset of $p_l$ and vice versa.
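The generation of negative examples can be sketched as below. The sketch assumes that a triple is corrupted by replacing its property with a different one while keeping subject and object, which matches the INFLUENCED / COMPUTINGMEDIA training example given in the paper; the function names, the `paraphrases_of` lookup and the uniform sampling are illustrative assumptions.

```python
import random


def corrupt_triple(triple, properties, rng):
    """Replace the property of a (subject, property, object) triple by a different property."""
    subject, prop, obj = triple
    alternatives = [p for p in properties if p != prop]
    return (subject, rng.choice(alternatives), obj)


def corrupt_paraphrase(pair, paraphrases_of, phrases, rng):
    """Pair p_i with a phrase p_l such that neither is a known paraphrase of the other.

    paraphrases_of maps each phrase to the set of its known paraphrases (assumed layout).
    """
    p_i, _ = pair
    candidates = [
        p for p in phrases
        if p != p_i
        and p not in paraphrases_of.get(p_i, set())
        and p_i not in paraphrases_of.get(p, set())
    ]
    return (p_i, rng.choice(candidates))
```

A seeded generator such as `random.Random(0)` can be passed as `rng` to make the sampled negatives reproducible.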
The PPDB database was extracted from bilingual parallel corpora following (Bannard and Callison-Burch, 2005)'s bilingual pivoting method. PPDB comes pre-packaged in 6 sizes: S to XXXL. The smaller packages contain only better-scoring, high-precision paraphrases, while the larger ones aim for high coverage. Additionally, PPDB is broken down into lexical paraphrases (i.e., one word to one word), phrasal paraphrases (i.e., multi-word phrases), and syntactic paraphrases which contain non-terminals. We use the PPDB version 2.0 M-size lexical and phrasal sets, which together contain 3525057 paraphrase pairs. We chose the medium-size sets to incorporate some variability while still favouring higher-quality paraphrases. As for the type of paraphrases, we took only the lexical and phrasal ones, given that our goal is to acquire alternative lexicalisations in terms of wording rather than syntactic variation.
WikiAnswers is a corpus of 18M question-paraphrase pairs collected by (Fader et al., 2013), with 2.4M distinct questions in the corpus. Because these pairs have been labelled collaboratively, the data is highly noisy: (Fader et al., 2013) estimated that only 55% of the pairs were actual paraphrases.
Finally, the DBPP dataset consists of (entity, class) pairs extracted from the DBPedia ontology. They provide a bridge between the entity names appearing in the DBPedia triples and the more generic common nouns which may be used in text.
Using the resources and tools just described, we create a triple/sentence corpus T consisting of 317853 triple/sentence pairs obtained from 53384 KB triples of 149 relations. The paraphrase corpus P contains 3525057 (PPDB), 220998 (WikiAnswers) and 54489 (DBPP) paraphrase pairs.

Examples of positive and negative training pairs:
T
  (t, s): ⟨ARISTOTLE, INFLUENCED, CHRISTIAN PHILOSOPHY⟩, "Christian philosophy who is influenced by Aristotle."
  (t', s): ⟨ARISTOTLE, COMPUTINGMEDIA, CHRISTIAN PHILOSOPHY⟩, "Christian philosophy who is influenced by Aristotle."
P (PPDB)
  (p_i, p_j): ("collaborate", "cooperate")
  (p_i, p_l): ("collaborate", "improving")
  (p_i, p_j): ("is important to emphasize that", "is notable that")
  (p_i, p_l): ("is important to emphasize that", "are using")
P (WikiAnswers)
  (p_i, p_j): ("much coca cola be buy per year", "much do a consumer pay for coca cola")
  (p_i, p_l): ("much coca cola be buy per year", "information on neem plant")
P (DBPP)
  (p_i, p_j): ("Amsterdam", "Place")
  (p_i, p_l): ("Amsterdam", "Novels first published in serial form")

Training. Using a training corpus created as described in the previous section, we learn a similarity function $S_{t/s}$ between triples and candidate lexicalisations which is defined as:

$$S_{t/s}(t, s) = f(t)^\top g(s), \quad \text{with} \quad f(t) = K^\top \phi(t) \quad \text{and} \quad g(s) = W^\top \psi(s)$$

$K \in \mathbb{R}^{n_k \times d}$ and $W \in \mathbb{R}^{n_w \times d}$ are the embedding matrices for KB symbols and for words respectively, with $n_k$ the number of distinct symbols in the knowledge base and $n_w$ the number of distinct word forms in the text corpus. Furthermore, $\phi(t)$ and $\psi(s)$ are binary vectors indicating whether a KB symbol/word is present or absent in t/s. Thus, $f(t)$ and $g(s)$ are the embeddings of t and s, and $S_{t/s}$ scores their similarity by taking their dot product.
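Because the indicator vectors are binary, the embedding of a triple or sentence is simply the sum of the embedding rows of the symbols or words it contains. The following pure-Python sketch illustrates the scoring and ranking on toy values; the two-dimensional matrices, the index dictionaries, the whitespace tokenisation and the choice to skip unknown symbols are all assumptions made for illustration (the paper uses d = 100 and a Keras implementation).

```python
def embed(symbols, matrix, index):
    """Sum the embedding rows of the distinct known symbols (binary indicator times matrix)."""
    vector = [0.0] * len(matrix[0])
    for symbol in set(symbols):
        if symbol not in index:  # assumption: out-of-vocabulary symbols are ignored
            continue
        for i, value in enumerate(matrix[index[symbol]]):
            vector[i] += value
    return vector


def score(triple, sentence, K, kb_index, W, word_index):
    """Dot product between the triple embedding f(t) and the sentence embedding g(s)."""
    f_t = embed(triple, K, kb_index)
    g_s = embed(sentence.lower().split(), W, word_index)
    return sum(a * b for a, b in zip(f_t, g_s))


def rank(triple, candidates, K, kb_index, W, word_index):
    """Sort candidate lexicalisation phrases by decreasing similarity to the triple."""
    return sorted(
        candidates,
        key=lambda c: score(triple, c, K, kb_index, W, word_index),
        reverse=True,
    )


# Toy embeddings with d = 2 (values chosen by hand, not learned).
kb_index = {"ARISTOTLE": 0, "INFLUENCED": 1, "CHRISTIAN PHILOSOPHY": 2}
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_index = {"influenced": 0, "by": 1, "located": 2, "in": 3}
W = [[1.0, 1.0], [0.5, 0.0], [-1.0, 0.0], [0.0, -0.5]]
triple = ("ARISTOTLE", "INFLUENCED", "CHRISTIAN PHILOSOPHY")
```

With these toy values, "influenced by" is ranked above "located in" for the example triple.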
To learn word embeddings which capture the similarity between a triple and a set of paraphrases (rather than just the similarity between a triple and artificially synthesised sentences), we multitask the training of our model with the task of paraphrase detection. That is, the weights of the W matrix for words are learnt with the training of the triple/sentence similarity function $S_{t/s}$ and the training of a similarity function $S_p$ for paraphrases which uses the same embedding matrix W for words and is trained on P, the paraphrase corpus. The phrase similarity function $S_p$ between two natural language phrases $p_i$ and $p_j$ is defined as follows:

$$S_p(p_i, p_j) = g(p_i)^\top g(p_j)$$

Similarly to (Bordes et al., 2014), we train our model using a margin-based ranking loss function so that scores of positive examples should be larger than those of negative examples by a margin of 1. That is, for $S_{t/s}$, we minimize:

$$\sum_{i} \sum_{j} \max\bigl(0,\; 1 - S_{t/s}(t_i, s_i) + S_{t/s}(t_j, s_i)\bigr)$$

where $(t_i, s_i)$ is a positive triple/sentence example and $(t_j, s_i)$ a negative one. Similarly, when training on paraphrase data, the ranking loss function to minimise is:

$$\sum_{i,j} \sum_{l} \max\bigl(0,\; 1 - S_p(p_i, p_j) + S_p(p_i, p_l)\bigr) \qquad (6)$$

where $(p_i, p_j)$ is a positive example from the paraphrase corpus P and $(p_i, p_l)$ a negative one.
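The margin-based ranking loss can be written in a few lines. This is a minimal sketch that operates on already computed similarity scores; the function name and the pairing of each positive score with one negative score are assumptions, and the same function would serve both the triple/sentence task and the paraphrase task.

```python
def margin_ranking_loss(positive_scores, negative_scores, margin=1.0):
    """Sum of hinge terms max(0, margin - S(positive) + S(negative)) over paired examples."""
    return sum(
        max(0.0, margin - pos + neg)
        for pos, neg in zip(positive_scores, negative_scores)
    )
```

A pair contributes nothing once the positive example outscores the negative one by at least the margin: scores of 5.0 and -3.0 give a loss of 0, whereas scores of 2.0 and 1.5 give 0.5.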

Implementation
The model is implemented in Python using the Keras (Chollet, 2015) library with Theano backend.
We initialise the W matrix with pre-trained vectors which already provide a rich representation for words. We use the publicly available GloVe (Pennington et al., 2014) vectors of length 100. These vectors were trained on 6 billion words from Wikipedia and the English Gigaword. We set the dimension d of the K and W matrices to 100. For K we use uniform initialisation.
The size of the vocabulary for the W matrix, i.e., the dimension $n_w$, is 130970 words, counting all words appearing in the T and P sets. The size of the K matrix, i.e., the dimension $n_k$, is 43797, counting both KB entities and relations.
The training for both similarity functions $S_{t/s}$ and $S_p$ is performed with Stochastic Gradient Descent. The learning rate is set to 0.1 and the number of epochs to 5. Training ran for approximately 15 hours.

Experiments
DBPedia is a crowd-sourced knowledge base extracted from Wikipedia and available on the Web in RDF format. Available as Linked Data on the web, the DBPedia knowledge base defines Linked Data URIs for millions of concepts. It has become a de facto central hub of the web of data and is heavily used by systems that employ structured data for applications such as web-based information retrieval or search engines.
Like many other large knowledge bases (e.g., Freebase or Yago) available on the web, DBPedia lacks lexical information stating how DBPedia properties should be lexicalised. We apply our lexicalisation model to DBPedia object properties. We construct candidate lexicalisation sets in the following way.
Candidate Lexicalisations As mentioned in Section 1, we consider two main ways of building sets of candidate lexicalisations for a given property p.
E-LEX_p: Let WKP_p be the set of sentences extracted from Wikipedia which contain at least one mention of two entities that are related in DBPedia by the property p. WKP_p was built using the pre-processing tools of the MATOLL framework (Walter et al., 2013; Walter et al., 2014b). Then E-LEX_p is the corpus of candidate lexicalisations extracted from WKP_p using Reverb.
L-LEX_p: Given WKP, the corpus of Wikipedia sentences, L-LEX_p is the corpus of relation mentions extracted from WKP using Reverb and filtered to contain only mentions which include words that are lexically related to the tokens making up the property name. Lexically related words include all synonyms and all derivationally related words listed in WordNet for a given token.

Evaluation and Results
We compare the output of our lexicalisation method with the following resources and approaches.
DBlexipedia_e: a lexicon automatically inferred from Wikipedia using the method described in (Walter et al., 2013; Walter et al., 2014a; Walter et al., 2014b) (cf. Section 2). Lexical entries are inferred using either the extension of the properties (by retrieving sentences containing entities that are related by a DBPedia property and generalising over the dependency paths that connect them) or synonyms of the words contained in the property name.
PATTY: a lexicon automatically inferred from web data using relation extraction and clustering (cf. (Nakashole et al., 2012)).
QUELO: a lexicon automatically derived using the method described in (Trevisan, 2010) (cf. Section 2). Lexical entries are derived by first tokenizing and PoS tagging property names and then mapping the resulting PoS-tagged sequences to pre-defined mention patterns.
For the quantitative evaluation, we use the lexicon developed manually for DBPedia properties by (McCrae et al., 2011) as a gold standard. We test on a held-out set of 30 properties chosen from DBPedia which were present in the gold standard lexicon, in the other systems we compare with and in the available E-LEX_p corpus. Table 1 lists the set of properties.
We compute precision (Correct/Found), recall (Correct/GOLD) and F1 measure for each of the above resources. Recall is the proportion of (property, lexicalisation) pairs present in GOLD which are present in the resource being evaluated, precision is the proportion of pairs in a resource which are also present in GOLD, and F1 is the harmonic mean of precision and recall (see footnote 9). In our setup though, precision (and therefore F1) values are artificially decreased because the reference lexicon is small (2.4 lexicalisations on average per property) and often fails to include all possible lexicalisations. The number of correct lexicalisations can therefore be under-estimated, while the number of found lexicalisations is usually larger than the number of gold lexicalisations and therefore much larger than the number of correct (= GOLD ∩ Found) lexicalisations.
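The evaluation measures can be sketched as follows, using the "soft" comparison of footnote 9 in a simplified form: plain lowercase substring containment, with stemming omitted. With soft matching, several found lexicalisations may match the same gold entry, so the sketch counts correct found entries for precision and matched gold entries for recall; this split, like the function names and the dictionary layout, is an assumption rather than the exact procedure used for the reported figures.

```python
def soft_match(candidate, gold_entry):
    """Simplified soft comparison: the gold entry is contained in the candidate (no stemming)."""
    return gold_entry.lower() in candidate.lower()


def evaluate(found, gold):
    """Compute precision, recall and F1 over dicts mapping a property to its lexicalisations."""
    n_found = sum(len(entries) for entries in found.values())
    n_gold = sum(len(entries) for entries in gold.values())
    correct_found = 0
    matched_gold = 0
    for prop, gold_entries in gold.items():
        candidates = found.get(prop, [])
        correct_found += sum(
            any(soft_match(c, g) for g in gold_entries) for c in candidates
        )
        matched_gold += sum(
            any(soft_match(c, g) for c in candidates) for g in gold_entries
        )
    precision = correct_found / n_found if n_found else 0.0
    recall = matched_gold / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For a property with found entries "main occupation of", "works at" and "born in" and gold entries "occupation of" and "profession", one found entry out of three is correct and one gold entry out of two is matched, giving a precision of 1/3, a recall of 0.5 and an F1 of 0.4.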
We report results using different sets of lexicalisation candidates (L-LEX, E-LEX, their union and their intersection) and different thresholds or methods for selecting the final set of lexicalisations. These include retrieving the k-best lexicalisations (k=10) versus using an adaptive threshold which varies depending on the size of the set of candidate lexicalisations and on the distribution of its ranking scores. We tried taking all lexicalisations over the median (median), over the mid-range ((min+max)/2) or in the third quartile (Q3). We also tested an alternative ranking technique where the score of each lexicalisation is the product of its similarity score (dot product of the embedding vectors representing the property and the lexicalisation) with the frequency of this particular lexicalisation in the set of candidate lexicalisations (see footnote 10). We rerank the lexicalisations using these new scores and consider only the lexicalisations in the third quartile of the distribution (FreqQ3). Further, if this results in having either fewer than 7 or more than 25 lexicalisations, we ignore the Q3 constraint and take the 7 and 25 best respectively (FreqQ3Limit(7,25)). Table 3 summarises the results.

Footnote 9: To determine whether a given property lexicalisation is correct, i.e., present in GOLD, we use "soft" comparison rather than strict string matching. This consists in checking whether the stemmed gold lexicalisation is contained in a given candidate lexicalisation. For instance, the candidate "main occupation of" and gold "occupation of" are considered as a match.
Footnote 10: In the set of candidate lexicalisations, the same lexicalisation may occur with minor variations. We compute the frequency of a given lexicalisation by removing adjectives and adverbs and counting the number of repeated occurrences after removing these.

Recall. In terms of recall, our results generally outperform QUELO, PATTY and DBlexipedia_e.
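The selection strategies can be sketched with the standard `statistics` module. The sketch reads "in the third quartile" as "strictly above the third-quartile cut point" and relies on the default quartile method of `statistics.quantiles` (Python 3.8 or later); both choices, as well as the function names, are assumptions, since the text does not fix these details.

```python
import statistics


def ranked_names(scored):
    """Return lexicalisations sorted by decreasing score."""
    return [name for name, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]


def select(scored, method):
    """Select from (lexicalisation, score) pairs; needs at least two pairs for 'q3'."""
    scores = [score for _, score in scored]
    if method == "k10":
        return ranked_names(scored)[:10]
    if method == "median":
        threshold = statistics.median(scores)
    elif method == "midrange":
        threshold = (min(scores) + max(scores)) / 2
    elif method == "q3":
        threshold = statistics.quantiles(scores, n=4)[2]
    else:
        raise ValueError(f"unknown method: {method}")
    kept = [(name, score) for name, score in scored if score > threshold]
    return ranked_names(kept)


def freq_q3_limit(scored, frequency, low=7, high=25):
    """Rerank by score times frequency, keep the Q3 selection, bounded to [low, high] items."""
    rescored = [(name, score * frequency[name]) for name, score in scored]
    selected = select(rescored, "q3")
    if len(selected) < low:
        return ranked_names(rescored)[:low]
    if len(selected) > high:
        return ranked_names(rescored)[:high]
    return selected
```

Here `frequency` is assumed to map each normalised lexicalisation to its count in the candidate set, computed as described in footnote 10.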
The low recall score of QUELO shows that simply using patterns based on the property name does not suffice to find appropriate property lexicalisations. This is true in particular of properties such as ROUTEEND where the correct lexicalisation is difficult to guess from the property name.
DBlexipedia_e at k=10 scores lower (0.29) than the corresponding version of our approach (union(k=10), R: 0.38). Interestingly, for our approach, better recall values are consistently obtained using L-LEX, suggesting that many of the verbalisations found in GOLD can be extracted from text that is unrelated to the extension of DBPedia properties. This is a useful feature as it avoids the data sparsity issue which arises when a DBPedia property has either a restricted extension or a small set WKP_p of candidate lexicalisations. Indeed, we found that out of a set of 149 DBPedia properties, the MATOLL corpus did not provide any sentences for 19 of them. In such cases, an approach based only on extensionally related sentences of the property would have zero recall. This is in line with the results of (Walter et al., 2013; Walter et al., 2014a), who observe that such an approach yields a recall of 0.35 whilst combining it with a lexically based approach (using synonyms of the tokens occurring in the property name) permits increasing recall to 0.5.
Finally, although PATTY has a comparatively high recall value (0.59), its precision is very low (0.0015) and versions of our approach with comparable precision (e.g., E-LEX(All)) have a much higher recall (R: 0.80).
Precision. As shown in Table 3, the retrieval approach which gives the best results in terms of both precision and F1 is in fact to take the 10-best. Together with the much lower precision achieved by the random baselines (Random*k=10), this result suggests that the similarity function learned by our model appropriately captures the similarity between DBPedia properties and their lexicalisations.
Unsurprisingly, QUELO has the highest precision as it only guesses lexicalisations based on the tokens making up the property name. For instance, for noun property names like OWNER it produces the following two lexicalisations: "owner" and "owner of"; for verb-based property names like RECORDEDIN it produces the lexicalisation "recorded in". On these two properties, QUELO perfectly coincides with the entries defined in GOLD. This explains the high F1 obtained by QUELO. However, as argued in the previous section, QUELO's approach fails to account for cases where the relation name is indirect or opaque. Moreover, it does not support the generation of alternative lexicalisations. For the property EDUCATION, the gold standard defines the lexical entries "attend", "go to" and "study at", which QUELO fails to produce. DBlexipedia_e has a precision score (0.11) comparable to the corresponding version of our approach (union(k=10), P: 0.09) and PATTY has a very low precision (P: 0.0015). A manual examination of the data shows that the relation extraction approach fails to find a sufficiently large number of distinct property lexicalisations. The lexicalisations found often contain many near repetitions (e.g., "has graduated from, graduated from, graduates") but few distinct paraphrases (e.g., "graduate from, study at").
To better assess the precision of our system, we therefore manually examined its results and annotated all output lexicalisations which were correct but not in the gold. Based on this updated gold, precision for union.FreqQ3Limit7-25 is in fact 0.289. Table 4 shows some example output of our system (for union.FreqQ3Limit7-25). These examples show that our system correctly predicts additional lexicalisations that are absent from GOLD.

Example Output
They also show that our approach can produce both L- and E-related lexicalisations.
Thus for instance, for the property PROGRAMMINGLANGUAGE, our model produces the lexicalisation "programming language for" which is clearly an L-lexicalisation that can be directly derived from the property name. However, it also derives more context-sensitive E-lexicalisations such as "written in", "uses" and "based on" which are not lexically related to the property name but can be found by considering E-related candidate lexicalisations i.e., sentences such as "FastTacker Digit was written in Pascal" which contain entities that are arguments of the PROGRAMMINGLANGUAGE property.
Similarly, the COUNTRY property, whose gold lexicalisation is "located in" (the RDF triple ⟨Sakhalin Oblast, country, Russia⟩ can be verbalised as "Sakhalin Oblast is located in Russia"), is correctly assigned the lexicalisations "located in" and "part of". Interestingly, our approach also yields more specific lexicalisations such as "is a village/commune/town/county in" which may also be correct lexicalisations given the appropriate subject. For instance, "is a town in" is a correct lexicalisation of the COUNTRY property given the triple ⟨Paris, country, France⟩.

Conclusion
We use an embedding-based framework for identifying plausible lexicalisations of KB properties. While embeddings have been much used in domains such as question answering, semantic parsing and relation extraction, they have not been used so far for the lexicalisation task. Conversely, existing approaches to lexicalisation which exploit the similarity between property names and candidate lexicalisations do so on the basis of discrete representations such as WordNet synsets. In contrast, we learn embeddings of words and KB symbols using distant supervision. We show that, when applied to DBPedia object properties, our approach yields results that are competitive with these discrete approaches.
As future work, we plan to conduct a larger scale evaluation. This will include the application of the approach to datatype properties and test on a larger set of properties.
The scoring function used by our approach is based on a bag-of-words representation of natural language phrases. We have observed that pairs of triples and candidate lexicalisation phrases like ⟨AMERICAN FILM INSTITUTE, LOCATION, CALIFORNIA⟩ and "A new city was built on a nearby location" are scored high as they share some highly related words. We plan to explore whether a more complex representation of natural language phrases could remedy this shortcoming.