Projective methods for mining missing translations in DBpedia

Each entry (concept) in DBpedia comes with a set of surface strings (property rdfs:label) which are possible realizations of the concept being described. Currently, only a fifth of the English DBpedia entries have a surface string in French, which severely limits the deployment of Semantic Web Annotation for this language. In this paper, we investigate the task of identifying missing translations, contrasting two projective approaches. We show that the problem is actually challenging, and that a carefully engineered baseline is not easy to outperform.


Introduction
The LOD (Linked Open Data) (Bizer et al., 2009) is conceived as a language-independent resource in the sense that information is represented by abstract concepts to which "human-readable" strings - possibly in different languages - are attached, e.g. via the rdfs:label property in DBpedia. For instance, we can access the abstract concept of computer by natural language queries such as ordinateur (rdfs:label@fr) in French or computer (rdfs:label@en) in English. Thanks to this, the Semantic Web offers the prospect of a truly multilingual World Wide Web (Gracia et al., 2012).
At the core of LOD lies DBpedia (Lehmann et al., 2014), the largest dataset, which constitutes a hub to which most other LOD datasets are linked (as of December 2014, http://lod-cloud.net/). Since DBpedia is (automatically) generated from Wikipedia, which is multilingual, one would expect each concept in DBpedia to be labeled with a French surface string. This is for instance the case of the concept House of Commons of Canada,2 which is labeled in French as Chambre des communes du Canada. One problem, however, is that most labels are currently in English (Gómez-Pérez et al., 2013).
Indeed, the majority of datasets in LOD are primarily generated from the extraction of anglophone resources. DBpedia, the endogenous RDF dataset of Wikipedia, is no exception here, since it proposes labels in French (rdfs:label@fr) for only one fifth3 of its concepts. Of course, all concepts in English Wikipedia have at least one English label. For instance, the concept School life expectancy4 has - at least at the time of writing - no label in French, while durée moyenne de scolarité appears in the (French) article Indice_de_développement_humain,5 and is a good translation of the English term.
This situation comes from the fact that currently, a concept in DBpedia receives as its rdfs:label property in a given language the title of the Wikipedia article which is inter-language linked to the (English) Wikipedia article associated to the DBpedia concept.
The lack of surface strings in a foreign language not only reduces the usefulness of RDF indexing engines such as sig.ma,6 but also limits the deployment of Semantic Web Annotator (SWA) systems, e.g. (Mihalcea and Csomai, 2007; Milne and Witten, 2008). This motivates the present study, which aims at automatically mining French labels for the concepts in DBpedia that do not yet possess one.
Identifying the translations of (English) Wikipedia article titles is partially addressed in the BabelNet project (Navigli and Ponzetto, 2012). In this project, concepts in Wikipedia that are not inter-language linked are translated by applying machine translation to (between 3 and 10) sentences extracted from Wikipedia that contain a link to the article whose title is to be translated. The most frequent translation is finally selected. There are on the order of 500k articles in English Wikipedia that do not link to an article in French and that are not named entities (which typically do not require translation). BabelNet7 provides a translation (not necessarily a good one) for 13% of them. This suggests that the projection of a resource such as DBpedia into French is not yet a solved problem.
In the remainder, we describe the approaches we tested in Section 2. Our experimental protocol is presented in Section 3. Section 4 reports the results we obtained. We conclude in Section 5.

Approaches
Identifying the translations of a term in a comparable corpus - two texts (one in each language of interest) that share similar topics without being in a translation relation - is a challenge that has attracted many researchers. See (Sharoff et al., 2013) for a recent overview of the state of the art in this field. In this work, we investigated several variants of two approaches for extracting translations from a comparable corpus: the seminal approach described in (Rapp, 1995), which uses a seed bilingual lexicon to induce new translations, and the approach of Bouamor et al. (2013), which instead exploits the Wikipedia structure. The latter approach has been shown to significantly outperform the former on a task of translating 110 terms in 4 different domains, making use of medium-sized corpora.8

Standard Approach (STAND)
The idea that the context of a term and that of its translation share similarities that can be used to rank translation candidates has been previously investigated in (Rapp, 1995; Fung, 1998). Since then, many variants have been proposed; see (Sharoff et al., 2013) for a recent discussion. We reproduced this approach in this work. In a nutshell, each term to be translated is represented by a so-called context vector, that is, the set of words that co-occur with this term in the source part of the corpus. An association measure is typically used to score the strength of the correlation between the term and the context words. Each translation candidate (typically each word of the target vocabulary) is similarly represented in the target language. Thanks to a bilingual seed lexicon, the source context vector is projected into a target one.9 This projected target language vector is then compared to the vector of each of the target language candidates by means of a similarity measure.
There are several parameters to the approach, among which the size of the window used to collect co-occurring words, the association and similarity measures, as well as the seed lexicon.
We investigate the impact of the window size in Section 4. We also compare two different association measures, namely the discontinuous odds-ratio (Evert, 2005, p. 86), named ORD hereafter, and the log-likelihood ratio (Dunning, 1993), named LLR, the most popular measures used in this line of work. Both measures (Eq. 1 and 2) are computed directly from the (monolingual) contingency table depicted in Table 1 for two words w1 and w2, where, for instance, O12 stands for the number of times w1 occurs in a window while w2 does not.
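Under the usual definitions of these two measures (the 0.5-smoothed formulation of the discontinuous odds-ratio, and Dunning's formulation of the log-likelihood ratio), both scores can be computed directly from the four cells O11, O12, O21, O22 of the contingency table. The following sketch illustrates this; function names are ours:

```python
import math

def odds_ratio_disc(o11, o12, o21, o22):
    """Discontinuous (0.5-smoothed) log odds-ratio, in the spirit of
    (Evert, 2005): log of the cross-product ratio of the 2x2 table."""
    return (math.log((o11 + 0.5) * (o22 + 0.5))
            - math.log((o12 + 0.5) * (o21 + 0.5)))

def llr(o11, o12, o21, o22):
    """Log-likelihood ratio (Dunning, 1993): 2 * sum of O * log(O / E),
    where E is the expected count under independence of w1 and w2."""
    n = o11 + o12 + o21 + o22
    score = 0.0
    # Each cell is paired with its row and column marginals.
    for obs, row, col in ((o11, o11 + o12, o11 + o21),
                          (o12, o11 + o12, o12 + o22),
                          (o21, o21 + o22, o11 + o21),
                          (o22, o21 + o22, o12 + o22)):
        expected = row * col / n
        if obs > 0:  # a zero cell contributes nothing to the sum
            score += obs * math.log(obs / expected)
    return 2.0 * score
```

Both measures are zero for a table consistent with independence (e.g. all cells equal) and grow as the association between w1 and w2 strengthens.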
We did not investigate the impact of the nature and size of the bilingual seed lexicon, but decided to use one large lexicon comprising 116 354 word pairs, populated from several available resources as well as an in-house bilingual lexicon.10 A similar choice is made in (Bouamor et al., 2013), where a seed lexicon of approximately 120 000 entries is used, and in (Hazem et al., 2013), where the authors use a lexicon of 200 000 entries (before preprocessing).
Since in (Laroche and Langlais, 2010) the best performing variant uses the cosine similarity measure (Eq. 3), we used it in our experiments.11 In the standard approach, the co-occurring words are extracted from all the source documents of the comparable corpus in which the term to translate appears. We name this variant STAND hereafter.
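To make the pipeline concrete, here is a minimal sketch of the standard approach. The helper names are ours, and raw co-occurrence counts stand in for the ORD/LLR association scores for brevity; this is an illustration of the mechanics, not the actual implementation:

```python
import math
from collections import Counter

def context_vector(term, docs, window=3, top_k=1000):
    """Collect words co-occurring with `term` within `window` tokens on
    each side, scored here by raw counts (a stand-in for ORD/LLR)."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != term)
    return dict(counts.most_common(top_k))

def project(src_vec, seed_lexicon):
    """Map a source context vector into the target language by replacing
    each context word with its seed-lexicon translations."""
    proj = Counter()
    for word, score in src_vec.items():
        for trans in seed_lexicon.get(word, []):
            proj[trans] += score
    return dict(proj)

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Ranking a candidate then amounts to comparing, with `cosine`, the projected source vector against the candidate's own target-language context vector.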

Neighbourhood variants (LKI, LKO, CMP and RND)
Since we are interested in translating Wikipedia titles, a natural way of populating the context vector of a term is to consider the occurrences of this term in the article whose title we seek to translate. This avoids populating the context vector with words co-occurring with different senses of the word to translate. We implemented such a variant, which inherently faces the issue that too few occurrences of the term of interest may appear in a single article, especially in our case where the average length of a Wikipedia article is approximately 1 400 words. Therefore we considered variants involving a neighbourhood function, that is, a function that returns a set of Wikipedia articles related to the one under consideration for translation. We investigated three such functions (as well as many combinations of them):
LKI(a) returns the set of articles that have a link pointing to the article a under consideration (in links). For instance, both Computer_Science and Art are articles pointing to Entertainment.
LKO(a) returns the set of articles to which a points (out links). For instance, the article Entertainment points to Party and Fun.
CMP(a) returns the set of articles that are the most similar to a. We used the MoreLikeThis method of the search engine Lucene 12 for this. For instance, Dance and Dance in Indonesia are the top-2 documents returned by this function for the article Entertainment.
For sanity check purposes, we also considered the RND function, which returns random articles. Note that the LKI() and LKO() functions were obtained with the Wikipedia Miner toolkit (Milne and Witten, 2013).
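Operationally, a combination of neighbourhood functions reduces to taking the capped union of the articles they return, together with the article itself; these documents then replace the full collection when populating the source context vector. The sketch below assumes the neighbourhood functions are available as callables (in our experiments LKI/LKO come from Wikipedia Miner and CMP from Lucene's MoreLikeThis; the toy lambdas in the usage example are ours):

```python
def neighbourhood_corpus(article, fn_names, neighbourhoods, max_articles=1000):
    """Return the article itself plus the (deduplicated) union of the
    articles produced by the selected neighbourhood functions, capped at
    `max_articles` documents overall."""
    selected = [article]
    seen = {article}
    for name in fn_names:
        for neighbour in neighbourhoods[name](article):
            if len(selected) >= max_articles:
                return selected
            if neighbour not in seen:
                seen.add(neighbour)
                selected.append(neighbour)
    return selected
```

For instance, with the toy functions `{"LKI": lambda a: ["Computer_Science", "Art"], "LKO": lambda a: ["Party", "Fun"]}`, calling `neighbourhood_corpus("Entertainment", ["LKI", "LKO"], ...)` yields the five articles mentioned in the text.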

Explicit Semantic Analysis (ESA-B)
We also implemented the approach described in (Bouamor, 2014), which has been shown by the author to be more accurate than the aforementioned standard approach. The proposed method is an adaptation of the Explicit Semantic Analysis approach described in (Gabrilovich and Markovitch, 2007).
A term to translate is represented by the titles of the Wikipedia articles in which it appears. The projection of the resulting context vector into the target language is obtained by following the available inter-language links.13 The words of the articles reached this way are candidates for the translation, and are further ranked by a tf-idf scheme. This approach avoids the need for a seed bilingual lexicon, relying instead on the structure of Wikipedia, and more particularly on its multilingualism.
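The steps above can be sketched as follows. The data structures are toy stand-ins for Wikipedia (an index from terms to English article titles, an inter-language link table, and the French article texts), and the exact tf-idf weighting of (Bouamor, 2014) may differ; here, for self-containment, idf is computed over the reached French articles only rather than over the whole collection:

```python
import math
from collections import Counter

def esa_b_candidates(term, en_index, interlang, fr_texts, max_titles=30):
    """ESA-B sketch: represent `term` by the titles of the English articles
    containing it, follow inter-language links to French articles, and rank
    the words of those articles by a tf-idf score."""
    titles = en_index.get(term, [])[:max_titles]      # context vector cap
    fr_titles = [interlang[t] for t in titles if t in interlang]
    docs = [fr_texts[t].split() for t in fr_titles if t in fr_texts]
    if not docs:
        return []
    n = len(docs)
    df = Counter()                                    # document frequency
    for doc in docs:
        df.update(set(doc))
    tf = Counter()                                    # term frequency
    for doc in docs:
        tf.update(doc)
    scored = {w: tf[w] * math.log(1 + n / df[w]) for w in tf}
    return sorted(scored, key=scored.get, reverse=True)
```

The `max_titles` argument corresponds to the meta-parameter discussed next: keeping too many article titles lets semantically drifting articles dominate the candidate list.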
One meta-parameter of this approach is the maximum size of the context vector, that is, the maximum number of article titles kept for describing a term. One might think that considering all the articles in which a term to translate is found is a good idea, but this strategy suffers from some sort of semantic drift. For instance, while translating the term tears, the context vector is populated with articles related to music albums that contain this term in their text content, while the associated French article (when available) almost never contains the translation. We investigate this meta-parameter in Section 4. The other parameters were set as recommended in (Bouamor, 2014). Note that although some articles that share an inter-language link are parallel (Patry and Langlais, 2011), most article pairs are actually only comparable (Hovy et al., 2013).

English terms without translation
The vast majority (82.3%) of articles in the English Wikipedia do not have a link to an article in the French Wikipedia. We are interested in identifying the translations of their titles. Yet, we noticed that many of them actually describe named entities (persons, geographic places, etc.), which typically do not require translation.14 In order to filter out named entities, we applied the BabelNet filter.15 We ended up with a list of 521 895 (18.5%) terms we ultimately seek to translate. In this study, we further narrowed down our interest to unigrams.16 These represent roughly 30% of those English terms.

Reference List
To evaluate our different approaches, we built a test set - a list of English source terms and their reference (French) translations. For this, we randomly sampled pairs of articles in Wikipedia that are inter-language linked. It is accepted that the titles of a pair of inter-language linked articles often constitute good translations (Hovy et al., 2013). Therefore, for each term (title) of our test set, we collected the associated title as a reference translation.
The sampling was done without considering named entities. For this purpose, we only considered article pairs whose English title belongs to the bilingual lexicon we used as a seed lexicon for the STAND approach. Since the frequency of a source term is a key parameter of projective approaches, we also took care to vary the frequency range of the English terms in our test set; more precisely, we gathered terms in several frequency ranges, from infrequent ([1-25]) to very frequent ([1001+]) ones.
We measured that, using a large parallel corpus,17 we could only identify the translations of roughly 1% of those terms, which indicates that parallel data might be of little interest for identifying the translations of Wikipedia article titles.

Evaluation
Our approaches have been configured to produce a ranked list of (at most) 20 candidates for each source (English) term. We compute two metrics to compare them: precision at rank 1 (P@1), which indicates the percentage of terms for which the best ranked candidate is the reference one, and Mean Average Precision at rank 20 (MAP-20), a measure commonly used in information retrieval (Manning et al., 2008) which averages precision at various recall rates.
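With a single reference translation per term, average precision at rank 20 reduces to the reciprocal rank of the reference within the top 20 (and 0 if it is absent), so both metrics can be computed as follows (a sketch; function names are ours):

```python
def p_at_1(ranked_lists, references):
    """Fraction of terms whose top-ranked candidate is the reference."""
    hits = sum(1 for cands, ref in zip(ranked_lists, references)
               if cands and cands[0] == ref)
    return hits / len(references)

def map_at_20(ranked_lists, references):
    """Mean Average Precision at rank 20; with one reference per term,
    each term contributes 1/rank of the reference (0 if not in top 20)."""
    total = 0.0
    for cands, ref in zip(ranked_lists, references):
        for rank, cand in enumerate(cands[:20], start=1):
            if cand == ref:
                total += 1.0 / rank
                break
    return total / len(references)
```

For instance, a system that ranks the reference first, second, and outside the top 20 for three test terms gets P@1 = 1/3 and MAP-20 = (1 + 1/2 + 0) / 3 = 0.5.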

Technical considerations
The standard approach (STAND) can be rather computation- and time-consuming, since any target word in Wikipedia is a potential candidate for a given source term, and we are dealing with a rather large comparable corpus. As an illustration, the word france occurs more than 1 million times in the French Wikipedia, and its context vector potentially contains as many as 136 514 words (considering a context window of 6 words). Therefore, in our experiments, we only consider the first 50 000 occurrences of each term while populating the context vectors. Also, comparing source and target vectors can be time consuming, especially with context vectors of very high dimension. To save time (and memory), we represent a context vector (source or target) by (at most) the 1 000 top-ranked terms according to the association measure being used.

STAND
In some calibration experiments,18 we observed that increasing the size of the window in which we collect the context words introduces noise (see Table 3). The optimal window size was 6 (3 words on each side of the word under consideration, excluding function words), which means that the co-occurring words should be taken in the immediate vicinity of the term to translate. This corroborates the study in (Bullinaria and Levy, 2007). Therefore, we set the value of this meta-parameter to 6 in the remainder.

Table 3: MAP-20 of STAND (ORD) measured on a development set, as a function of the window size (counted in words).
The results of two variants of the standard approach are reported in Table 4 (lines 1 and 2). Clearly, using ORD as an association measure drastically improves performance. This definitely corroborates the findings of Laroche and Langlais (2010). Still, the difference between the two variants is surprisingly high: ORD delivers over six times higher performance than LLR on average, while in the aforementioned work, the difference was much less marked.19 Therefore, we use this association measure in the neighbourhood variants we tested.
We observed in practice a tendency of ORD to reward word pairs that appear often together even though the frequency of each word is very low. Thus, the context vectors gathered with ORD tend to contain rare words that only appear in the context of the article under consideration. Those words offer good discriminative power in our task, thus leading to much higher performance than the context vectors computed by LLR, which tend to gather more general related terms. This tendency can be observed in Figure 1, where ORD leads to a context vector with much more specific words. This observation deserves further investigation.

Figure 1: Top-ranked context words and association scores for ORD (e.g. myringoplasty (16.32), myringa (16.14)) and LLR (e.g. tube (147.6), laser).

A second observation is the strong correlation between the frequency of the term to translate and the performance of the approach. As a matter of fact, the performance for very frequent terms ([1001+]) is more than ten times the one measured on infrequent ones ([1-25]). This is a well-known fact that has been analyzed for instance in (Prochasson and Fung, 2011), where the authors report a precision of 60% for frequent test words (words seen at least 400 times), but only 5% for rare words (seen fewer than 15 times).
Overall, and even if a close comparison is difficult, the results we obtained for STAND are in line with those reported in (Laroche and Langlais, 2010), which also focused on Wikipedia, but mined translations of medical terms. The authors reported a precision at rank one ranging from 20.7% up to 42.3%, depending on the test sets and configurations considered.

Table 4: Precision (at rank 1) and MAP-20 of some variants we tested. Each neighbourhood function was asked to return (at most) 1000 English articles. The ESA-B variant makes use of context vectors of (at most) 30 titles.
As discussed in Section 3.5, due to computational issues, we cut the context vectors of the STAND approach after 1 000 terms. In order to measure how sensitive this cut-off is, we computed a variant where only the top-100 terms are kept (according to the association measure). The results of this variant are reported in line 3 of Table 4. As expected, the performance of the STAND approach drops significantly on average, and especially for very frequent terms ([1001+]).

Neighbourhood variants
We tested our neighbourhood functions as well as several combinations of them. One meta-parameter we investigated is the maximum number of articles returned by a function. We quickly observed that the more articles, the better, something we explain shortly. Thereafter, each function was asked to return at most 1 000 articles. The results obtained by the 3 neighbourhood functions described in Section 2 are reported in lines 4 to 6 of Table 4.
Clearly, all the neighbourhood variants we considered yielded a significant drop in performance, which is disappointing from a practical point of view. This suggests that there is no obvious way to reduce the number of source documents to consider while populating the context vector of the term to translate. One explanation is that, in our implementation, the context vector of each target candidate is computed by considering the full (French) Wikipedia collection. This dissymmetry introduces a mismatch between the source and target context vectors, leading to poor performance. A solution to this problem consists in computing target context vectors online, from a subset of target documents of interest.20 A drawback of this solution is (of course) that the computation must take place for each term to translate. This is left as future work.
At least, the neighbourhood variants we experimented with outperform the one where random documents are sampled (RND): the latter variant could not translate a single term of the test set.

ESA-B
In the default configuration of the approach described in (Bouamor et al., 2013), the authors limit the size of the context vector to 100, which we found suboptimal in our case. We varied the dimension of the context vectors and observed the best value to be 30 (see Table 5). This is the value used in the sequel.

Somewhat contrary to what has been observed in (Bouamor et al., 2013), we observe that ESA-B (P@1 = 0.211) under-performs the STAND approach with the ORD association measure (P@1 = 0.338). One explanation for the difference is that, in (Bouamor et al., 2013), the authors keep only words such as nouns, verbs and adjectives when populating the context vectors, while we do not. This filter might interfere with the observation made in Section 4.1 that, with ORD, rare words (which might be filtered out, such as URLs or even spelling mistakes) tend to appear in the context vectors, and happen to help in discriminating translations.

Analysis
If we consider the 528 test terms that appear over a hundred times in Wikipedia ([101+]), a test case where both approaches perform well, STAND (ORD) correctly translates 362 of them (considering the top-20 solutions), while ESA-B translates 351. If we had an oracle telling us which variant to trust for a given term, we could correctly translate 431 terms (81.6%), which indicates the complementarity of the two approaches.
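The oracle figure is simply the size of the union of the per-system sets of correctly translated terms; note that the reported counts (362, 351, union 431) imply an overlap of 362 + 351 - 431 = 282 terms. A sketch, with synthetic index sets chosen to match those counts:

```python
def oracle_coverage(correct_a, correct_b, n_terms):
    """Upper-bound accuracy of a per-term oracle combination: the fraction
    of terms correctly translated by at least one of the two systems."""
    return len(correct_a | correct_b) / n_terms
```

For example, `oracle_coverage(set(range(362)), set(range(80, 431)), 528)` models two systems with 362 and 351 correct terms overlapping on 282, and yields 431/528, i.e. about 0.816.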
We analyzed the 97 terms for which our two approaches failed to propose the reference translation in the top-20 candidates and we identified a number of recurrent cases we describe hereafter.
First, English terms do appear in the French Wikipedia material, and sometimes end up being selected by the STAND approach. This is, for instance, the case for the term barber (reference translation: coiffeur), for which STAND proposed the translation barber.
Second, we observed that STAND (and, less systematically, ESA-B) often proposes morphological variants of the reference translation. For instance, coudre (a verbal form) is the first translation proposed for sewing, while the reference translation is the noun couture.
Third, it happens in a few cases that the reference translation, although correct, is very specific. Of course, this penalizes both approaches equally. For instance, the reference translation of veneration is dulie, while the first translation produced by STAND is vénération (a correct translation).
Also, and by far the most frequent case, we observed a thesaurus effect in both approaches, where terms related to the source one are proposed. This effect can be observed in Figure 2, in which the top candidates proposed by several variants we tested are reported for the terms exemplified in Table 2.
Finally, it happens that the top-20 candidates proposed are just noise (e.g. noun translated as spora).

Discussion
In this study, we implemented and compared two projective approaches for identifying the translations of terms that correspond to articles in English Wikipedia that do not have an inter-language link to an article in the French Wikipedia. Doing so would potentially help in enriching the rdfs:label property attached to concepts in DBpedia, thus easing semantic annotation in French. One method is a variant of the popular approach pioneered by Rapp (1995), which uses a bilingual seed lexicon for mapping source context vectors onto target ones; the other has been proposed in (Bouamor et al., 2013), and was shown by its authors to deliver state-of-the-art performance.
Among other things, our experiments suggest that the STAND approach performs as well as or better than the ESA-B approach, and that combining both approaches, especially for high-frequency terms, might improve our results.
We also observed the well-known bias of those approaches toward frequent terms, which urges the need for methods adapted to less frequent terms. As future work, we will investigate the solution proposed in (Prochasson and Fung, 2011), which is one step in this direction.
Also, the projective methods we considered embed several meta-parameters whose values are sensitive. It is therefore difficult to know a priori which configuration to choose for a given task without conducting costly calibration experiments.
Having at our disposal a number of different test cases would help in developing expertise in doing so.
With the hope that this might help, the code and resources used in this work will be available at this URL: http://rali.iro.umontreal.ca/rali/?q=fr/Ressources