BabelDomains: Large-Scale Domain Labeling of Lexical Resources

In this paper we present BabelDomains, a unified resource which provides lexical items with information about domains of knowledge. We propose an automatic method that uses knowledge from various lexical resources, exploiting both distributional and graph-based clues, to accurately propagate domain information. We evaluate our methodology intrinsically on two lexical resources (WordNet and BabelNet), achieving a precision over 80% in both cases. Finally, we show the potential of BabelDomains in a supervised learning setting, clustering training data by domain for hypernym discovery.


Introduction
Since the early days of Natural Language Processing (NLP) and Machine Learning, generalizing a given algorithm or technique has been extremely challenging. One of the main factors behind this issue in NLP has been the wide variety of domains for which data are available (Jiang and Zhai, 2007). Algorithms trained on the business domain cannot be expected to work well in biology, for example. Moreover, even if we manage to obtain a training set balanced across domains, our algorithm may not be as effective on a specific domain as it would have been had it been trained on that target domain. This issue has become even more challenging and significant with the rise of supervised learning techniques. These techniques are fed with large amounts of data and ought to be able to generalize to various target domains. Several studies have proposed regularization frameworks for domain adaptation in NLP (Daumé III and Marcu, 2006; Daumé III, 2007; Lu et al., 2016). In this paper we tackle this problem but approach it from a different angle. Our main goal is to integrate domain information into lexical resources, which, in turn, could enable a semantic clusterization of training data by domain, a procedure known as multi-source domain adaptation (Crammer et al., 2008). In fact, adapting algorithms to a particular domain has already proved essential in standard NLP tasks such as Word Sense Disambiguation (Magnini et al., 2002; Agirre et al., 2009; Faralli and Navigli, 2012), Text Categorization (Navigli et al., 2011), Sentiment Analysis (Glorot et al., 2011; Hamilton et al., 2016), or Hypernym Discovery (Espinosa-Anke et al., 2016), inter alia.
The domain annotation of WordNet (Miller et al., 1990) has already been carried out in previous studies (Magnini and Cavaglià, 2000;Bentivogli et al., 2004;Tufiş et al., 2008). Domain information is also available in IATE 1 , a European Union inter-institutional terminology database. The domain labels of IATE are based on the Eurovoc thesaurus 2 and were introduced manually. The fact that each of these approaches involves manual curation/intervention limits their extension to other resources, and therefore to downstream applications.
We, instead, have developed an automatic hybrid distributional and graph-based method for encoding domain information into lexical resources. In this work we aim at annotating BabelNet (Navigli and Ponzetto, 2012), a large unified lexical resource which integrates WordNet and other resources 3 such as Wikipedia and Wiktionary, augmenting the initial coverage of WordNet by two orders of magnitude.

Methodology
Our goal is to enrich lexical resources with domain information. To this end, we rely on BabelNet 3.0, which merges both encyclopedic (e.g. Wikipedia) and lexicographic resources (e.g. WordNet). The main unit in BabelNet, similarly to WordNet, is the synset, which is a set of synonymous words corresponding to the same meaning (e.g., {midday, noon, noontide}). In contrast to WordNet, a BabelNet synset may contain lexicalizations coming from different resources and languages. Therefore, the annotation of a BabelNet synset could directly be expanded to all its associated resources.
As domains of knowledge, we opted for domains from the Wikipedia featured articles page 4 . This page contains a set of thirty-two domains of knowledge. 5 Table 1 shows the set of thirty-two domains. Each domain has a set of associated Wikipedia pages (127 on average). For instance, the Wikipedia pages Kolkata and Oklahoma belong to the Geography domain 6 . Our methodology for annotating BabelNet synsets with domains is divided into two steps: (1) we apply a distributional approach to obtain an extensive distribution of domain labels in BabelNet (Section 2.1), and (2) we complement this first step with a set of heuristics to improve the coverage and correctness of the domain annotations (Section 2.2).

Distributional similarity
We exploit the distributional approach of Camacho-Collados et al. (2016, NASARI). NASARI 7 provides lexical vector representations for BabelNet synsets. In order to obtain a full distribution for each BabelNet synset, i.e. a ranked list of associated domains, each domain is first associated with a vector. To this end, the Wikipedia pages from the featured articles page are leveraged as follows. First, all Wikipedia pages associated with a given domain are concatenated into a single text. Second, a lexical vector is constructed for each text as in NASARI, by applying lexical specificity over the bag-of-words representation of the text. Finally, given a BabelNet synset s, the similarity between its NASARI lexical vector and the lexical vector of each domain is calculated using the Weighted Overlap comparison measure (Pilehvar et al., 2013). 8 This enables us to obtain, for each BabelNet synset, a score for each domain label denoting its importance. For brevity, we refer to the domain with the highest similarity score for a given synset as its top domain. For instance, the top domain of the BabelNet synset corresponding to rifle is Warfare, while its second domain is Engineering. In order to increase precision, initially we only tag those BabelNet synsets whose maximum score is higher than 0.35. 9
4 https://en.wikipedia.org/wiki/Wikipedia:Featured_articles
5 Biography domains are not considered.
6 For simplicity we refer to each domain by its first word (e.g., Geography to refer to Geography and Places).
7 http://lcl.uniroma1.it/nasari/
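The scoring and thresholding step can be illustrated with a minimal sketch. It assumes sparse vectors stored as plain dictionaries and uses the rank-based Weighted Overlap formulation as we recall it from Pilehvar et al. (2013); the exact variant used for BabelDomains may differ (some formulations take a square root of this value, for instance). The function names `weighted_overlap` and `tag_synset` and the toy vectors are hypothetical and only serve the illustration.

```python
def weighted_overlap(v1, v2):
    """Rank-based Weighted Overlap between two sparse vectors {dimension: weight}.

    Assumed formulation: sum over shared dimensions of 1 / (rank1 + rank2),
    normalized by the best achievable value for that many shared dimensions.
    """
    overlap = set(v1) & set(v2)
    if not overlap:
        return 0.0
    # Rank the dimensions of each vector by decreasing weight (rank 1 = heaviest).
    rank1 = {dim: r for r, dim in enumerate(sorted(v1, key=v1.get, reverse=True), 1)}
    rank2 = {dim: r for r, dim in enumerate(sorted(v2, key=v2.get, reverse=True), 1)}
    numerator = sum(1.0 / (rank1[d] + rank2[d]) for d in overlap)
    denominator = sum(1.0 / (2 * i) for i in range(1, len(overlap) + 1))
    return numerator / denominator


def tag_synset(synset_vector, domain_vectors, threshold=0.35):
    """Return the top domain of a synset, or None if its score is not above the threshold."""
    scores = {name: weighted_overlap(synset_vector, vec)
              for name, vec in domain_vectors.items()}
    top = max(scores, key=scores.get)
    return top if scores[top] > threshold else None


# Toy data (invented for illustration; real NASARI vectors are much larger).
rifle = {"gun": 0.9, "barrel": 0.5, "army": 0.3}
domains = {
    "Warfare": {"army": 0.9, "gun": 0.8, "battle": 0.5},
    "Music": {"guitar": 0.9, "album": 0.8},
}
print(tag_synset(rifle, domains))
```

Raising `threshold` trades coverage for precision: synsets whose best score does not exceed it are left untagged at this stage.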

Heuristics
We additionally propose three heterogeneous heuristics to improve the quality and coverage of domain annotations. These heuristics are applied in cascade (in the same order as they appear in the text) over the labels provided in the previous step.
Taxonomy. This first heuristic is based on the BabelNet hypernymy structure, which is an integration of various taxonomies: Wikidata, WordNet and MultiWiBi (Flati et al., 2016). The main intuition is that, in general, synsets connected by a hypernymy relation tend to share the same domain (Magnini and Cavaglià, 2000). 10 This taxonomy-based heuristic is intended both to increase coverage and to refine the quality of the synsets annotated by the distributional approach. First, if all the hypernyms (at least two) of a given synset share the same top domain, this synset is annotated (or re-annotated) with that domain. Second, if the top domain of an annotated synset differs from that of at least two of its hypernyms, this domain tag is removed.
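The two taxonomy rules can be sketched as follows. This is an illustrative reading of the description above, not the released implementation: the data structures (a dictionary of hypernym lists and a dictionary of current domain tags), the function name `taxonomy_heuristic`, and the treatment of unannotated hypernyms (they block the first rule and are ignored by the second) are our assumptions.

```python
def taxonomy_heuristic(synset, hypernyms, domains):
    """Return the domain of `synset` after applying the two taxonomy rules.

    hypernyms: dict mapping a synset to the list of its hypernym synsets.
    domains:   dict mapping already-tagged synsets to their top domain.
    """
    hyper_domains = [domains.get(h) for h in hypernyms.get(synset, [])]
    current = domains.get(synset)

    # Rule 1: all hypernyms (at least two) share the same top domain.
    # Assumption: an unannotated hypernym prevents the rule from firing.
    if (len(hyper_domains) >= 2
            and hyper_domains[0] is not None
            and all(d == hyper_domains[0] for d in hyper_domains)):
        return hyper_domains[0]

    # Rule 2: the current tag disagrees with at least two annotated hypernyms.
    if current is not None:
        disagreeing = sum(1 for d in hyper_domains if d is not None and d != current)
        if disagreeing >= 2:
            return None  # the tag is removed

    return current


# Toy example (invented): both hypernyms agree, so the synset is re-annotated.
hypernyms = {"rifle": ["firearm", "weapon"], "x": ["h1", "h2", "h3"], "y": ["h1"]}
domains = {"rifle": "Engineering", "firearm": "Warfare", "weapon": "Warfare",
           "x": "Art", "h1": "Music", "h2": "Media", "y": "Art"}
print(taxonomy_heuristic("rifle", hypernyms, domains))
```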
Labels. Some Wikipedia page titles include general information about the page between parentheses. This text between parentheses is known as a label. For example, the Wikipedia page Orange (telecommunications) has telecommunications as its label. In BabelNet these labels are kept in the main senses of many synsets, and this information is valuable for deciding their domain. For those synsets sharing the same label, we create a distribution of domains, i.e. each label is associated with its corresponding synsets and their domains. Then, we tag (or retag) all the synsets containing the given label, provided that the most frequent domain for that label accounts for more than 80% of all instances containing that label. 11 As an example, before applying this heuristic the label album contained 14192 synsets which were pre-tagged with a given domain. Of those 14192 synsets, 14166 were pre-tagged with the Music domain (99.8%). Therefore, the remaining 26 synsets and all other synsets containing the album label were tagged or re-tagged with the Music domain.
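A minimal sketch of the label heuristic is given below. It assumes that each synset carries at most one label and that the 80% share is computed over the synsets of that label which already have a domain tag, as in the album example; the function name `label_heuristic` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def label_heuristic(labels, domains, ratio=0.8):
    """Retag every synset of a label when one domain dominates that label.

    labels:  dict mapping a synset to its parenthetical label.
    domains: dict mapping already-tagged synsets to their domain.
    Returns a new dict of domain tags; the input is left untouched.
    """
    synsets_by_label = defaultdict(list)
    for synset, label in labels.items():
        synsets_by_label[label].append(synset)

    updated = dict(domains)
    for label, synsets in synsets_by_label.items():
        counts = Counter(domains[s] for s in synsets if domains.get(s) is not None)
        tagged = sum(counts.values())
        if tagged == 0:
            continue
        domain, frequency = counts.most_common(1)[0]
        # Assumption: strict "more than 80%" over the pre-tagged synsets.
        if frequency > ratio * tagged:
            for s in synsets:
                updated[s] = domain
    return updated


# Toy data (invented): nine of the ten tagged "album" synsets are Music.
labels = {f"album{i}": "album" for i in range(11)}
labels.update({"film0": "film", "film1": "film"})
domains = {f"album{i}": "Music" for i in range(9)}
domains.update({"album9": "Media", "film0": "Media", "film1": "Art"})
result = label_heuristic(labels, domains)
print(result["album9"], result["album10"], result["film1"])
```

In the toy data the untagged synset `album10` and the outlier `album9` both receive the dominant domain, while the `film` label is left unchanged because no domain exceeds the ratio.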
Propagation. In this last step we propagate the domain annotations over the BabelNet semantic network. First, given an unannotated input synset, we gather the set of all its neighbours in the BabelNet semantic network. Then we retrieve the domain associated with the highest number of synsets among all annotated synsets in the set. Similarly to the previous heuristic, if the number of synsets of that domain reaches 80% of the whole set, we tag the input synset with that domain. Otherwise, we repeat the process with the second-level neighbours and, if no domain is found, with the third-level neighbours.
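The propagation step can be sketched as a bounded breadth-first expansion. Several details are not fixed by the description and are assumptions here: level-k neighbours are taken to be the synsets at exactly distance k, the 80% share is measured against the whole neighbour set (including unannotated synsets), and the network is stored as an adjacency dictionary. The names `neighbours_at_level` and `propagate` are hypothetical.

```python
from collections import Counter


def neighbours_at_level(graph, start, level):
    """Synsets at exactly `level` hops from `start` in an adjacency dict."""
    seen = {start}
    frontier = {start}
    for _ in range(level):
        reached = set()
        for node in frontier:
            reached.update(graph.get(node, ()))
        frontier = reached - seen
        seen |= frontier
    return frontier


def propagate(graph, domains, synset, ratio=0.8, max_level=3):
    """Return a domain for an unannotated synset, or None if none dominates."""
    for level in range(1, max_level + 1):
        ring = neighbours_at_level(graph, synset, level)
        counts = Counter(domains[n] for n in ring if domains.get(n) is not None)
        if not counts:
            continue
        domain, frequency = counts.most_common(1)[0]
        # Assumption: the share is computed over the whole neighbour set.
        if frequency >= ratio * len(ring):
            return domain
    return None


# Toy network (invented): first-level neighbours disagree, second-level agree.
graph = {"y": ["p", "q"], "p": ["y", "r", "s"], "q": ["y", "s"],
         "r": ["p"], "s": ["p", "q"]}
domains = {"p": "Music", "q": "Media", "r": "Sport", "s": "Sport"}
print(propagate(graph, domains, "y"))
```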

BabelDomains: Statistics and Release
We applied the methodology described in Section 2 to BabelNet 3.0. This led to a total of 2.68M synsets tagged with a domain. Note that this number greatly improves on the number given in previous studies for WordNet. In our approach, in addition to WordNet, we provide annotations for other lexical resources such as Wikipedia or Wiktionary. Table 2 shows some statistics of the synsets tagged in each step of the whole domain annotation process. The largest numbers of annotated synsets were obtained in the first distributional step (1.31M) and the final propagation (1.11M), while the taxonomy and labels heuristics contributed not only to increasing the coverage but also to refining potentially dubious annotations. BabelDomains is available for download at lcl.uniroma1.it/babeldomains. In the release we include a confidence score 12 for each domain label. Additionally, the domain labels have been integrated into BabelNet 13 , both in the API and in the online interface 14 .

Evaluation
We evaluated BabelDomains both intrinsically (Section 4.1) and extrinsically on the hypernym discovery task (Section 4.2).

Intrinsic Evaluation
In this section we describe the evaluation of our domain annotations on two different lexical resources: BabelNet and WordNet. To this end, we used the domain-labeled datasets released by . The WordNet dataset is composed of 1540 synsets tagged with a domain. These domain labels were taken from WordNet 3.0 and manually mapped to the domains of the Wikipedia featured articles page. The BabelNet dataset is composed of 200 synsets randomly extracted from BabelNet 3.0 which were manually annotated with domains. As a comparison system we included a baseline based on Wikipedia (Wikipedia-idf). This baseline first constructs a tf-idf-weighted bag-of-words vector representation of Wikipedia pages and, similarly to our distributional approach, calculates its similarity with the concatenation of all Wikipedia pages associated with a domain in the Wikipedia featured articles page. 15 We additionally compared with WN-Domains-3.2 (Magnini and Cavaglià, 2000; Bentivogli et al., 2004), which is the latest released version of WordNet Domains 16 . However, this approach involves manual curation, both in the selection of seeds and in the correction of errors. In order to enable a fair comparison, we report the results of a system based on its main automatic component. This baseline takes annotated synsets as input and propagates them through the WordNet taxonomy (WN-Taxonomy Prop.). Likewise, we report the results of the same baseline when propagating through the BabelNet taxonomy (BN-Taxonomy Prop.). These two systems were evaluated by 10-fold cross validation on the corresponding datasets. Finally, we include the results of the distributional approach performed in the first step of our methodology (Section 2.1). Table 3 shows the results of our system and four comparison systems.
15 For the annotation of WordNet we used the direct Wikipedia-WordNet mapping from BabelNet.
16 http://wndomains.fbk.eu/
Our system achieves the best overall F-Measure results, with precision figures above 80% on both the WordNet and BabelNet datasets. These results clearly improve on those achieved by applying only the first step of distributional similarity, highlighting that the inclusion of the heuristics was beneficial. These precision figures are especially relevant considering the large set of domains (32) used in our methodology. By analyzing the errors, we realized that our system tends to provide domains close to the gold standard. For instance, the synset referring to entitlement 17 was tagged with the Business domain instead of the gold Law. Other pairs of domains which produced imperfect choices due to their close proximity were Mathematics-Computing and Animals-Biology. As regards the generally low recall on the BabelNet dataset, we found that it was mainly due to the nature of the dataset, which includes many isolated synsets that are hardly used in practice.

Extrinsic Evaluation
One of the main applications of including domain information in sense inventories is to be able to cluster textual data by domain. Supervised systems may be particularly sensitive to this issue (Daumé III, 2007), and therefore training data should be clustered accordingly. In particular, two recent studies found that clustering training data was essential for distributional hypernym discovery systems to perform accurately (Fu et al., 2014; Espinosa-Anke et al., 2016). They discovered that hypernymy information is not encoded equally in different regions of distributional vector spaces, as it is stored differently depending on the domain. Given a term as input, the hypernym discovery task consists of finding its most appropriate hypernym. In this evaluation we followed the approach of Espinosa-Anke et al. (2016, TaxoEmbed), who provide a framework to train a domain-wise transformation matrix (Mikolov et al., 2013) between the vector spaces of terms and hypernyms. As in the original work, we used the sense-level vector space of Iacobacci et al. (2015) and training data from Wikidata. 18 We used the domain annotations of BabelDomains for clustering the training data by domain, and compared this clustering with the one based on the domains obtained through the distributional step, as used in Espinosa-Anke et al. (2016). We additionally included a baseline which did not filter the training data by domain. The training data 19 was composed of 20K term-hypernym pairs for the domain-filtered systems and 200K for the baseline, while the test data was composed of 250 randomly-extracted terms with their corresponding hypernyms in Wikidata. Table 4 shows the results of TaxoEmbed in the hypernym discovery task on the same ten domains 20 evaluated in Espinosa-Anke et al. (2016). Our domain clusterization achieves the best overall results, outperforming the clusterization based solely on distributional information in nine of the ten domains.
The results clearly show the need for a pre-clusterization of the training data, confirming the findings of Espinosa-Anke et al. (2016) and Fu et al. (2014). Training directly without pre-clusterization leads to very poor results, despite being trained on a larger sample. This baseline provides competitive results on Biology only, arguably due to the distribution of Wikidata where biology items are over-represented.
18 We used the code and data available at http://www.taln.upf.edu/taxoembed
19 Training data was extracted randomly from Wikidata, excluding the terms of the test data.
20 Domains are represented by their three initial letters. From left to right in the table: Art, Biology, Education, Geography, Health, Media, Music, Physics, Transport, and Warfare.
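As a reading aid, the domain-wise training used in this evaluation can be written as a least-squares problem. This is a sketch that assumes the standard linear-mapping objective of Mikolov et al. (2013); the notation is ours and not taken from the original work: \Psi_d is the transformation matrix of domain d, T_d the set of term-hypernym training pairs assigned to d, and \mathbf{x}_t and \mathbf{y}_h the vectors of a term t and its hypernym h.

```latex
\Psi_d \;=\; \operatorname*{arg\,min}_{\Psi} \sum_{(t,h) \in T_d} \left\lVert \Psi\,\mathbf{x}_t - \mathbf{y}_h \right\rVert^{2}
```

Under this reading, the baseline without domain filtering corresponds to learning a single matrix over all training pairs, whereas the domain-filtered systems learn one matrix per domain, which differ only in how T_d is obtained (distributional step alone versus BabelDomains).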

Conclusion
In this paper we presented BabelDomains, a resource that provides unified domain information in lexical resources. Our method makes the most of the knowledge available in these resources by combining distributional and graph-based approaches. We evaluated the accuracy of our approach on two resources, BabelNet and WordNet. The results showed that our unified resource provides reliable annotations, improving over various competitive baselines. In the future we plan to extend our set of domains with more fine-grained information, providing a hierarchical structure along the lines of Bentivogli et al. (2004).
As an extrinsic evaluation we used BabelDomains to cluster training data by domain prior to applying a supervised hypernym discovery system. This pre-clustering proved crucial for finding accurate hypernyms in a distributional vector space. We plan to further use our resource for multi-source domain adaptation in other supervised NLP tasks. Additionally, since BabelNet and most of its underlying resources are multilingual, we plan to use our resource in languages other than English.