Latent semantic network induction in the context of linked example senses

The Princeton WordNet is a powerful tool for studying language and developing natural language processing algorithms. Considerable work has gone into developing it further, and one line of research extends it by aligning its expert-annotated structure with other lexical resources. In contrast, this work explores a completely data-driven approach to network construction, forming a wordnet using the entirety of the open-source, noisy, user-annotated dictionary, Wiktionary. Comparing baselines to WordNet, we find compelling evidence that our network induction process constructs a network with useful semantic structure. With thousands of semantically-linked examples that demonstrate sense usage from basic lemmas to multiword expressions (MWEs), we believe this work motivates future research.


Introduction
Wiktionary is a free and open-source collaborative dictionary (Wikimedia). With the ability for anyone to add or edit lemmas, definitions, relations, and examples, Wiktionary has the potential to be larger and more diverse than any printable dictionary. Wiktionary features a rich set of examples of sense usage for many of its lemmas which, when converted to a usable format, support language processing tasks such as sense disambiguation (Meyer and Gurevych, 2010a; Miller and Gurevych, 2014) and MWE identification (Muzny and Zettlemoyer, 2013; Salehi et al., 2014; Hosseini et al., 2016). With natural alignment to other languages, Wiktionary can likewise be used as a resource for machine translation tasks (Borin et al., 2014; Göhring, 2014). With these uses in mind, this work introduces the creation of a network, much like the Princeton WordNet (Miller, 1995; Fellbaum, 1998), that is constructed solely from the semi-structured data of Wiktionary. This relies on the noisy annotations of the editors of Wiktionary to naturally induce a network over the entirety of the English portion of Wiktionary. In doing so, the development of this work produces:
• an induced network over Wiktionary, enriched with semantically linked examples, forming a directed acyclic graph (DAG);
• an exploration of the task of relationship disambiguation as a means to induce network construction; and
• an outline for directions of expansion, including increasing precision in disambiguation, cross-linking example usages, and aligning English Wiktionary with other languages.
We make our code freely available, which includes code to download data, to disambiguate relationships between lemmas, to construct networks from disambiguation output, and to interact with networks produced through this work.
Related work

WordNet
The Princeton WordNet, or WordNet as it's more commonly referred to, is a lexical database originally created for the English language (Miller, 1995; Fellbaum, 1998). It consists of expert-annotated data, and has been more or less continually updated since its creation (Harabagiu et al., 1999; Miller and Hristea, 2006). WordNet is built up of synsets, collections of lexical items that all have the same meaning. For each synset, a definition is provided, and for some synsets, usage examples are also presented. If extracted and attributed properly, the example usages present on Wiktionary could critically enhance WordNet by filling gaps. While significant other work has been done in utilizing Wiktionary to enhance WordNet for purposes like this (discussed in the next sections), this work takes a novel step by constructing a wordnet through entirely computational means, i.e. under the framing of a machine learning task based on Wiktionary's data.

Wiktionary
Wiktionary is an open-source, Wiki-based, open content dictionary organized by the Wikimedia Foundation (Wikimedia). It has a large and active volunteer editorial community and, owing to its noisy, crowd-sourced nature, includes many MWEs, colloquial terms, and their example usages, which could ultimately fill difficult-to-resolve gaps left in other linguistic resources, such as WordNet.
Thus, Wiktionary has a significant history of exploration for the enhancement of WordNet, including efforts that extend WordNet for better domain coverage of word senses (Meyer and Gurevych, 2011; Gurevych et al., 2012; Miller and Gurevych, 2014), automatically derive new lemmas (Jurgens and Pilehvar, 2015; Rusert and Pedersen, 2016), and develop the creation of multilingual wordnets (de Melo and Weikum, 2009; Gurevych et al., 2012; Bond and Foster, 2013). While these works constitute important steps in the usage of extracted Wiktionary contents for the development of WordNet, none before this effort has attempted to utilize the entirety of Wiktionary alone for the construction of such a network.
Most similarly, Wiktionary has been used in a sense-disambiguated fashion (Meyer and Gurevych, 2012b) and to construct an ontology (Meyer and Gurevych, 2012a). Our work does not create an ontology, but instead attempts to create a semantic wordnet. In this context, our work can be viewed as building on notions of sense-disambiguating Wiktionary to construct a WordNet-like resource.

Relation Disambiguation
The task of taking a set of definitions and a semantic relationship, and sub-selecting the definitions that belong to that relationship, is of critical importance to our work. Sometimes called sense linking or relationship anchoring, this task has been previously explored in the creation of machine-readable dictionaries (Krovetz, 1992), ontology learning (Pennacchiotti, 2006, 2008), and German Wiktionary (Meyer and Gurevych, 2010b).
As mentioned above, Meyer and Gurevych explore relationship disambiguation in the context of Wiktionary, motivating a sense-disambiguated Wiktionary as a powerful resource (Meyer and Gurevych, 2012a,b). This task is frequently viewed as a binary classification: Given two linked lemmas, do these pairs of definitions belong to the relationship? While easier to model, this framing can suffer from a combinatorial explosion as all pairs of definitions must be compared. This work attempts to model the task differently, disambiguating all definitions in the context of a relationship and its lemmas.

Framework
This work starts by identifying a set of lemmas, $W$, and a set of senses, $S$. It then proceeds, assuming that $S$ forms the vertex set of a Directed Acyclic Graph (DAG) with edge set $E$, organizing $S$ by refinement of specificity. That is, if senses $s, t \in S$ have a link $(t, s) \in E$, directed to $s$, then $s$ is one degree of refinement more specific than $t$.
Next, we suppose a lemma $u \in W$ has relation $\sim$ (e.g., synonymy) indicated to another lemma $v \in W$. Assuming $\sim$ is recorded from $u$ to $v$ (e.g., from $u$'s page), we call $u$ the source and $v$ the sink. Working along these lines, the model then assumes a given indicated relation $\sim$ is qualified by a sense $s$; this semantic equivalence is denoted $u \overset{s}{\sim} v$. Like others (Landauer and Dumais, 1997; Blei et al., 2003; Bengio et al., 2003), this work assumes senses exist in a latent semantic space. Processing a dictionary, one can empirically discover relationships like $u \overset{s}{\sim} v$ and $v \overset{t}{\sim} w$. But for a larger network structure one must know whether $s = t$ (that is, whether $s$ and $t$ refer to the same relationship), and often neither $s$ nor $t$ is known explicitly. Hence, this work sets up approximations of $s$ and $t$ for comparison. Given a lemma, $u \in W$, suppose a set of definitions, $D_u$, exists and forms the basis for disambiguation of the lemma's senses. We then assume that for any $d \in D_u$ there exist one or more senses, $s \in S$, such that $d \Rightarrow s$, that is, the definition $d$ conveys the sense $s$.
Having assumed a DAG structure for $S$, this work denotes specificity of sense by using the formalism of a partial order, $\preceq$, which, for senses $s, t \in S$ having $s \preceq t$, indicates that the sense $s$ is comparable to $t$ and more specific. Note that, as with any partial order, senses can be, and often are, non-comparable.
Intuitively, a given definition $d$ might convey multiple senses $d \Rightarrow s, t$ of differing specificities, $s \preceq t$. So for a given definition $d$, the model's goal is to find the sense $t$ that is least specific in being conveyed. Satisfying this goal implies resolving the sense identification function, $f : D \to S$, for which any lemma $u \in W$ and definition $d \in D_u$ with $d \Rightarrow s \in S$, it is assured that $s \preceq f(d)$. Since no direct knowledge of any $s \in S$ is assumed for any annotated relationship between lemmas, systems must approximate senses according to the available resources, e.g., definitions or example usages.

Task development
On Wiktionary, every lemma has its own page. Each page is commonly broken down into sections such as languages, etymologies, and parts-of-speech (POS). Under each POS, a lemma features a set of definitions that can be automatically extracted. An example of the word induce on English Wiktionary can be seen in Figure 1.
A significant benefit of using Wiktionary as a resource to build a wordnet lies in the wealth of examples it offers. Examples come in two flavors: basic usage and usage from reference material. Currently, each example is linked to its originating definition and lemma; in future work, however, these examples could be segmented and sense-disambiguated, offering new network links and densely connected example usages.
For each lemma, Wiktionary may offer relationship annotations between lemmas. These relationships span many categories including acronyms, alternative forms, anagrams, antonyms, compounds, conjugations, derived terms, descendants, holonyms, hypernyms, hyponyms, meronyms, related terms, and synonyms. For this work's purposes, only antonyms and synonyms are considered, exploiting their more typical structure on Wiktionary and clear theoretical basis in semantic equivalence to induce a network. Exploring more of these relationships is of interest in future work.
Additionally, a minority of annotations present 'gloss' labels, which indicate the definitions that apply to relationships. So from the data there is some knowledge of exact matching, but due to their limited, noisy, and crowd-sourced nature, the labelings may not cover all definitions that belong. We assume annotations exhibit relationships between lemmas. Finding one, $u \overset{s}{\sim} v$, if $u$ is the source, we assume there exists some definition $d \in D_u$ that implies the appropriate sense: $d \Rightarrow s$. This good-practice assumption models editor behavior as a response to exposure to a particular definition on the source page. Provided this, an editor won't necessarily annotate the relationship on the sink page, even if the sink page has a definition that implies the sense $s$. Thus, our task doesn't require identification of a definition on the sink's page. More precisely, no $d \in D_v$ is assumed to satisfy $d \Rightarrow s$. Altogether, for an annotated relationship the task aims to identify the sense-conveying subset:
$$D_{u \overset{s}{\sim} v} = \{ d \in D_u \cup D_v : d \Rightarrow s \},$$
for which at least one definition must be drawn from $D_u$. Note that the model does not assume that arbitrary $d, d' \in D_{u \overset{s}{\sim} v}$ map through the sense identification function to the same most general sense. Presently, these details are resolved by a separate algorithm (developed below), leaving direct modeling of the sense identification function to future work.

Semantic hierarchy induction
This section outlines preliminary work inferring a semantic hierarchy from pairwise relationships. If $A$ is the set of relationships, a model's output, $C$, will be a collection of sense-conveying subsets, $D_{u \overset{s}{\sim} v}$, in one-to-one correspondence: $A \leftrightarrow C$. So, for all $\mathcal{D} \in \mathcal{P}(C)$, one has a covering of (some) senses by pairwise relationships, $D_{u \overset{s}{\sim} v} \in \mathcal{D}$.
Under our assumptions, any collection of sense-conveying subsets $\mathcal{D} \in \mathcal{P}(C)$ with non-empty intersection restricts to a set of definitions that must convey at least one common sense, $s'$. Notably, $s'$ must be at least as general as any sense qualifying a particular annotated relationship, i.e., $s \preceq s'$ for any $s$ (implicitly) defining any $D_{u \overset{s}{\sim} v} \in \mathcal{D}$. So this work induces the sense-identification function, $f$, through pre-images: for $\mathcal{D} \in \mathcal{P}(C)$, an implicit sense, $s$, is assumed such that $f^{-1}(s) \subseteq \bigcap_{D \in \mathcal{D}} D$. Now, if a covering $\mathcal{D}' \supset \mathcal{D}$ exists with non-empty intersection, then its (smaller) intersection comprises definitions that convey a sense, $s'$, which is more general than $s$. So to precisely resolve $f$ through pre-images the model must 'hole punch' the more-general definitions, constructing the hierarchy by allocating the more general definitions in the intersection of $\mathcal{D}'$ to the more general senses:
$$f^{-1}(s) = \bigcap_{D \in \mathcal{D}} D \;\setminus\; \bigcup_{\mathcal{D}' \supset \mathcal{D}} \; \bigcap_{D \in \mathcal{D}'} D.$$
This allocates each definition to exactly one implicit sense approximation, $t$, which is the most general sense indicated by the definition. Additionally, all senses then fall under a DAG hierarchy (excepting the singletons, addressed below), as set inclusion, $\mathcal{D}' \supset \mathcal{D}$, defines a partial order. This deterministic algorithm for hierarchy induction is presented in Algorithm 1.
Considering the output of a model, C, if d is not covered by C the model assumes a singleton sense. These include definitions not selected during relationship disambiguation as well as the definitions of lemmas that feature no relationship annotations. Singletons are then placed in the DAG at the lowest level, disconnected from all other senses. Figure 2 visually represents this full semantic hierarchy.
Algorithm 1 Construction of semantic hierarchy through pairwise collection.
    [...]
            Append(filtered, p \ defs)
        end for
        Append(levels, filtered)
        prev ← next
    end while
    return levels
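The hole-punching idea described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not a reconstruction of Algorithm 1: the function name `induce_hierarchy`, the representation of each sense-conveying subset as a set of definition identifiers, and the toy identifiers themselves are all hypothetical. The sketch relies on the observation that a definition survives hole punching for exactly one collection of subsets, namely the full set of subsets that contain it.

```python
from collections import defaultdict


def induce_hierarchy(subsets, all_definitions):
    """Group definitions into implicit senses and order them by generality.

    subsets: list of sets of definition ids (one per annotated relationship).
    all_definitions: iterable of every definition id in the dictionary.
    """
    # Signature of a definition: indices of all subsets that contain it.
    signature = defaultdict(set)
    for idx, subset in enumerate(subsets):
        for d in subset:
            signature[d].add(idx)

    # Definitions sharing a signature form the pre-image of one implicit sense.
    senses = defaultdict(set)
    for d, sig in signature.items():
        senses[frozenset(sig)].add(d)

    # A sense whose signature strictly contains another's is more general.
    # Edges are (general, specific) and include transitive pairs.
    edges = {(t, s) for t in senses for s in senses if t > s}

    # Definitions covered by no subset become disconnected singleton senses.
    singletons = [{d} for d in all_definitions if d not in signature]
    return dict(senses), edges, singletons


# Hypothetical toy input: two relationships sharing the definition "b1".
senses, edges, singletons = induce_hierarchy(
    [{"a1", "b1"}, {"b1", "c1"}],
    ["a1", "a2", "b1", "c1"],
)
```

Here the shared definition `b1` is allocated to the more general sense, `a1` and `c1` each keep a more specific sense beneath it, and `a2` is left as a singleton.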

Characteristics of Wiktionary data
Data was downloaded from Wiktionary on 1/23/19 using the Wikimedia REST API 4 . To evaluate performance, a 'gold' dataset was created to compare modeling strategies. In total, 298,377 synonym and 44,758 antonym links were generated from Wiktionary. 'Gold' links were randomly sampled, selecting 400 synonym and 100 antonym links. For each link, source and sink lemmas were considered independently. Definitions were included if they could plausibly refer to the other lemma. This process is supported by the available examples, testing if one lemma can replace the other lemma in the example usages. This dataset was constructed in contrast to other Wiktionary relationship disambiguation tasks due to the modeling differences and desire for more synonym- and antonym-specific evaluations (Meyer and Gurevych, 2012a,b).

Evaluation strategy
This work's evaluation considers precision, recall, and variants of the $F_\beta$ score (biasing averages of precision and recall). As there is selection on both source and sink sides, we consider several averaging schemes. For a final evaluation, each sample is averaged at the side-level and averaged across all relationships. Macro-averages compute an unweighted average, while micro-averages weight performance based on the number of definitions involved in the selection process. Intuitively, micro metrics weight based on size, while macro metrics ignore size (treating all potential links and sides as equal).
4 https://en.wikipedia.org/api/rest_v1/
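To make the metrics concrete, the following Python sketch computes precision, recall, and the F-beta score for one selection, then macro- and micro-averages over several selections. It is an illustration under stated assumptions: the helper names `prf` and `macro_micro_f` are hypothetical, and weighting the micro-average by the number of candidate definitions is our reading of "the number of definitions involved in the selection process".

```python
def prf(selected, gold, beta=1.0):
    """Precision, recall, and F-beta for one selection of definitions."""
    tp = len(selected & gold)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


def macro_micro_f(samples, beta=1.0):
    """samples: list of (selected, gold, candidates) triples of sets."""
    scores = [prf(sel, gold, beta)[2] for sel, gold, _ in samples]
    sizes = [len(cands) for _, _, cands in samples]
    macro = sum(scores) / len(scores)  # every sample counts equally
    micro = sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)  # size-weighted
    return macro, micro


# A "return all" selection over four candidates with one gold definition.
candidates = {"d1", "d2", "d3", "d4"}
p, r, f1 = prf(candidates, {"d1"}, beta=1.0)
_, _, f01 = prf(candidates, {"d1"}, beta=0.1)
```

In this toy case recall is perfect and precision is 0.25, so the harmonic mean rewards the selection noticeably, while a small beta keeps the score close to the precision.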

Setting up baselines
For baselines, we present two types of models, which we refer to as return all and vector similarity. The return all baseline model assumes that for a given relationship link, all definitions belong. This is not intended as a model that could produce a useful network as many definitions and lemmas would be linked that clearly do not belong together. This achieves maximum recall at the expense of precision, demonstrating a base level of precision that must be exceeded.
The vector similarity baseline model takes advantage of semantic vector representations for computing similarity (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014). It computes the similarity between lemmas and definitions, utilizing thresholds that flag to either retain similarities above (max), below (min), or with magnitude above the threshold (abs).
Wiktionary features many MWEs and uncommon lemmas requiring use of a vectorization strategy that allows for handling of lemmas not observed in the representation's training. Thus, FastText was selected for its ability to represent out-of-vocabulary lemmas through its bag-of-character n-gram modeling. To compute similarity between lemmas and definitions, this model aggregates word vectors of the individual tokens present in a definition. Following other work (Lilleberg et al., 2015; Wu et al., 2018), TF-IDF weighted averages of word vectors were utilized in a very simple averaging scheme.
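A TF-IDF weighted average of word vectors can be sketched as follows. This is only an illustration: the toy two-dimensional vectors stand in for FastText embeddings, the helper names `idf_table` and `weighted_average` are hypothetical, and the smoothed IDF form (log ratio plus one) is one common variant chosen here as an assumption rather than the weighting used in this work.

```python
import math
from collections import Counter


def idf_table(tokenized_docs):
    """Inverse document frequency per token, using log(N / df) + 1."""
    n = len(tokenized_docs)
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log(n / count) + 1.0 for tok, count in df.items()}


def weighted_average(tokens, vectors, idf):
    """Average the vectors of `tokens`, weighting each by term count times IDF."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in Counter(tokens).items():
        if tok not in vectors:
            continue  # a subword model would back off to character n-grams here
        w = count * idf.get(tok, 1.0)
        weight_sum += w
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
    return [x / weight_sum for x in total] if weight_sum else total


# Toy definitions and vectors (hypothetical values).
definitions = [["to", "cause"], ["to", "persuade"]]
vectors = {"to": [1.0, 0.0], "cause": [0.0, 1.0]}
idf = idf_table(definitions)
avg = weighted_average(definitions[0], vectors, idf)
```

Because "to" occurs in every definition it receives the smallest weight, so the averaged vector leans toward the rarer, more informative token.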
Initial results indicated that a simple cosine similarity with a linear kernel performed marginally above the return all baseline. Thus, kernel tricks (Cristianini and Shawe-Taylor, 2000) were explored (to positive effect). The Gaussian kernel is often recommended as a good initial kernel to try as a baseline (Schölkopf et al., 1995; Joachims, 1998). It is formulated using a radial basis function (RBF), only dependent on a measure of distance. The Laplacian kernel is a slight variation of the Gaussian kernel, measuring distance as the L1 distance where the Gaussian measures distance as L2 distance. Both kernels fall in the RBF category with a single regularization parameter, γ, and were used in comparison to cosine similarity.
For these kernels, a grid search over γ was conducted from $10^{-3}$ to $10^{3}$ at steps of powers of 10. Similarly, similarity comparison thresholds were considered from −1.0 to 1.0 at steps of 0.05 for all 3 thresholding schemes (min, max, abs).
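The kernels, thresholding schemes, and search grid described above can be sketched as below. This is a hedged illustration rather than the code released with this work: the function names are hypothetical, the Gaussian kernel is written in its common squared-L2 form, and treating the thresholds as inclusive is an assumption.

```python
import math


def gaussian_kernel(u, v, gamma):
    """exp(-gamma * squared L2 distance), the usual RBF form."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def laplacian_kernel(u, v, gamma):
    """exp(-gamma * L1 distance)."""
    return math.exp(-gamma * sum(abs(a - b) for a, b in zip(u, v)))


def keep(similarity, threshold, scheme):
    """Decide whether a definition is retained under a thresholding scheme."""
    if scheme == "max":
        return similarity >= threshold        # retain similarities above
    if scheme == "min":
        return similarity <= threshold        # retain similarities below
    if scheme == "abs":
        return abs(similarity) >= threshold   # retain large magnitudes
    raise ValueError(f"unknown scheme: {scheme}")


# Search grid: gamma over powers of 10, thresholds from -1.0 to 1.0 by 0.05.
GAMMAS = [10.0 ** k for k in range(-3, 4)]
THRESHOLDS = [round(-1.0 + 0.05 * i, 2) for i in range(41)]
SCHEMES = ("min", "max", "abs")
```

A full search would score every (γ, threshold, scheme) combination on the gold links and keep the best configuration per metric.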
When selecting a final model, $F_1$ scores were not considered as recall scores outweighed precision under a simple harmonic mean. This resulted in models with identical performance to the return all model or worse. Instead, models were considered against full-precision and $F_{0.1}$ scores.

Semantic Structure Correlation
Creating a wordnet solely from Wiktionary's noisy, crowd-sourced data raises the question: Does the generated network structure resemble the structure present in Princeton's WordNet? To get a sense of this, we compare the capacities of each of these resources as a basis for semantic similarity modeling (using Pearson correlation (Pearson, 1895)). This work considers three notions of graph-based semantic similarity that are present in WordNet: path similarity (PS), Leacock Chodorow similarity (LCH) (Leacock and Chodorow, 1998), and Wu Palmer similarity (WP) (Wu and Palmer, 1994).
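For reference, the three similarity measures and the Pearson correlation can be written as small Python functions. This is a sketch under assumptions: the functions take precomputed path lengths and depths rather than a graph, the conventions (edge-counted path length, depths counted from a root of depth 1) follow common implementations and are not confirmed by this work, and all function names are hypothetical.

```python
import math


def path_similarity(path_len):
    """PS: inverse of one plus the shortest-path length between two senses."""
    return 1.0 / (1.0 + path_len)


def lch_similarity(path_len, max_depth):
    """LCH: negative log of the node-counted path length scaled by twice the depth."""
    return -math.log((path_len + 1.0) / (2.0 * max_depth))


def wup_similarity(depth_lcs, depth_a, depth_b):
    """WP: depth of the least common subsumer relative to the two sense depths."""
    return 2.0 * depth_lcs / (depth_a + depth_b)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In the comparison, one sequence would hold a metric computed on the induced network and the other the same metric computed on WordNet, over a shared list of lemma pairs.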
The point of this experiment is not to enforce a notion that this network should mirror the structure of WordNet. Given Wiktionary's size, it likely possesses a great deal of information not represented by WordNet (explored in our other experiment on word similarity, Sec. 5.3). But if there is some association between the semantic representation capacities of these two networks, we may draw some insight into a more basic question: "has this model produced some relevant semantic structure?"
For this experiment, only nouns and verbs are considered as they are the only POS for which WordNet defines these metrics. Additionally, these metrics are defined at the synset level. There is no direct mapping between synsets in our network and WordNet; therefore, scores are considered at a lemma level. By computing values of all pairs of synsets between lemmas, three values per metric are generated: minimum, maximum, and average. Additionally, only lemmas that differ in minimum and maximum similarity are retained, restricting the experiment to the most polysemous portions of the networks.

Baseline model performance
Table 1 shows baseline model performance on the relationship disambiguation task and highlights model parameters. During evaluation, the Laplacian kernel was found to consistently outperform the Gaussian kernel. For this reason, this work presents the scores from the return all baseline and two variants of the Laplacian kernel model, one optimized for precision and the other for $F_{0.1}$.
Note that in the synonym case, max-threshold selection performed best, while in the antonym case min- and abs-threshold fared better. This aligns well with the notion that while synonyms are semantically similar, antonyms are semantically anti-similar, an interesting consideration for future model development.
Overall, from the scores in Table 1 one can see that the vector similarity models improve over the return all, but that there is much work to be done to further improve precision and recall.

Comparison against WordNet
WordNet publishes several statistics that one can use for quantitative comparison with the network constructed herein. Reviewing the count statistics shows that Wiktionary is an order of magnitude larger than WordNet and that Wiktionary features 344,789 linked example usages to WordNet's 68,411.
Polysemy. Table 2 reports polysemy statistics. Despite the difference in creation processes, the induced networks do not have polysemy averages drastically different from WordNet.
In comparing the three networks induced, there is a common theme of increase in polysemy when shifting from recall to precision. This makes sense because the return all model will merge all possible lemmas that overlap in relationship annotations, resulting in lower polysemy statistics, whereas a precision-based model will result in pair-wise clusters that do not overlap as broadly, resulting in more complex hierarchies.
Structural differences. Intentionally, the presented notion of a semantic hierarchy functions similarly to the hypernym connections within WordNet. Moving up the semantic hierarchy produces sense approximations from definitions that are more general, and moving down the hierarchy produces more specific senses. However, in the induced networks, this notion is applied to every POS, whereas WordNet only produces these connections for nouns and verbs. An example taken from the $F_{0.1}$ network is that of the adjective good (referring to Holy) being subsumed by a synset featuring the adjective proper (referring to suitable, acceptable, and following the established standards).

Word Similarity
In previous works, WordNet and Wiktionary have been used to create vector representations of words. A common method for evaluating the quality of word vectors is performance on word similarity tasks. Performance on these tasks is evaluated through Spearman's rank correlation (Spearman, 2010) between cosine similarity of vector representations and human annotations.
Using Explicit Semantic Analysis (ESA), a technique based on concept vectors, our network constructs vectors using a word's tf-idf scores over concepts, as has been done in prior works (Gabrilovich and Markovitch, 2007; Zesch et al., 2008; Meyer and Gurevych, 2012b). We define our concepts as senses of the $F_{0.1}$ network and compute cosine similarity in this representation.
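The ESA representation can be illustrated with a short Python sketch. It is a toy version under assumptions: each concept (sense) is represented by a bag of tokens, the tf-idf variant (relative term frequency times log inverse concept frequency) is one common choice and not necessarily the one used here, and the names `concept_vectors` and `cosine` as well as the toy senses are hypothetical.

```python
import math
from collections import Counter


def concept_vectors(concepts):
    """Map each word to a sparse vector of tf-idf scores over concepts.

    concepts: dict of concept id -> list of tokens describing that concept.
    """
    n = len(concepts)
    df = Counter(tok for toks in concepts.values() for tok in set(toks))
    vectors = {}
    for cid, toks in concepts.items():
        for tok, count in Counter(toks).items():
            tf = count / len(toks)
            idf = math.log(n / df[tok])
            vectors.setdefault(tok, {})[cid] = tf * idf
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Three toy senses (hypothetical content).
senses = {
    "s1": ["good", "fine"],
    "s2": ["good", "proper"],
    "s3": ["bad", "poor"],
}
vecs = concept_vectors(senses)
sim_related = cosine(vecs["good"], vecs["fine"])
sim_unrelated = cosine(vecs["good"], vecs["bad"])
```

Words that share senses obtain overlapping concept vectors and therefore a positive cosine similarity, while words with no sense in common score zero.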
In analyzing these results, the $F_{0.1}$ network performs well. Against other ESA methods, it is highly competitive, achieving the highest performance in two datasets. When strictly comparing performance against ESA with WordNet as the source, it has approximately equal or better performance in all datasets except YP-130. We hypothesize that this is due to a lack of precision in verb disambiguation, reinforced by the low polysemy seen above. Additionally, the work from Zesch et al. (2008) evaluated on subsets of the data in which all three resources had coverage. In their work, YP-130 performance is computed for only 80 of the 130 pairs.
Comparing $F_{0.1}$ to latent word vectors, it has the highest performance on noun datasets and is competitive on WS-353. While not directly comparable, it achieves this through 26 million tokens of structured text in contrast to billions of tokens of unstructured text that train latent vectors.
[...] on noun similarity tasks, we hypothesize that this indicates better semantic structure for nouns than for verbs, further emphasizing that a possible limitation of the current baseline produced is its lack of precision when it comes to polysemous verbs. However, the positive correlation values seen for nouns, coupled with noun similarity performance, offer strong indications that the $F_{0.1}$ network does provide useful semantic structure that can be further improved through better modeling.

Future work
Here, several directions are highlighted along which we see this work being extended.
Better models. The development of more accurate models for predicting definitions involved in the pair-wise relations will produce more interesting and useful networks, especially with the magnitude of examples of sense usage. Precision of verb relations seems to be a critical component of a better model.
Supervision. Relationship prediction is currently unsupervised. While it is an interesting task to model in this fashion, crowd sourcing the annotation of this data would be possible through services like Amazon Mechanical Turk. This would allow for the potential of exploring supervised models for predicting relationship links, particularly for relationships like synonymy and antonymy which are familiar concepts for a broad community of potential annotators.
WordNet semi-supervision. Another logical transformation of this task would be to use WordNet to inform the induction of a network in a semi-supervised fashion. There are many ways to go about this such as using statistics from WordNet to create a loss function, or using the structure of WordNet as a base. As this work aimed to create a network solely from the data of Wiktionary, these ideas were not explored. However, using WordNet in this fashion is one of the directions of greatest interest for exploration in the future.
Sense usage examples. The examples present in Wiktionary have only begun to be used in this work. When examples are pulled, the source definition and lemma are linked. However, these examples have the potential to be linked to other senses and lemmas. This would yield an immense amount of structured, sense-usage data that could be used for many machine learning tasks.
Multilingual networks. Wiktionary has been explored as a multilingual resource in previous works (de Melo and Weikum, 2009; Gurevych et al., 2012; Meyer and Gurevych, 2012b; Bond and Foster, 2013) largely due to the natural alignment across languages. Extending this approach to a multilingual setting could prove to be extremely useful for machine translation, and could allow low resource languages to benefit from alignment with other languages that have more annotations.

Conclusion
This paper introduced the idea of constructing a wordnet solely using the data from Wiktionary. Wiktionary is a powerful resource, featuring millions of pages that describe lemmas, their senses, example usages, and the relationships between them. Previous work has explored aligning resources like this with other networks like the Princeton WordNet. However, no work has fully explored the idea of building an entire network from the ground up using just Wiktionary.
This work explores simple baselines for constructing a network from Wiktionary through antonym and synonym relationships, and compares the induced networks with WordNet, finding similar structures and statistics. These comparisons highlight future directions of particular interest, including but not limited to improving network modeling, linking more semantic examples, and reinforcing network construction using expert-annotated networks like WordNet.
As conducted, this work is an initial step in transforming Wiktionary from an open-source dictionary into a powerful tool, dataset, and framework, with the hope of driving and motivating further work in studying languages and developing language processing systems.