Metaheuristic Approaches to Lexical Substitution and Simplification

In this paper, we propose using metaheuristics—in particular, simulated annealing and the new D-Bees algorithm—to solve word sense disambiguation as an optimization problem within a knowledge-based lexical substitution system. We are the first to perform such an extrinsic evaluation of metaheuristics, for which we use two standard lexical substitution datasets, one English and one German. We find that D-Bees has robust performance for both languages, and performs better than simulated annealing, though both achieve good results. Moreover, the D-Bees–based lexical substitution system outperforms state-of-the-art systems on several evaluation metrics. We also show that D-Bees achieves competitive performance in lexical simplification, a variant of lexical substitution.


Introduction
Lexical substitution is a special case of automatic paraphrasing in which the goal is to provide contextually appropriate replacements for a given word, such that the overall meaning of the context is maintained. The task has applications in question answering, text summarization, sentence compression, information extraction, machine translation, and natural language generation (Androutsopoulos and Malakasiotis, 2010). It is also frequently employed as an in vivo evaluation of word sense disambiguation (WSD) systems (McCarthy and Navigli, 2009; Toral, 2009; Miller et al., 2015), because while lexical substitution requires words to be sense-disambiguated, it does not impose use of a predefined sense inventory.
Past work in WSD, whether or not it forms part of a lexical substitution system, has employed a wide range of approaches (Agirre and Edmonds, 2007). Supervised methods usually achieve the best results, but at the tremendous cost of producing manually annotated training data specific to the language and domain. Knowledge-based and unsupervised methods rely only on pre-existing resources such as machine-readable dictionaries and raw corpora. Though generally less accurate, they have the advantage of being more flexible and more adaptable to new languages and domains. For knowledge-based methods, this has been especially true since the advent of large, multilingual, collaboratively constructed resources such as Wikipedia and Wiktionary (Zesch et al., 2008).
In this paper, we present two novel approaches to lexical substitution which are knowledge-based, generally language-independent, and use a combination of traditional wordnets and Wiktionary. The first approach uses simulated annealing (Kirkpatrick et al., 1983), which was first proposed for use in WSD by Cowie et al. (1992) but has attracted relatively little attention since then. The second approach uses D-Bees (Abualhaija and Zimmermann, 2016), a relatively new, biologically inspired disambiguation algorithm that models swarm intelligence. Both algorithms are metaheuristic (Talbi, 2009) in that they treat WSD as an optimization problem and modify heuristic (approximate) solutions to avoid entrapment in local optima. Ours is the first extrinsic evaluation of any metaheuristic approaches to WSD in a lexical substitution setting.
We evaluate and compare both approaches on two lexical substitution datasets, one English and one German. We find that both approaches perform well, with D-Bees in particular exceeding state-of-the-art performance in many tasks. We also apply the systems to lexical simplification, a variant of lexical substitution in which the goal is to provide substitutes which are easier to understand. Here, too, we find that D-Bees performs near or above the state of the art.

Lexical Substitution and Simplification
In lexical substitution, a system is given a word in context and tasked with producing a list of words that could be substituted for the word without altering the overall meaning. For example, given the word "bright" in the sentence "Einstein was a bright man," valid substitutes would include "sharp" and "intelligent", but not "shiny" or "luminous", even though the latter two are synonymous with "bright" in other contexts. It is generally expected that the list of substitutes be ordered by acceptability. Most lexical substitution systems therefore comprise two distinct phases: generation, in which the system assembles a set of suitable substitutes for the target word, and ranking, in which the system orders them according to how well they fit the context.
There have been a number of organized evaluation campaigns for lexical substitution systems, including the English-language task at SemEval-2007 (McCarthy and Navigli, 2009) and the German task at GermEval 2015 (Miller et al., 2015). These campaigns provide standardized datasets where a large number of word-context combinations have been manually annotated with acceptable substitutes. Systems are evaluated by comparing their output to this gold standard, using any or all of three scoring methodologies:
• In the best methodology (McCarthy and Navigli, 2009), systems are allowed to suggest as many substitutes as they wish. However, the credit for each guess is normalized by the total number of guesses. The best guess should be placed first in the list. Across the entire dataset, four metrics are calculated: recall (R), mode recall (R_m), precision (P), and mode precision (P_m).
• In out of ten (OOT) (McCarthy and Navigli, 2009), systems suggest up to ten substitutes, though neither the exact number nor the order of these is important. This methodology uses minor variations of best's R, R_m, P, and P_m.
• Generalized average precision (GAP) (Kishida, 2005) uses a single metric to score a fully ranked list of substitutes. Unlike OOT, GAP is sensitive to the relative positions of the correct and incorrect substitutes in the list.
For reasons of space, we do not provide detailed explanations and formulas for the nine metrics, but refer readers to the cited papers.
Lexical simplification is a variant of lexical substitution in which the correct ranking is determined not just by the substitutes' contextual fitness but also by their simplicity. (For example, rare words are generally considered to be more complex, as readers are less likely to be familiar with their meanings.) As with other types of text simplification, lexical simplification can be used to make complex texts understandable by a wider range of readers, such as children or second language learners.
To date there has been one shared task in lexical simplification. Its main evaluation metric is based on Cohen's (1960) κ. Two post-hoc evaluation metrics are also used. The first, top-ranked (TRnk), evaluates the simplest set of substitutes that is ranked first by the system, compared with the top-ranked set of substitutes in the gold standard. This represents the intersection between the first substitute set found by the system with the first set in the gold standard. The intersection should include at least one substitute. The second metric, recall at n (R@n), is the ratio of candidates from the top n sets of substitutes to those in the gold standard, where 1 ≤ n ≤ 3. For a given n, the contexts with at least n + 1 substitutes in the gold standard are considered.

Word Sense Disambiguation, Optimization, and Metaheuristics
Word sense disambiguation, the task of determining which of a word's meanings is the one intended in a given context, is a prerequisite for generating substitutes in knowledge-based lexical substitution. There are many different approaches to WSD; for our purposes it is convenient to define it as an optimization problem where the aim is to disambiguate a sequence of words simultaneously (Abualhaija and Zimmermann, 2016): Let W = (w_1, w_2, ..., w_n) be a sequence of n words to be disambiguated, and σ = (s_1, s_2, ..., s_n) the corresponding sequence of senses for each word. Let S = {σ_1, ..., σ_m} be the set of all sequences of senses that represent sense combinations of the words in W. Then the objective is to find arg max_{σ ∈ S} score(σ), where score(σ) is the score assigned to a sequence of senses according to some measure of semantic similarity, such as those surveyed by Zesch and Gurevych (2010).
WSD as an optimization problem is NP-hard. This can be worked around by using metaheuristics, which are approximate, tractable algorithms that find near-optimal solutions. Metaheuristics can be either single-solution or population-based search methods. The former manipulate and transform a single solution, giving more focus to the promising regions. Population-based methods work on multiple solutions, distributing their focus and exploring several regions of the search space simultaneously.
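To make the size of the search space concrete, the following minimal sketch solves the optimization problem exhaustively on a toy example. The sense labels, gloss terms, and the pairwise-overlap score are hypothetical stand-ins chosen for illustration, not data or measures from our system. The number of combinations is the product of the per-word sense counts, which is why exhaustive search is only feasible for very short contexts.

```python
from itertools import combinations, product

# Hypothetical toy inventory: candidate sense labels per word and a set of
# gloss terms per sense. None of this is taken from WordNet or GermaNet.
CANDIDATES = {
    "bank": ["bank#river", "bank#money"],
    "deposit": ["deposit#sediment", "deposit#money"],
    "interest": ["interest#curiosity", "interest#money"],
}
GLOSS = {
    "bank#river": {"sloping", "land", "river", "water"},
    "bank#money": {"financial", "institution", "money", "deposit"},
    "deposit#sediment": {"matter", "settled", "water", "river"},
    "deposit#money": {"money", "placed", "financial", "account"},
    "interest#curiosity": {"sense", "concern", "curiosity"},
    "interest#money": {"charge", "borrowing", "money"},
}


def score(sigma):
    """Sum of pairwise gloss overlaps; a stand-in for any similarity measure."""
    return sum(len(GLOSS[a] & GLOSS[b]) for a, b in combinations(sigma, 2))


def exhaustive_wsd(words):
    """Return the arg max over S, plus |S|, by enumerating every combination."""
    search_space = list(product(*(CANDIDATES[w] for w in words)))
    return max(search_space, key=score), len(search_space)


best, size = exhaustive_wsd(["bank", "deposit", "interest"])
print(best, score(best), size)
```

With three words of two senses each there are only eight combinations; a metaheuristic replaces the enumeration in `exhaustive_wsd` with a guided search over the same space.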

Approach
We investigate two knowledge-based, language-independent approaches to lexical substitution, whose main difference lies in the metaheuristic WSD component preceding the generation phase. Both approaches use a top-down generation process, in which the target word is first disambiguated in context with respect to a particular sense inventory, and then used to suggest a list of substitutes. In the following subsections, we describe the two disambiguation components and the common substitute generation and ranking components. (See overview in Figure 1.)

Disambiguation with Simulated Annealing
Simulated annealing (Kirkpatrick et al., 1983) is a single-solution algorithm in which a randomly created solution is iteratively modified until a "good-enough" solution is found. To apply it to WSD, we use essentially the same setup as Cowie et al. (1992). We start with a randomly initialized sense combination σ_0 = (s_1, s_2, ..., s_n) from a given sense inventory, with one sense for each word in the context. We then retrieve the glosses for each sense, preprocess them via lemmatization and stop word removal, and give each remaining term a score of n − 1 if it appears n times. We calculate the configuration's redundancy, R_0, by summing up all the scores. In other words, R_0 is the lexical overlap between sense definitions. The aim of simulated annealing is to maximize this overlap, or more precisely to minimize the energy function E = 1/(1 + R), as defined by Cowie et al. (1992).
In this iterative process, each iteration makes a random change to the configuration σ_i to produce σ_{i+1}, for which the corresponding energy E_{i+1} is computed. If E_{i+1} < E_i (i.e., ∆E < 0), then the new configuration replaces the old configuration for the next iteration. Otherwise, the new configuration might still be accepted with probability Pr = exp(−∆E/T), where T is initially set to 1 but replaced with 0.9T for each subsequent iteration. This way, the algorithm risks exploring poor-looking paths that might nonetheless yield better results in the long run, and the earlier the iterations are, the greater the probability that a poor path is followed. In our experiments we iterate up to 30 times.
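The procedure can be sketched as follows, taking the energy to be E = 1/(1 + R) as in Cowie et al. (1992). The toy glosses, the function names, and the choice to change exactly one randomly picked word per iteration are assumptions made for illustration; the sketch is not our actual implementation.

```python
import math
import random
from collections import Counter

# Hypothetical toy data: already lemmatized, stop-word-free gloss terms.
CANDIDATES = {
    "bank": ["bank#river", "bank#money"],
    "deposit": ["deposit#sediment", "deposit#money"],
    "interest": ["interest#curiosity", "interest#money"],
}
GLOSS = {
    "bank#river": ["sloping", "land", "river", "water"],
    "bank#money": ["financial", "institution", "money", "deposit"],
    "deposit#sediment": ["matter", "settled", "water", "river"],
    "deposit#money": ["money", "placed", "financial", "account"],
    "interest#curiosity": ["sense", "concern", "curiosity"],
    "interest#money": ["charge", "borrowing", "money"],
}


def redundancy(sigma):
    """A term occurring n times across the chosen glosses contributes n - 1."""
    counts = Counter(term for sense in sigma for term in GLOSS[sense])
    return sum(n - 1 for n in counts.values())


def energy(sigma):
    """E = 1 / (1 + R): more overlap means lower energy."""
    return 1.0 / (1.0 + redundancy(sigma))


def anneal(words, iterations=30, t0=1.0, cooling=0.9, seed=0):
    rng = random.Random(seed)
    sigma = [rng.choice(CANDIDATES[w]) for w in words]  # random start
    current = energy(sigma)
    t = t0
    for _ in range(iterations):
        # Random change: re-draw the sense of one randomly chosen word.
        i = rng.randrange(len(words))
        proposal = list(sigma)
        proposal[i] = rng.choice(CANDIDATES[words[i]])
        proposed = energy(proposal)
        delta = proposed - current
        # Accept improvements always, deteriorations with prob. exp(-dE/T).
        if delta < 0 or rng.random() < math.exp(-delta / t):
            sigma, current = proposal, proposed
        t *= cooling  # T is replaced with 0.9T after every iteration
    return sigma, current


labelling, final_energy = anneal(["bank", "deposit", "interest"])
print(labelling, final_energy)
```

Because the temperature shrinks geometrically, a configuration that raises the energy is much more likely to be accepted in the first few iterations than in the last ones.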

Disambiguation with D-Bees
D-Bees (Abualhaija and Zimmermann, 2016) is a population-based algorithm inspired by bee colony optimization (BCO) (Teodorović, 2009). BCO models the foraging behaviour of honey bees, where thousands of individuals with limited knowledge collaborate to maximize their collective benefit. In nature, bees fly around their hive to look for nectar and pollen. When they find it, they return to the hive and perform a dance to advertise its location and quality to the others. The observers then decide whether to remain committed to their own path or to abandon it in favour of one of the advertised paths. BCO simulates this method through a multi-agent decentralized system. D-Bees starts by choosing one of the target words as the hive, which spawns bee agents and sends them to other words in the context. The number of bee agents equals the number of candidate senses of the hive; each bee agent starts off with one of these senses in its memory. For each word it visits, the bee disambiguates it by randomly selecting a candidate sense, building up a path of senses and maintaining a running total similarity score. This forward pass continues until a set number of moves is reached.
The bee then makes a backward pass to the hive and exchanges its partial solution with the other agents on the virtual dancing floor. Each bee then determines whether it should stick to its path or adopt that of another bee; this is accomplished through loyalty and recruiting probability functions that depend mainly on the quality of the partial solutions. On the next forward pass, the bees resume their searches from the ends of their chosen paths. The forward and backward passes are alternated until there are no more words to be disambiguated. The bee agent with the best solution determines the final sense labelling of all words in the context.
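The alternation of forward and backward passes can be illustrated with the simplified sketch below. Several details are assumptions made only to keep the example short and runnable: the hive is the first word, all bees visit the remaining words in the same order, the running score adds the similarity between each new sense and the previous one on the path, and the loyalty and recruiting probabilities are simple ratios of path scores. The actual functions used by D-Bees are described by Abualhaija and Zimmermann (2016) and may differ.

```python
import random

# Hypothetical toy data; similarity is plain gloss-term overlap.
CANDIDATES = {
    "bank": ["bank#river", "bank#money"],
    "deposit": ["deposit#sediment", "deposit#money"],
    "interest": ["interest#curiosity", "interest#money"],
}
GLOSS = {
    "bank#river": {"sloping", "land", "river", "water"},
    "bank#money": {"financial", "institution", "money", "deposit"},
    "deposit#sediment": {"matter", "settled", "water", "river"},
    "deposit#money": {"money", "placed", "financial", "account"},
    "interest#curiosity": {"sense", "concern", "curiosity"},
    "interest#money": {"charge", "borrowing", "money"},
}


def similarity(a, b):
    return len(GLOSS[a] & GLOSS[b])


def d_bees_sketch(words, moves, seed=0):
    rng = random.Random(seed)
    hive, others = words[0], words[1:]
    # One bee per candidate sense of the hive word.
    bees = [{"path": [s], "score": 0} for s in CANDIDATES[hive]]
    for start in range(0, len(others), moves):
        # Forward pass: each bee labels up to `moves` further words at random.
        for bee in bees:
            for word in others[start:start + moves]:
                sense = rng.choice(CANDIDATES[word])
                bee["score"] += similarity(bee["path"][-1], sense)
                bee["path"].append(sense)
        # Backward pass: bees compare partial solutions at the hive.
        best = max(b["score"] for b in bees)
        advertised = [(list(b["path"]), b["score"]) for b in bees]
        weights = [score + 1 for _, score in advertised]  # +1 avoids all-zero
        for bee in bees:
            loyalty = (bee["score"] + 1) / (best + 1)  # assumed loyalty function
            if rng.random() > loyalty:
                # Assumed recruiting: adopt a path with prob. ~ its quality.
                path, score = rng.choices(advertised, weights=weights)[0]
                bee["path"], bee["score"] = list(path), score
    winner = max(bees, key=lambda b: b["score"])
    return dict(zip(words, winner["path"])), winner["score"]


context = ["bank", "deposit", "interest"]
labelling, total = d_bees_sketch(context, moves=max(1, len(context) // 3))
print(labelling, total)
```

In this simplification the best bee always stays loyal to its own path, while weaker bees become increasingly likely to abandon theirs as the quality gap grows.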
In experiments on separate tuning datasets, we determined the number of moves in the forward pass to be one-third the number of context words. For the calculation of semantic similarity, we use a variant of the adapted Lesk algorithm (Banerjee and Pedersen, 2002). For each sense, we build a textual representation by concatenating its gloss with those of its hyper- and hyponyms. We then calculate the lexical overlap between the textual representations of the two senses being compared.
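The similarity computation can be sketched as below. The inventory, glosses, and stop word list are hypothetical, and the overlap is a plain count of shared tokens; this is a simplification and does not reproduce the phrase-based scoring of Banerjee and Pedersen (2002).

```python
from collections import Counter

STOP_WORDS = {"a", "and", "for", "of", "or", "the"}

# Hypothetical mini-inventory: gloss plus hypernym/hyponym links per sense.
INVENTORY = {
    "bright#smart": {"gloss": "having quick intelligence",
                     "hyper": ["intelligent#1"], "hypo": []},
    "sharp#smart": {"gloss": "quick and keen intelligence",
                    "hyper": ["intelligent#1"], "hypo": []},
    "bright#light": {"gloss": "emitting or reflecting light",
                     "hyper": [], "hypo": []},
    "intelligent#1": {"gloss": "having the capacity for thought and reason",
                      "hyper": [], "hypo": ["bright#smart", "sharp#smart"]},
}


def extended_gloss(sense):
    """Tokens of the sense's gloss plus those of its hyper- and hyponyms."""
    entry = INVENTORY[sense]
    related = entry["hyper"] + entry["hypo"]
    texts = [entry["gloss"]] + [INVENTORY[r]["gloss"] for r in related]
    tokens = " ".join(texts).lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


def similarity(sense_a, sense_b):
    """Number of tokens shared by the two extended glosses (with multiplicity)."""
    shared = Counter(extended_gloss(sense_a)) & Counter(extended_gloss(sense_b))
    return sum(shared.values())


print(similarity("bright#smart", "sharp#smart"))
print(similarity("bright#light", "sharp#smart"))
```

Including the glosses of neighbouring synsets lets two senses with short glosses still overlap through a shared hypernym, as the first pair does here.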

Substitute Generation
Once the target word is disambiguated with respect to a particular sense inventory, we generate an unordered list of substitutes (to be subsequently ordered by the ranking module). The sense inventory we use for disambiguation is WordNet 3.1 (Fellbaum, 1998) for our English tasks, and GermaNet 10.0 (Hamp and Feldweg, 1997; Henrich and Hinrichs, 2010) for the German one. These are expert-built resources in which words representing the same concept are grouped together into synsets; synsets are in turn linked into a network by semantic relations such as hypernymy and meronymy.
In preliminary experiments on generating substitutes, we varied two independent parameters: which lexical-semantic resources to use as the source of substitutes, and which semantic relations to follow from the disambiguated synset.
With respect to the first parameter, we tried drawing substitutes from the disambiguation inventory (WordNet or GermaNet) alone, and also drawing additional substitutes from Wiktionary. Our use of Wiktionary as a complementary resource is motivated by Meyer and Gurevych (2012), who found its coverage to be complementary to those of expert-built resources, and by Henrich and Hinrichs (2012), who found that using information from both GermaNet and Wiktionary improved WSD performance. We used a relatively simple, Lesk-like method for mapping senses from WordNet/GermaNet to Wiktionary.
For the second parameter, we tried one setup in which we took all synonyms found in the disambiguated synset and in its hypernyms, and one in which we additionally pulled in synonyms from the hyponyms and all other related synsets (except antonyms). The first setup was informed by the annotation guidelines of the lexical substitution datasets, which indicate that it is permissible to suggest substitute terms that are more generic but not more specific. The second setup was informed by the analyses of Kremer et al. (2014) and Miller et al. (2016), which found, contrarily, that other semantic relations, including hyponyms, were a fruitful source of substitutes.
We obtained the best overall results when using both WordNet/GermaNet and Wiktionary, and when following semantic relations of all types (other than antonymy), to build the substitute list. We therefore used this setup for all our lexical substitution and simplification experiments.

Ranking
The final step of lexical substitution is to rank the substitutes. Our method, like those employed in previous lexical substitution tasks, assumes that a substitute's suitability depends on the type of its semantic relation to the target word. We therefore order the substitutes as follows: synonyms, hypernyms, hyponyms, other relations. Within each semantic relation type, we sort the substitutes first by source (first WordNet/GermaNet, then Wiktionary), and then secondarily by reverse frequency in a large corpus. In preliminary experiments, we found that this method was generally better than simply sorting the entire substitute list by reverse frequency. To determine lemma frequency, we use the same frequency lists used to construct the original datasets: WaCky (Baroni et al., 2009) for German, and BNC (Burnard, 2007) for English.
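This ordering amounts to a three-part sort key, as in the sketch below. The candidate lemmas and frequency counts are invented for illustration, and we read "reverse frequency" here as most frequent first, which is an assumption of the sketch.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    lemma: str
    relation: str  # "synonym", "hypernym", "hyponym", or "other"
    source: str    # "wordnet" (or GermaNet) vs. "wiktionary"


RELATION_ORDER = {"synonym": 0, "hypernym": 1, "hyponym": 2, "other": 3}
SOURCE_ORDER = {"wordnet": 0, "wiktionary": 1}


def rank(candidates, frequency):
    """Sort by relation type, then source, then descending corpus frequency."""
    def key(c):
        return (RELATION_ORDER[c.relation],
                SOURCE_ORDER[c.source],
                -frequency.get(c.lemma, 0))
    return [c.lemma for c in sorted(candidates, key=key)]


# Hypothetical candidates and made-up frequency counts.
candidates = [
    Candidate("intelligent", "hypernym", "wordnet"),
    Candidate("clever", "synonym", "wiktionary"),
    Candidate("brilliant", "synonym", "wordnet"),
    Candidate("precocious", "hyponym", "wordnet"),
    Candidate("smart", "synonym", "wordnet"),
]
frequency = {"smart": 900, "intelligent": 800, "clever": 700,
             "brilliant": 400, "precocious": 50}

ranked = rank(candidates, frequency)
print(ranked)
```

Note that "clever" is placed after "brilliant" despite its higher frequency, because source takes precedence over frequency within a relation type.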

Dataset and Baselines
In our experiments, we use the data from GermEval 2015 (Miller et al., 2015), a shared task for German-language lexical substitution. It is split into a training and a test set of 1040 and 1000 sentences, respectively, from the German edition of Wikipedia. Each sentence in the dataset contains one of 75 unique target words (25 nouns, 25 verbs, and 25 adjectives); in the test set, ten sentences are provided for each of the nouns and adjectives, and twenty for each verb. Miller et al. (2015) report results of several naïve baselines, the best-performing of which are weighted sense (Toral, 2009) and top-ranked synonym (McCarthy and Navigli, 2009). Neither baseline makes any attempt to disambiguate the target word; rather, they build a substitute list by gathering synonyms of all possible senses of the target, as well as synonyms of closely related senses such as hypernyms, and then ranking these words by their frequency (either within the list itself or in a large corpus). We consider these two naïve baselines as reasonable lower bounds.
The more challenging baseline performance comes from the best-performing participating systems at GermEval 2015, which represent the state of the art in German-language lexical substitution. One of these systems (Hintz and Biemann, 2015) is a supervised, bottom-up approach inspired by previous English-language work by Szarvas et al. (2013a). It first retrieves a list of substitutes from various lexicons, then applies a maxent classifier to determine whether each substitute fits the context. The second system (Jackov, 2015) is based on techniques from machine translation. It first disambiguates the input text by mapping German words to concepts represented by WordNet synsets. It then produces and scores various parsing hypotheses, and selects the synonyms and hypernyms of the target in the best-scoring hypothesis.
Table 1 shows the results of the baselines described above, along with those of our basic D-Bees- and simulated annealing-based systems, and an enhanced version of the D-Bees system that we describe below. Both our basic systems outperform the prior state of the art for the four OOT metrics, with the D-Bees-based system performing slightly better than the one using simulated annealing. However, neither system was able to beat Hintz and Biemann (2015) for the GAP and best metrics.

Results and Analysis
In light of this gap, we modified the D-Bees-based system to account for some idiosyncrasies of our German-language resources:
• Where GermaNet provided additional spellings of a synonym (e.g., "wacklig" for "wackelig"), we placed the variant spellings at the end of the substitute list. This prevented the top ranks of the list from being overloaded with nearly identical terms.
• Where our resources provided gender-specific variants of a synonym, we filtered out those that did not match the gender of the target. For example, when building the substitute list for "Meisterin" (female champion), we exclude "Meister" (male champion), even though GermaNet lists it as a synonym.
• To control for Wiktionary's lack of consistency, we filtered out Wiktionary-derived synonyms where the synonymy relation was not symmetric. For example, the Wiktionary entry for "Likör" gives "Crème" as a synonym, but the entry for "Crème" does not give "Likör", so when building a substitute list for "Likör", we do not include "Crème".
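The symmetry filter in the last item can be sketched as follows. The dictionary is a toy stand-in for Wiktionary's synonym links: the "Likör"/"Crème" entries mirror the example above, while "wort_a" and "wort_b" are invented placeholders for a mutually linked pair.

```python
def symmetric_synonyms(word, synonyms):
    """Keep only synonyms whose own entry lists `word` as a synonym in turn."""
    return [s for s in synonyms.get(word, []) if word in synonyms.get(s, [])]


# Toy synonym links (hypothetical except for the Likör/Crème asymmetry).
WIKTIONARY_SYNONYMS = {
    "Likör": ["Crème"],
    "Crème": [],
    "wort_a": ["wort_b"],
    "wort_b": ["wort_a"],
}

print(symmetric_synonyms("Likör", WIKTIONARY_SYNONYMS))
print(symmetric_synonyms("wort_a", WIKTIONARY_SYNONYMS))
```

A synonym with no entry of its own is also dropped, since the reverse link cannot be confirmed.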
With these resource-specific enhancements, the D-Bees system achieves state-of-the-art performance not only for OOT but also for GAP, and performs only slightly worse than Hintz and Biemann (2015) for best. (This is an impressive result considering that Hintz and Biemann (2015) is a supervised system while ours is based solely on external knowledge bases and does not require any training data.) We also examined its performance by part of speech. We found that it remains the  (2015) on some best metrics.

Dataset and Baselines
Our English-language data is taken from the SemEval-2007 shared task (McCarthy and Navigli, 2009). That task uses a sample of 201 target words (nouns, verbs, adjectives and adverbs); for each word, ten context sentences are selected from the English Internet Corpus (Sharoff, 2006). Five human annotators provided up to three substitutes for each target. The dataset is split into a training set (300 sentences) and a test set (1710 sentences).
McCarthy and Navigli (2009) provide results for the aforementioned "top-ranked synonyms" algorithm as a lower bound on performance. State-of-the-art performance across the nine evaluation metrics is represented by the top-performing systems at SemEval-2007 (Giuliano et al., 2007; Hassan et al., 2007; Yuret, 2007; Zhao et al., 2007) and by several later systems (Biemann and Riedl, 2013; Melamud et al., 2015). Of these systems, only Yuret (2007) is supervised. (We are aware of several further lexical substitution systems, namely Moon and Erk (2013), Ó Séaghdha and Korhonen (2014), Roller and Erk (2016), Sinha and Mihalcea (2011), Szarvas et al. (2013b), and Thater et al. (2010) as reimplemented by Kremer et al. (2014), though they do not report results on the full SemEval-2007 test set, or else do not report any of the same metrics we do, or else are concerned only with ranking but not generating substitutes.)
Table 2 shows the results for the state-of-the-art and naïve baselines, along with results of our two basic systems and, as before, an enhanced version of the D-Bees system. Our systems' performance is generally much lower here than on the German-language data, with D-Bees failing to exceed the state of the art.

Results and Analysis
As with our German experiments, we tried modifying the D-Bees-based system to work around the language-specific problems we observed. The most significant of these adaptations are as follows:
• Our analysis suggested that WordNet's notoriously fine sense granularity was adversely affecting the WSD process. We therefore modified D-Bees to perform "soft" WSD (Ramakrishnan et al., 2004), meaning that we allow it to select several different senses as the correct ones (in our case, up to five). To compensate for the larger number of substitution candidates, we limit the ranked list of substitutes to 20. (This hearkens back to the bottom-up approaches defined in §3.) Substitutes generated from the best disambiguation solution are ranked highest.
• In contrast to German, English lexical substitutes are often drawn from indirect hypernyms (Kremer et al., 2014;Miller et al., 2016). (This too may be an artifact of WordNet's fine granularity.) We therefore extended our substitute search to two levels of hypernyms.
• The glosses provided by WordNet sometimes consist of a list of equivalent terms which do not appear in the list of synonyms. For example, WordNet defines one sense of the adverb "right" as "precisely, exactly", though it does not actually list those words as synonyms. We therefore include as the lowest-ranked substitutes those words from the target's gloss that match its part of speech.
• As WordNet contains no hypernymy relations for adjectives, for our purposes we use its "similar-to" relation instead.
• For word frequency, we generally prefer the counts provided by WordNet, since they are sense-disambiguated. (This use of manually sense-annotated data makes our approach weakly supervised.) In other cases, such as when ranking substitutes from Wiktionary, we use Web 1T (Brants and Franz, 2006) instead of BNC. Web 1T is a much larger, more modern, Web-derived corpus that may better reflect the lemma distributions in the Web-derived SemEval-2007 dataset.
The enhanced D-Bees-based system performs significantly better than the base system, though in common with the two post-SemEval-2007 systems, it still fails to surpass the state of the art for best and OOT. The two knowledge-based systems that outperform our system by a large margin, Giuliano et al. (2007) and Hassan et al. (2007), employ particularly strong substitute generation components that use a combination of WordNet with a rich thesaurus resource: the Oxford American Writer Thesaurus and the Microsoft Encarta encyclopedia, respectively. Both resources outperform Wiktionary in terms of coverage of synonyms and semantically related words. However, as these resources are proprietary, they were not available to us.
Our system's performance is roughly on par with Zhao et al. (2007), another bottom-up approach. Our enhanced system does achieve the highest known GAP score, though this is largely because most prior work does not use this metric, or else applies it only to the ranking of gold-standard substitutes.

Experimental Setup
Our experiments use the dataset from the SemEval-2012 English lexical simplification task. It uses the same contexts and target words as the SemEval-2007 dataset, but the gold-standard substitutes, which include the original target words, have been manually re-ranked according to their perceived simplicity. Unlike SemEval-2007, the SemEval-2012 task is concerned exclusively with ranking substitutes; all the original participating systems were given the gold-standard substitutes and simply asked to put them in the correct order. However, to score our own systems we use their own substitute lists, removing only those substitutes that do not also appear in the gold-standard list. This puts us at somewhat of a disadvantage, since our substitute lists often contain only a subset of the gold-standard substitutes. It also makes use of the κ metric problematic, since κ expects the system and gold-standard lists to contain the same set of substitutes. We therefore report only TRnk and R@n scores.
The task organizers report scores for two lower-bound baselines: one puts the substitute lists in random order, and the other orders them by inverse frequency of occurrence in Web 1T. The state of the art is represented by Jauhar and Specia (2012).
We first calculated the proportion of instances for which our systems suggested at least one substitute appearing in the gold standard (other than the target word itself). For the simulated annealing system, the percentage was 45.7%, for the D-Bees system it was 58.7%, and for the enhanced D-Bees system, it was 81.6%. We tentatively conclude that the soft WSD of enhanced D-Bees is necessary to generate sufficient numbers of substitutes in common with the gold standard, and exclude our other two systems from further consideration.
Since the SemEval-2012 lexical simplification task is concerned only with ranking, we test three different rankings of the enhanced D-Bees substitute list. First, we preserve the original order of the system. Second, we order by unigram frequency in Web 1T, as in the SemEval-2012 baseline. Our third ranking is an n-gram ordering approach that we found to work well (κ = 0.461) on the full gold-standard substitute lists. Here the substitutes are sorted according to the summation of the combined frequency of the substitute and context words. More formally, let W be the set of all unique words in the context window, excluding the target w_t, and let S be the set of substitutes for w_t. Then each substitute s ∈ S is given a score score(s) = Σ_{w ∈ W} f(s, w), where f(s, w) is the Web 1T co-occurrence frequency for s and w. The list of substitutes is then sorted by descending score.
Table 3 shows the published results for our baselines, along with the results from the enhanced D-Bees-based system from §5.2 using various ranking methods. While none of our configurations scored particularly well on TRnk, all of them surpassed the state of the art for R@1 and R@2, and performed about as well as Jauhar and Specia (2012) for R@3. These results are particularly impressive in light of the fact that the SemEval-2012 systems had access to the gold-standard substitutes, whereas our systems did not.
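The n-gram ordering approach can be sketched as below, assuming that a substitute's score is the sum of f(s, w) over all unique context words w other than the target. The co-occurrence counts are invented for illustration and are not taken from Web 1T; the function names are likewise hypothetical.

```python
def cooccurrence_score(substitute, context, target, cooc):
    """Sum of f(s, w) over the unique context words w, excluding the target."""
    window = set(context) - {target}
    return sum(cooc.get((substitute, w), 0) for w in window)


def rank_by_context(substitutes, context, target, cooc):
    """Sort substitutes by descending co-occurrence score."""
    return sorted(
        substitutes,
        key=lambda s: cooccurrence_score(s, context, target, cooc),
        reverse=True,
    )


# Made-up co-occurrence frequencies f(s, w), keyed by (substitute, word).
COOC = {
    ("smart", "man"): 120,
    ("intelligent", "man"): 90,
    ("intelligent", "Einstein"): 40,
    ("luminous", "man"): 2,
}
context = ["Einstein", "was", "a", "bright", "man"]

ranking = rank_by_context(["luminous", "smart", "intelligent"],
                          context, "bright", COOC)
print(ranking)
```

Pairs that never co-occur simply contribute zero, so a substitute unseen with any context word falls to the bottom of the list.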

Results and Analysis
The good R@n scores when using the original ordering indicate that the D-Bees-based system is (quite serendipitously) predisposed to selecting simple substitutes and ranking them relatively highly. We note that there is relatively little difference between our three system configurations, suggesting that all three ranking methods are doing more or less the same thing, at least for the first few substitutes. This result is somewhat surprising in light of Specia et al.'s (2012) assumption that the notion of simplicity is context-dependent. (It is this notion that our n-gram-based ranking model was attempting to capture.) It could be that, for our systems, the context (including text complexity) is already sufficiently accounted for during WSD.

Conclusion
In this paper, we have presented the first extrinsic evaluations of simulated annealing and D-Bees in a lexical substitution setting. We used each algorithm as the WSD component in the same knowledge-based, language-independent lexical substitution system. The systems were tested on German and English datasets, and surpassed state-of-the-art performance on the former. The D-Bees system generally had better results, so we applied some resource-specific adaptations based on our own observations of GermaNet and WordNet, as well as on previously published studies on German and English lexical substitution. These adaptations led to dramatic improvements in performance on both datasets. We also tested the adapted D-Bees system in a lexical simplification setting, where (in spite of some handicaps) it exceeded state-of-the-art performance on two evaluation metrics. Our findings would seem to validate the utility of metaheuristic approaches for lexical substitution and simplification, with the caveat that optimal performance is achieved only when the systems are adapted to the language or linguistic resources used. This adaptation effort may nonetheless be lower than that required to source annotated training data for supervised approaches.
Regarding future work, there are several issues of interest. The first concerns our use of collaboratively constructed language resources. While our WSD components used only expert-built resources, we found it beneficial to draw additional substitution candidates from Wiktionary. For this we used a very basic sense alignment technique, though a more profound sense mapping between WordNet/GermaNet and Wiktionary, such as those surveyed by , might lead to better downstream results. The approach D-Bees uses for calculating sense similarity is also quite basic; though it seemed to work well in practice, we are keen to investigate other methods, such as taking the WordNet/GermaNet graph structure into account, or using other measures of text similarity to compare glosses.