The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD

UKB is an open source collection of programs for performing, among other tasks, knowledge-based Word Sense Disambiguation (WSD). Since its release in 2009 it has often been used out-of-the-box with sub-optimal settings. We show that, nine years later, it is the state of the art in knowledge-based WSD. This case illustrates the pitfalls of releasing open source NLP software without optimal default settings and precise instructions for reproducibility.


Introduction
The release of open-source Natural Language Processing (NLP) software has been key to the progress of the field, as it makes it easy for other researchers to build upon previous results and software. It also facilitates reproducibility, allowing for sound scientific progress. Unfortunately, in some cases, it also allows competing systems to run the open-source software out-of-the-box with suboptimal parameters, especially in fields where there is no standard benchmark and new benchmarks (or new versions of older benchmarks) are created.
Once a paper reports sub-optimal results for an NLP tool, newer papers can start to routinely quote the low results from that study. Finding a fix to this situation is not easy. The authors of the software can contact the authors of the more recent papers, but it is usually too late to update the paper. Alternatively, the authors of the software can try to publish a new paper with updated results, but there is usually no venue for such a paper, and, even if published, it might go unnoticed in the field.
In this paper we report such a case in Word Sense Disambiguation (WSD), where the original software (UKB) was released with suboptimal default parameters. Although the accompanying papers did contain the information needed to obtain state-of-the-art results, the software did not include step-by-step instructions or end-to-end scripts for optimal performance. This case is special in that the software attains state-of-the-art results also on newer datasets, using the same settings as in the papers.
The take-away message for open-source NLP software authors is that they should not rely on other researchers reading the papers with care, and that it is extremely important to include, with the software release, precise instructions and optimal default parameters, or better still, end-to-end scripts that download all resources, perform any necessary pre-processing and reproduce the results.
The first section presents UKB and WSD, followed by the settings and parameters. Next we present the results and a comparison to the state-of-the-art. Section 5 reports some additional results, and finally we draw the conclusions.

WSD and UKB
Word Sense Disambiguation (WSD) is the problem of assigning the correct sense to a word in context (Agirre and Edmonds, 2007). Traditionally, supervised approaches have attained the best results in the area, but they are expensive to build because they require large amounts of manually annotated examples. Alternatively, knowledge-based approaches rely on lexical resources such as WordNet, which are nowadays widely available in many languages (Bond and Paik, 2012). In particular, graph-based approaches represent the knowledge base as a graph and apply well-known graph analysis algorithms to perform WSD.
UKB is a collection of programs first released for performing graph-based Word Sense Disambiguation using a preexisting knowledge base such as WordNet, and it attained state-of-the-art results among knowledge-based systems when evaluated on standard benchmarks (Agirre et al., 2014). In addition, UKB has been extended to perform disambiguation of medical entities (Agirre et al., 2010) and named entities (Erbs et al., 2012), to compute word similarity, and to create knowledge-based word embeddings (Goikoetxea et al., 2015). All programs are open source (http://ixa2.si.ehu.eus/ukb, https://github.com/asoroa/ukb) and are accompanied by the resources and instructions necessary to reproduce the results. The software is quite popular, with 60 stars and 26 forks on GitHub, as well as more than eight thousand direct downloads from the website since 2011. The software is coded in C++ and released under the GPL v3.0 license.
When UKB was released, the papers specified the optimal parameters for WSD (Agirre et al., 2014), as well as other key issues such as the underlying knowledge-base version, the specific set of relations to be used, and the method to pre-process the input text. At the time, we assumed that future researchers would use the optimal parameters and settings specified in the papers, and that they would contact the authors if in doubt. The default parameters of the software were not optimal, and the other issues were left under the user's responsibility.
The assumption failed, and several papers reported low results in some new datasets (including updated versions of older datasets), as we will see in the following sections.

UKB parameters and setting for WSD
When using UKB for WSD, the main parameters and settings can be classified into four main categories. For each of them we mention the best options and, where relevant, the associated UKB parameter (in italics), as taken from (Agirre et al., 2014):
• Pre-processing of input text. When running UKB for WSD, one needs to define which window of words is to be used as context to initialize the random walks. One option is to take just the sentence, but given that in some cases the sentences are very short, better results are obtained when also considering the previous and following sentences. The procedure in the original paper repeats this extension until the total length of the context is at least 20 words.
• Knowledge base relations. When performing WSD for English, UKB uses WordNet (Fellbaum, 1998) as the knowledge base. WordNet comes in various versions, and UKB usually performs best when using the same version the dataset was annotated with. Besides regular WordNet relations, gloss relations (relations between synsets appearing in the glosses) have been shown to be consistently helpful.
• Graph algorithm. UKB implements different graph-based algorithms and variants to perform WSD. These are the main ones:
ppr_w2w: apply personalized PageRank for each target word, that is, perform a random walk in the graph personalized on the word context. It yields the best results overall, at the cost of being more time-consuming than the rest.
ppr: same as above, but apply personalized PageRank to each sentence only once, disambiguating all content words in the sentence in one go. It is thus faster than the previous approach, but obtains worse results.
dfs: unlike the two previous algorithms, which consider the WordNet graph as a whole, this algorithm first creates a subgraph for each context, following the method first presented in Navigli and Lapata (2010), and then runs the PageRank algorithm over the subgraph. This option represents a compromise between ppr_w2w and ppr, as it is faster than the former while more accurate than the latter.
• Use of sense frequencies (dict_weight). Sense frequencies are a valuable piece of information describing how often a word is associated with each of its possible senses. The frequencies are often derived from manually sense-annotated corpora, such as SemCor (Miller et al., 1993). We use the sense frequencies accompanying WordNet, which, according to the documentation, "represent the decimal number of times the sense is tagged in various semantic concordance texts". The frequencies are smoothed by adding one to all counts (dict_weight_smooth). The sense frequency is used when initializing context words, and also to produce the final sense weights as a linear combination of sense frequencies and graph-based sense probabilities. The use of sense frequencies with UKB was introduced in (Agirre et al., 2014).
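The combination of personalized PageRank with smoothed sense frequencies can be illustrated with a small self-contained sketch. This is not UKB's C++ implementation: the toy graph, the sense counts and the mixing weight alpha below are all made up for the example.

```python
# Illustrative sketch: personalized PageRank over a toy sense graph,
# combined linearly with add-one-smoothed sense frequencies.
# All node names, counts and weights are hypothetical.

def personalized_pagerank(graph, personalization, damping=0.85, iters=100):
    """Power iteration for PageRank with a personalization (restart) vector.

    graph: dict node -> list of neighbours (undirected edges listed both ways)
    personalization: dict node -> restart probability (sums to 1)
    """
    nodes = list(graph)
    rank = {n: personalization.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Mass flowing into n from each neighbour m, split over m's degree.
            incoming = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
            new[n] = (1 - damping) * personalization.get(n, 0.0) + damping * incoming
        rank = new
    return rank

# Toy graph: two senses of "bank" linked to senses of context words.
graph = {
    "bank#1": ["money#1", "deposit#1"],
    "bank#2": ["river#1"],
    "money#1": ["bank#1", "deposit#1"],
    "deposit#1": ["bank#1", "money#1"],
    "river#1": ["bank#2"],
}

# Personalize the random walk on the context senses.
context = ["money#1", "deposit#1"]
personalization = {s: 1.0 / len(context) for s in context}
rank = personalized_pagerank(graph, personalization)

# Add-one smoothing of (hypothetical) sense counts, then a linear
# combination of sense frequency and graph-based probability.
counts = {"bank#1": 20, "bank#2": 5}
smoothed = {s: c + 1 for s, c in counts.items()}
total = sum(smoothed.values())
freq = {s: c / total for s, c in smoothed.items()}

alpha = 0.5  # mixing weight, an assumption for illustration
score = {s: alpha * freq[s] + (1 - alpha) * rank[s] for s in counts}
best = max(score, key=score.get)  # the chosen sense of "bank"
```

Here both the frequency term and the graph term favour bank#1, since the context senses are connected to it and it is the more frequent sense.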

Comparison to the state-of-the-art
We evaluate UKB on the recent evaluation dataset described in (Raganato et al., 2017a). This dataset comprises five standard English all-words datasets, standardized into a unified format with gold keys in WordNet version 3.0 (some of the original datasets used older versions of WordNet). The dataset contains 7,253 instances of 2,659 different content words (nouns, verbs, adjectives and adverbs), with an average ambiguity of 5.9 senses per word. We report F1, the harmonic mean of precision and recall, as computed by the evaluation code accompanying the dataset.
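Since F1 is the harmonic mean of precision and recall, a system that leaves some instances unanswered trades recall for precision. A minimal sketch with made-up counts (only the instance total of 7,253 comes from the dataset; the rest is hypothetical):

```python
# F1 as the harmonic mean of precision and recall.
# Hypothetical system: answers 7000 of the 7253 instances, 4500 correct.
correct, attempted, total = 4500, 7000, 7253

precision = correct / attempted
recall = correct / total
f1 = 2 * precision * recall / (precision + recall)

# With these definitions, F1 algebraically simplifies to
# 2 * correct / (attempted + total).
```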
The two top rows in Table 1 show conflicting results for UKB. The first row corresponds to UKB run with the settings described above. The second row was first reported in (Raganato et al., 2017a). As the results show, that paper reports a suboptimal use of UKB. In more recent work, Chaplot and Salakhutdinov (2018) take up that result and report it in their paper as well. The difference is nearly 10 absolute F1 points overall. This decrease could be caused by the fact that Raganato et al. (2017a) did not use sense frequencies.
In addition to UKB, the table also reports the best performing knowledge-based systems on this dataset. Note that the UKB results for S2, S3 and S07 (62.6, 63.0 and 48.6 respectively) differ from those in (Agirre et al., 2014), which is to be expected, as the new datasets have been converted to WordNet 3.0 (we confirmed experimentally that this is the sole difference between the two experiments). Raganato et al. (2017a) run several well-known algorithms when presenting their datasets. We also report (Chaplot and Salakhutdinov, 2018), the latest work in this area, as well as the most frequent sense as given by WordNet counts (see Section 3). The table shows that UKB yields the best overall result. Note that Banerjee and Pedersen (2003) do not use sense frequency information. For completeness, Table 2 reports the results of supervised systems on the same dataset, taken from the two works that use the dataset (Yuan et al., 2016; Raganato et al., 2017b). As expected, supervised systems outperform knowledge-based systems, by a small margin in some cases.

Additional results
In addition to the results of UKB using the settings in (Agirre et al., 2014), as specified in Section 3, we checked whether other reasonable settings would obtain better results. Table 3 shows the results when applying the three algorithms described in Section 3, both with and without sense frequencies, and using either a single sentence or an extended context. The table shows that the key factor is the use of sense frequencies: systems that do not use them (those with an nf subscript) suffer a loss of between 7 and 8 percentage points in F1. This would explain part of the decrease in performance reported in (Raganato et al., 2017a), as they explicitly mention that they did not activate the use of sense frequencies in UKB.
The table also shows that extending the context is mildly effective. Regarding the algorithm, the table confirms that the best method is ppr_w2w, followed by the subgraph approach (dfs) and ppr.
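The context-extension procedure from Section 3 (growing the window with neighbouring sentences until it reaches at least 20 words) can be sketched as follows. This is an illustrative reimplementation under our reading of the procedure, not UKB's actual pre-processing code:

```python
# Illustrative sketch: extend the context around a target sentence with
# neighbouring sentences until it contains at least min_words words.

def extend_context(sentences, idx, min_words=20):
    """Return the span of sentences around sentences[idx] whose total word
    count is at least min_words (or the whole document if it is shorter)."""
    def n_words(span):
        return sum(len(s.split()) for s in span)

    lo = hi = idx
    while n_words(sentences[lo:hi + 1]) < min_words:
        grew = False
        if lo > 0:           # add the previous sentence, if any
            lo -= 1
            grew = True
        if hi < len(sentences) - 1:  # add the following sentence, if any
            hi += 1
            grew = True
        if not grew:         # whole document already included
            break
    return sentences[lo:hi + 1]

# Toy document (tokenized sentences, one string each).
doc = [
    "Short sentence one .",
    "The target sentence is here .",
    "Another short sentence follows .",
    "And one more to pad the context a little further .",
]
context = extend_context(doc, 1)
```

For the short target sentence above, the window grows to cover the whole toy document before reaching 20 words.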

Conclusions
This paper presents a case where a piece of open-source NLP software was used with suboptimal parameters by third parties. UKB was released with suboptimal default parameters, and although the accompanying papers did describe the settings needed for good results on WSD, bad results were not prevented. The results obtained with the settings described in the papers on newly released datasets show that UKB is the best among knowledge-based WSD algorithms.
The take-away message for open-source NLP software authors is that they should not rely on other researchers reading the papers with care, and that it is extremely important to include, with the software release, precise instructions and optimal default parameters, or better still, end-to-end scripts that download all resources, perform any necessary pre-processing and reproduce the results. UKB version 3.1 now includes such end-to-end scripts and the appropriate default parameters.