One Million Sense-Tagged Instances for Word Sense Disambiguation and Induction

Supervised word sense disambiguation (WSD) systems are usually the best performing systems when evaluated on standard benchmarks. However, these systems need annotated training data to function properly. While there are some publicly available open source WSD systems, very few large annotated datasets are available to the research community. The two main goals of this paper are to extract and annotate a large number of samples and release them for public use, and also to evaluate this dataset against some word sense disambiguation and induction tasks. We show that the open source IMS WSD system trained on our dataset achieves state-of-the-art results in standard disambiguation tasks and a recent word sense induction task, outperforming several task submissions and strong baselines.


Introduction
Identifying the meaning of a word automatically has been an interesting research topic for a few decades. The approaches used to solve this problem can be roughly categorized into two main classes: Word Sense Disambiguation (WSD) and Word Sense Induction (WSI) (Navigli, 2009). For word sense disambiguation, some systems are based on supervised machine learning algorithms (Lee et al., 2004; Zhong and Ng, 2010), while others use ontologies and other structured knowledge sources (Ponzetto and Navigli, 2010; Agirre et al., 2014; Moro et al., 2014).
There are several sense-annotated datasets for WSD (Miller et al., 1993; Ng and Lee, 1996; Passonneau et al., 2012). However, these datasets either include few samples per word sense or only cover a small set of polysemous words.
To overcome these limitations, automatic methods have been developed for annotating training samples. For example, Ng et al. (2003), Chan and Ng (2005), and Zhong and Ng (2009) used Chinese-English parallel corpora to extract samples for training their supervised WSD system. Diab (2004) proposed an unsupervised bootstrapping method to automatically generate a sense-annotated dataset. Another example of automatically created datasets is the semi-supervised method used in (Kübler and Zhekova, 2009), which employed a supervised classifier to label instances.
The two main contributions of this paper are as follows. First, we employ the same method used in (Chan and Ng, 2005) to semi-automatically annotate one million training samples based on the WordNet sense inventory (Miller, 1995) and release the annotated corpus for public use. To our knowledge, this annotated set of sense-tagged samples is the largest publicly available dataset for word sense disambiguation. Second, we train an open source supervised WSD system, IMS (Zhong and Ng, 2010), using our data and evaluate it against standard WSD and WSI benchmarks. We show that our system outperforms other state-of-the-art systems in most cases. Any WSD system is also a WSI system if we treat the pre-defined sense inventory of the WSD system as the induced word senses, so a WSD system can also be evaluated and used for WSI. Some researchers believe that, in some cases, WSI methods may perform better than WSD systems (Jurgens and Klapaftis, 2013; Wang et al., 2015). However, we argue that WSI systems have few advantages compared to WSD methods, and according to our results, disambiguation systems consistently outperform induction systems. Although there are some cases where WSI systems can be useful (e.g., for resource-poor languages), in most cases WSD systems are preferable because of higher accuracy and better interpretability of output.
The rest of this paper is composed of the following sections. Section 2 explains our methodology for creating the training data. We evaluate the extracted data in Section 3 and finally, we conclude the paper in Section 4.

Training Data
In order to train a supervised word sense disambiguation system, we extract and sense-tag data from a freely available parallel corpus, in a semiautomatic manner. To increase the coverage and therefore the ultimate performance of our WSD system, we also make use of existing sense-tagged datasets. This section explains each step in detail.
Since the main purpose of this paper is to build and release a publicly available training set for word sense disambiguation systems, we selected the MultiUN corpus (MUN) (Eisele and Chen, 2010) produced in the EuroMatrixPlus project 1 . This corpus is freely available via the project website and includes seven languages. An automatically sentence-aligned version of this dataset can be downloaded from the OPUS website 2 and therefore we decided to extract samples from this sentence-aligned version.
To extract training data from the MultiUN parallel corpus, we follow the approach described in (Chan and Ng, 2005) and select the Chinese-English part of the MultiUN corpus. The extraction method has the following steps:

1. Tokenization and word segmentation: The English side of the corpus is tokenized using the Penn TreeBank tokenizer 3 , while the Chinese side of the corpus is segmented using the Chinese word segmenter of (Low et al., 2005).
2. Word alignment: After tokenizing the texts, GIZA++ (Och and Ney, 2000) is used to align English and Chinese words.
3. Annotation: In order to assign a WordNet sense tag to an English word w_e in a sentence, we make use of the aligned Chinese translation w_c of w_e, based on the automatic word alignment formed by GIZA++. For each sense i of w_e in the WordNet sense inventory (WordNet 1.7.1), a list of Chinese translations of sense i of w_e has been manually created. If w_c matches one of these Chinese translations of sense i, then w_e is tagged with sense i.
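The annotation step can be illustrated with a minimal sketch. It assumes the word alignment is already available as index pairs, and it uses a toy, hypothetical sense inventory (the sense keys such as `bank#1` and the Chinese translations are illustrative, not taken from the released data). Returning the first matching sense is also an assumption made here for simplicity; the text does not say how a translation shared by several senses is resolved.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy inventory: English word -> sense key -> manually assigned
# Chinese translations. All entries are illustrative only.
SENSE_TRANSLATIONS: Dict[str, Dict[str, Set[str]]] = {
    "bank": {
        "bank#1": {"银行"},        # financial institution
        "bank#2": {"岸", "河岸"},  # land beside a river
    },
}


def tag_sense(w_e: str, w_c: Optional[str]) -> Optional[str]:
    """Return the first sense of w_e whose translation list contains w_c."""
    if w_c is None:
        return None
    for sense, translations in SENSE_TRANSLATIONS.get(w_e, {}).items():
        if w_c in translations:
            return sense
    return None


def annotate(
    english: List[str],
    chinese: List[str],
    alignment: List[Tuple[int, int]],
) -> List[Tuple[str, Optional[str]]]:
    """Tag each English token using its aligned Chinese token, if any.

    alignment holds (i, j) pairs linking english[i] to chinese[j], standing in
    for the output of an automatic word aligner.
    """
    aligned = {i: chinese[j] for i, j in alignment}
    return [(tok, tag_sense(tok, aligned.get(i))) for i, tok in enumerate(english)]


if __name__ == "__main__":
    print(annotate(["the", "bank", "closed"], ["银行", "关闭", "了"], [(1, 0), (2, 1)]))
```

Tokens that are unaligned, or whose aligned translation matches no listed sense, are left untagged in this sketch.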
The average time needed to manually assign Chinese translations to the word senses of one word type for noun, adjective, and verb is 20, 25, and 40 minutes respectively (Chan, 2008). The above procedure annotates the top 60% most frequent word types (nouns, verbs, and adjectives) in English, selected based on their frequency in the Brown corpus. This set of selected word types includes 649 nouns, 190 verbs, and 319 adjectives.
Since automatic sentence and word alignment can be noisy, and a Chinese word w_c can occasionally be a valid translation of more than one sense of an English word w_e, the senses tagged using the above procedure may be erroneous. To get an idea of the accuracy of the senses tagged with this procedure, we manually evaluated a subset of 1,000 randomly selected sense-tagged instances. Although the sense inventory is fine-grained (WordNet 1.7.1), the sense-tag accuracy achieved is 83.7%. We also performed an error analysis to identify the sources of errors. We found that only 4% of errors are caused by wrong sentence or word alignment. However, 69% of erroneous sense-tagged instances are the result of a Chinese word associated with multiple senses of a target English word. In such cases, the Chinese word is linked to multiple sense tags and therefore, errors in sense-tagged data are introduced. Our results are similar to those reported in (Chan, 2008).
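As a rough worked example, the reported percentages can be turned into approximate instance counts. The sketch below only restates the figures given above (1,000 instances, 83.7% accuracy, 4% and 69% of errors); the per-category counts are approximations because the percentages are presumably rounded.

```python
sample_size = 1000   # manually evaluated instances
accuracy = 0.837     # reported sense-tag accuracy

correct = round(sample_size * accuracy)   # correctly tagged instances
errors = sample_size - correct            # erroneously tagged instances

# Approximate counts per error source (the reported shares are rounded).
alignment_errors = 0.04 * errors          # wrong sentence or word alignment
ambiguity_errors = 0.69 * errors          # Chinese word shared by several senses

print(correct, errors, round(alignment_errors, 1), round(ambiguity_errors, 1))
```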
To speed up the training process, we perform random sampling on the sense tags with more than 500 samples and limit the number of samples per sense to 500. However, all samples of senses with fewer than 500 samples are included in the training data. This sampling method ensures that rare sense tags also have training samples during the selection process.
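The per-sense cap described above can be sketched as follows. The function name, the `(text, sense)` instance format, and the fixed random seed are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Instance = Tuple[str, str]  # (context text, sense tag); format is an assumption


def cap_per_sense(
    instances: List[Instance], max_per_sense: int = 500, seed: int = 0
) -> List[Instance]:
    """Keep at most max_per_sense randomly chosen instances for each sense tag."""
    rng = random.Random(seed)
    by_sense: Dict[str, List[Instance]] = defaultdict(list)
    for instance in instances:
        by_sense[instance[1]].append(instance)

    sampled: List[Instance] = []
    for items in by_sense.values():
        if len(items) > max_per_sense:
            # Frequent sense: draw a random subset without replacement.
            items = rng.sample(items, max_per_sense)
        # Rare sense: all of its instances are kept.
        sampled.extend(items)
    return sampled
```

Because the cap is applied per sense rather than per word, rare senses keep every instance they have.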
In order to improve the coverage of the training set, we augment it by adding samples from SEMCOR (SC) (Miller et al., 1993) and the DSO corpus (Ng and Lee, 1996). We only add the 28 most frequent adverbs from SEMCOR because we observe almost no improvement when adding all adverbs. We notice that the DSO corpus generally improves the performance of our system. However, since the annotated DSO corpus is copyrighted, we are unable to release a dataset including the DSO corpus. Therefore, we experiment with two different configurations, one with the DSO corpus and one without, although the released dataset will not include the DSO corpus.
Since some shared tasks use newer WordNet versions, we convert the training set sense labels using the sense mapping files provided by WordNet 5 . As replicating our results requires WordNet versions 1.7.1, 2.1, and 3.0, we release our sense-tagged dataset in all three versions. Some statistics about the sense-tagged training set can be found in Table 1 to Table 3.
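Converting sense labels between WordNet versions amounts to a lookup in a mapping table. The sketch below assumes the mapping has already been loaded into a dictionary; the sense keys shown are hypothetical, and the actual format of the WordNet sense mapping files is not reproduced here. Instances whose sense has no entry are returned separately rather than silently dropped, which is a design choice of this sketch.

```python
from typing import Dict, List, Tuple

Instance = Tuple[str, str]  # (context text, sense tag); format is an assumption


def convert_senses(
    instances: List[Instance], mapping: Dict[str, str]
) -> Tuple[List[Instance], List[Instance]]:
    """Relabel instances with target-version senses.

    Returns (converted, unmapped), where unmapped holds the instances whose
    source sense has no entry in the mapping.
    """
    converted: List[Instance] = []
    unmapped: List[Instance] = []
    for text, sense in instances:
        if sense in mapping:
            converted.append((text, mapping[sense]))
        else:
            unmapped.append((text, sense))
    return converted, unmapped


if __name__ == "__main__":
    # Hypothetical mapping from an older sense key to a newer one.
    toy_mapping = {"old:bank#1": "new:bank#1"}
    data = [("went to the bank", "old:bank#1"), ("river bank", "old:bank#9")]
    print(convert_senses(data, toy_mapping))
```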

Evaluation
For the WSD system, we use IMS (Zhong and Ng, 2010) in our experiments. IMS is a supervised WSD system based on support vector machines (SVM). This WSD system comes with out-of-the-box pre-trained models. However, since the original training set is not released, we use our own training set (see Section 2) to train IMS and then evaluate it on standard WSD and WSI benchmarks. This section presents the results obtained on four WSD shared tasks and one WSI shared task. The four all-words WSD shared tasks are SensEval-2 (Edmonds and Cotton, 2001), SensEval-3 task 1 (Snyder and Palmer, 2004), and both the fine-grained task 17 and coarse-grained task 7 of SemEval-2007 (Pradhan et al., 2007; Navigli et al., 2007). These all-words WSD shared tasks provide no training data to the participants. The selected word sense induction task in our experiments is SemEval-2013 task 13 (Jurgens and Klapaftis, 2013).
5 http://wordnet.princeton.edu/wordnet/download/currentversion/

WSD All-Words Tasks
The results of our experiments on WSD tasks are presented in Table 4. For the SensEval-2 and SensEval-3 test sets, we use the training set with the WordNet 1.7.1 sense inventory and for the SemEval-2007 test sets, we use training data with the WordNet 2.1 sense inventory.
In Table 4, IMS (original) refers to the IMS system trained with the original training instances as reported in (Zhong and Ng, 2010). We also compare our systems with two other configurations obtained from training IMS on SEMCOR, and SEMCOR plus DSO datasets. In Table 4, these two settings are shown by IMS (SC) and IMS (SC+DSO), respectively. Finally, Rank 1 and Rank 2 are the top two participating systems in the respective all-words tasks.
As shown in Table 4, our systems (both with and without the DSO corpus as training instances) perform competitively with and in some cases even better than the original IMS and also the best shared task submissions. This shows that our training set is of high quality and training a supervised WSD system using our training data achieves state-of-the-art results on the all-words tasks. Since the MUN dataset does not cover all target word types in the all-words shared tasks, the accuracy achieved with MUN alone is lower than the SC and SC+DSO settings. However, the evaluation results show that IMS trained on MUN alone often performs better than or is competitive with the WordNet Sense 1 baseline. Finally, it can be seen that adding the training instances from MUN (that is, IMS (MUN+SC) and IMS (MUN+SC+DSO)) often achieves higher accuracy than without MUN instances (IMS (SC) and IMS (SC+DSO)).

SemEval-2013 Word Sense Induction Task
[Table caption fragment: ... 1.7.1). The size column shows the total size of each dataset in megabytes or gigabytes.]
In order to evaluate our system on a word sense induction task, we selected SemEval-2013 task 13, the latest WSI shared task. Unlike most other tasks that assume a single sense is sufficient for representing word senses, this task allows each instance to be associated with multiple sense labels with their applicability weights. This WSI task considers 50 lemmas, including 20 nouns, 20 verbs, and 10 adjectives, annotated with the WordNet 3.1 sense inventory. We use WordNet 3.0 in our experiments on this task. We evaluated our system using all measures used in the shared task. The results are presented in Table 5. The columns in this table denote the scores of the various systems according to the different evaluation metrics used in the WSI shared task, which are Jaccard Index, K_δ^sim, WNDCG, Fuzzy NMI, and Fuzzy B-Cubed. See (Jurgens and Klapaftis, 2013) for details of the evaluation metrics.
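Of the metrics listed above, the Jaccard Index is the simplest to illustrate: it compares the set of senses a system assigns to an instance with the set of gold senses. The sketch below is the generic set-overlap definition, not the shared task's official scorer; the sense labels are hypothetical, and returning 1.0 when both sets are empty is an assumption of this sketch.

```python
from typing import Set


def jaccard_index(gold: Set[str], predicted: Set[str]) -> float:
    """Size of the intersection divided by size of the union of two sense sets."""
    union = gold | predicted
    if not union:
        # Assumption: two empty label sets are treated as identical.
        return 1.0
    return len(gold & predicted) / len(union)


if __name__ == "__main__":
    # Hypothetical sense labels for one instance.
    print(jaccard_index({"s1", "s2"}, {"s2", "s3"}))
```

A score of 1.0 means the system assigned exactly the gold set of senses; disjoint sets score 0.0.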
This table also includes the top two systems in the shared task, AI-KU (Baskaya et al., 2013) and Unimelb (Lau et al., 2013), as well as Wang-15 (Wang et al., 2015). AI-KU uses a language model to find the most likely substitutes for a target word to represent the context. The clustering method used in AI-KU is K-means and the system gives good performance in the shared task. Unimelb relies on Hierarchical Dirichlet Process (Teh et al., 2006) to identify the sense of a target word using positional word features. Finally, Wang-15 uses Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to model the word sense and topic jointly. This system obtains high scores, according to Fuzzy B-Cubed and Fuzzy NMI measures. The last three rows are some baseline systems: grouping all instances into one cluster, grouping each instance into a cluster of its own, and assigning the most frequent sense in SEMCOR to all instances. As shown in Table 5, training IMS on our training data outperforms all other systems on three out of five evaluation metrics, and performs competitively on the remaining two metrics.
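The three baselines described above are simple enough to sketch directly. The function names, cluster labels, and the representation of a labeling as a dictionary from instance id to label are assumptions made for illustration; the most frequent sense is computed from whatever list of training sense tags is passed in.

```python
from collections import Counter
from typing import Dict, List


def one_cluster(instance_ids: List[str]) -> Dict[str, str]:
    """Baseline: group all instances into a single cluster."""
    return {i: "c0" for i in instance_ids}


def one_cluster_per_instance(instance_ids: List[str]) -> Dict[str, str]:
    """Baseline: put each instance into a cluster of its own."""
    return {i: f"c{n}" for n, i in enumerate(instance_ids)}


def most_frequent_sense(
    instance_ids: List[str], training_senses: List[str]
) -> Dict[str, str]:
    """Baseline: assign the most frequent training sense to every instance."""
    mfs = Counter(training_senses).most_common(1)[0][0]
    return {i: mfs for i in instance_ids}


if __name__ == "__main__":
    ids = ["inst1", "inst2", "inst3"]
    print(one_cluster(ids))
    print(one_cluster_per_instance(ids))
    print(most_frequent_sense(ids, ["s1", "s2", "s1"]))
```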
IMS trained on MUN alone (IMS (MUN)) outperforms IMS (SC) and IMS (SC+DSO) in terms of the first three evaluation measures, and achieves comparable Fuzzy NMI and Fuzzy B-Cubed scores. Moreover, the evaluation results show that IMS (MUN) often performs better than the SEMCOR most frequent sense baseline. Finally, it can be observed that in most cases, adding training instances from MUN significantly improves IMS (SC) and IMS (SC+DSO).

Conclusion
One of the major problems in building supervised word sense disambiguation systems is the training data acquisition bottleneck. In this paper, we semi-automatically extracted and sense-tagged an English corpus containing one million sense-tagged instances. This large sense-tagged corpus can be used for training any supervised WSD system. We then evaluated the performance of IMS trained on our sense-tagged corpus in several WSD and WSI shared tasks. Our sense-tagged dataset has been released publicly 6 . We believe our dataset is the largest publicly available annotated dataset for WSD at present.
After training a supervised WSD system using our training set, we evaluated the system using standard benchmarks. The evaluation results show that our sense-tagged corpus can be used to build a WSD system that performs competitively with the top performing WSD systems in the SensEval-2, SensEval-3, and SemEval-2007 fine-grained and coarse-grained all-words tasks, as well as the top systems in the SemEval-2013 WSI task.
Table 5: Supervised and unsupervised evaluation results (in %) on SemEval-2013 word sense induction task