Word sense disambiguation
Word Sense Disambiguation (WSD) is the process of identifying which sense of a polysemous word is being used in a given context.
In modern WSD systems, the senses of a word are typically taken from a specified dictionary, most commonly WordNet. WSD has been investigated in computational linguistics as a distinct task for well over 40 years, though the acronym itself is newer. The SENSEVAL evaluation exercises have attempted to put Word Sense Disambiguation on an empirically measurable footing: a corpus is tagged with WordNet senses, and participants attempt to recover those senses after tuning their systems on a corpus of training data.
One of the first problems encountered by any natural language processing system is lexical ambiguity, be it syntactic or semantic. The resolution of a word's syntactic ambiguity has largely been addressed by part-of-speech taggers, which predict the syntactic category of words in text with high accuracy. <ref>E. Brill. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-566, December 1995</ref> Semantic ambiguity remains: words often have more than one meaning, sometimes closely related and sometimes completely different, and the meaning of a word in a particular usage can only be determined by examining its context. This is, in general, a trivial task for the human language processing system. For example, consider the following two sentences, each with a different sense of the word bank:
- The boy leapt from the bank into the cold water.
- The van pulled up outside the bank and three masked men got out.
We immediately recognise that in the first sentence bank refers to the edge of a river and in the second to a building. However, the task has proved difficult for computers, and some believed it would never be solved. An early sceptic was Bar-Hillel, who famously proclaimed that "sense ambiguity could not be resolved by electronic computer either current or imaginable". <ref>Y. Bar-Hillel. Language and Information. Addison-Wesley, 1964.</ref> The situation has turned out better than Bar-Hillel feared: there have been several advances in word sense disambiguation, and lexical ambiguity in text can now be resolved with a reasonable degree of accuracy.
The problem of WSD was first introduced by Warren Weaver in 1949 <ref>W. Weaver. 1949. Translation. In Machine Translation of Languages: Fourteen Essays, ed. by Locke, W.N. and Booth, A.D. Cambridge, MA: MIT Press.</ref>. In 1975 Kelly and Stone <ref>E.F. Kelly and P.J. Stone. 1975. Computer Recognition of English Word Senses, Amsterdam: North-Holland.</ref> published a book explicitly listing their rules for disambiguating word senses. As large-scale lexical resources became available in the 1980s, the automatic extraction of lexical knowledge became possible, though disambiguation remained knowledge- or dictionary-based. With the rise of statistical methods in computational linguistics in the 1990s, WSD became one of the main focuses of supervised learning techniques.
Under the knowledge-based approach, disambiguation is carried out using information from an explicit lexicon or knowledge base. The lexicon may be a machine-readable dictionary or thesaurus, or it may be hand-crafted. This is one of the most popular approaches to word sense disambiguation, and work has been done using existing lexical knowledge sources such as WordNet <ref>Agirre, E. and Rigau, G. (1996). Word sense disambiguation using conceptual density. In Proceedings of COLING'96</ref> and LDOCE <ref>J. Guthrie, L. Guthrie, Y. Wilks and H. Aidinejad, Subject-Dependent Co-Occurrence and Word Sense Disambiguation, ACL-91, pp. 146-152.</ref>. The information in these resources has been used in several ways. For example, Wilks and Stevenson <ref>Y. Wilks and M. Stevenson. The Grammar of Sense: using part-of-speech tags as a first step in semantic disambiguation. To appear in Journal of Natural Language Engineering, 4(3).</ref> use large lexicons (generally machine-readable dictionaries) and the information associated with each sense (such as part-of-speech tags, topical guides and selectional preferences) to indicate the correct sense. Another approach is to treat the text as an unordered bag of words, calculating the semantic similarity (as measured from the knowledge source) between all the words in a window regardless of their positions, as in the work of Yarowsky <ref>D. Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 454-460, Nantes, France, 1992.</ref>.
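The dictionary-overlap idea can be illustrated with a simplified Lesk-style sketch: choose the sense whose gloss shares the most words with the ambiguous word's context. The glosses and stopword list below are invented stand-ins, not entries from a real lexicon such as WordNet or LDOCE.

```python
# Simplified Lesk-style disambiguation: score each sense by the word
# overlap between its (invented) gloss and the surrounding context.
SENSES = {
    "bank/river": "sloping land beside a body of water such as a river",
    "bank/finance": "an institution where money is deposited and withdrawn",
}
STOPWORDS = {"the", "a", "an", "of", "and", "from", "into",
             "such", "as", "is", "where"}

def lesk(context_words):
    """Return the sense whose gloss overlaps most with the context."""
    context = {w.lower() for w in context_words} - STOPWORDS
    def overlap(sense):
        gloss = set(SENSES[sense].split()) - STOPWORDS
        return len(context & gloss)
    return max(SENSES, key=overlap)

print(lesk("The boy leapt from the bank into the cold water".split()))
# "water" occurs in the river gloss, so the river sense is chosen
print(lesk("three masked men robbed the bank and stole the money".split()))
# "money" occurs in the finance gloss, so the finance sense is chosen
```

A real system would use full dictionary definitions and a wider context window, but the principle is the same: overlap with the knowledge source selects the sense.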
The corpus-based approach attempts to disambiguate words using information gained by training on a corpus, rather than taking it directly from an explicit knowledge source. Training can be carried out on either a disambiguated or a raw corpus, where a disambiguated corpus is one in which the sense of each polysemous lexical item is marked, and a raw corpus is one without such marking.
Supervised techniques require a training corpus which has already been disambiguated. In general, a machine learning algorithm is applied to features extracted from the corpus and used to form a representation of each sense; this representation can then be applied to new instances in order to disambiguate them. Different researchers have used different feature sets, for example local collocates such as the first noun to the left or right, the second word to the left or right, and so on. A more common feature set is simply all the words in a window around the ambiguous word, treating the context as an unordered bag of words. Another approach is to use Hidden Markov Models, which have proved very successful in part-of-speech tagging. Segond et al., <ref>Segond, F., Schiller, A., Grefenstette, G., and Chanod, J. (1997). An experiment in semantic tagging using hidden markov model tagging. In Vossen, P., Adriaens, G., Calzolari, N., Sanfilippo, A., and Wilks, Y., editors, Proceedings of the ACL/EACL'97 Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources.</ref> while acknowledging that semantic tagging is a much more difficult problem than part-of-speech tagging, nonetheless performed an experiment to see how well words can be semantically disambiguated using techniques that have proven effective in part-of-speech tagging. This experiment involved the following steps:
- deriving a lexicon from the WordNet data files which contains all possible semantic tags for each noun, adjective, adverb and verb. Words having no semantic tags (determiners, prepositions, auxiliary verbs, etc.) are ignored.
- constructing a training corpus and a test corpus from the semantically tagged Brown corpus (manually tagged by the WordNet team) by extracting tokens for the HMM bigrams.
- computing an HMM model based on the training corpus, running the tagger on the test corpus and comparing the results with the original tags in the test corpus.
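The final step can be illustrated with a toy Viterbi decoder over a bigram HMM whose hidden states are semantic tags. The tags, vocabulary and probabilities below are invented for illustration; a real tagger would estimate them from the semantically tagged Brown corpus.

```python
import math

# Toy bigram HMM: transition probabilities P(tag | previous tag) and
# emission probabilities P(word | tag). All values are invented.
TAGS = ["animal", "sports"]
TRANS = {("<s>", "animal"): 0.5, ("<s>", "sports"): 0.5,
         ("animal", "animal"): 0.9, ("animal", "sports"): 0.1,
         ("sports", "sports"): 0.9, ("sports", "animal"): 0.1}
EMIT = {("animal", "bat"): 0.3, ("animal", "flew"): 0.6,
        ("sports", "bat"): 0.7, ("sports", "swung"): 0.5}
SMOOTH = 1e-6  # probability floor for unseen (tag, word) pairs

def viterbi(words):
    """Return the most probable semantic-tag sequence for `words`."""
    best = {t: (math.log(TRANS[("<s>", t)])
                + math.log(EMIT.get((t, words[0]), SMOOTH)), [t])
            for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            # Best predecessor for tag t, scored by path + transition.
            p = max(TAGS, key=lambda q: best[q][0] + math.log(TRANS[(q, t)]))
            score = (best[p][0] + math.log(TRANS[(p, t)])
                     + math.log(EMIT.get((t, w), SMOOTH)))
            new[t] = (score, best[p][1] + [t])
        best = new
    return max(best.values())[1]

# Although "bat" alone is more likely a sports term under EMIT, the
# bigram context pulls the whole sequence towards one tag or the other.
print(viterbi(["bat", "flew"]))   # ['animal', 'animal']
print(viterbi(["bat", "swung"]))  # ['sports', 'sports']
```

The sketch shows why the HMM formulation is attractive: the tag of an ambiguous word is decided jointly with its neighbours rather than in isolation.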
The general problem with these methods is their reliance on disambiguated corpora, which are expensive and difficult to obtain. As a result, many of these algorithms have been tested on very small numbers of different words, often as few as 10.
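The supervised bag-of-words approach described above can be sketched with a minimal Naive Bayes classifier (one common choice of learning algorithm, not the only one); the tiny "disambiguated corpus" below is invented for illustration.

```python
from collections import Counter, defaultdict
import math

# Invented sense-tagged training contexts for the ambiguous word "bank".
TRAIN = [
    ("river", "the boy swam from the bank into the cold water"),
    ("river", "fishing from the grassy bank of the river"),
    ("finance", "she deposited the money at the bank"),
    ("finance", "the bank raised its interest rate on loans"),
]

def train(examples):
    """Collect sense priors, per-sense word counts and the vocabulary."""
    prior = Counter(sense for sense, _ in examples)
    counts = defaultdict(Counter)
    vocab = set()
    for sense, text in examples:
        words = text.split()
        counts[sense].update(words)
        vocab.update(words)
    return prior, counts, vocab

def classify(text, prior, counts, vocab):
    """Pick the sense maximising log P(sense) + sum of log P(word | sense),
    with add-one smoothing over the training vocabulary."""
    total = sum(prior.values())
    def score(sense):
        n = sum(counts[sense].values())
        s = math.log(prior[sense] / total)
        for w in text.split():
            s += math.log((counts[sense][w] + 1) / (n + len(vocab)))
        return s
    return max(prior, key=score)

model = train(TRAIN)
print(classify("three men robbed the bank of its money", *model))   # finance
print(classify("he sat on the grassy bank of the river", *model))   # river
```

Even this toy version makes the data bottleneck visible: every sense of every ambiguous word needs its own hand-tagged training examples.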
It is often difficult to obtain appropriate lexical resources, especially for texts in a specialized sublanguage. This lack of resources has led several researchers to explore the use of unannotated, raw corpora to perform unsupervised disambiguation. It should be noted that unsupervised disambiguation cannot actually label specific terms as referring to specific concepts: that would require more information than is available. What it can achieve is word sense discrimination: clustering the instances of a word into distinct categories without giving those categories labels from a lexicon (such as WordNet synsets). An example of this is the dynamic matching technique<ref>Radford et al. (1996)</ref>, which examines all instances of a given term in a corpus and compares the contexts in which they occur for common words and syntactic patterns. A similarity matrix is thus formed, which is subjected to cluster analysis to determine groups of semantically related instances of the term. Another example is the work of Pedersen and Bruce <ref>T. Pedersen and R. Bruce. Distinguishing word senses in untagged text. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Providence, RI, August 1997.</ref>, who compared three different unsupervised learning algorithms on 13 different words. Each algorithm was trained on text which was tagged with either the WordNet or LDOCE sense for each word, but the algorithms had no access to the true senses. What they did have access to was the number of senses of each word, and each algorithm split the instances of each word into the appropriate number of clusters. These clusters were then mapped onto the closest sense from the appropriate lexicon. Unfortunately the results are not very encouraging: Pedersen reports 65-66% correct disambiguation, depending on the learning algorithm used.
This result should be compared against the fact that, in the corpus used, 73% of the instances could be correctly classified by simply choosing the most frequent sense.
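The similarity-matrix-and-clustering idea behind sense discrimination can be sketched as follows. The contexts, stopword list and simple two-seed clustering scheme are all illustrative simplifications, not the actual dynamic matching technique.

```python
from collections import Counter
import math

# Invented contexts for the ambiguous word "bank"; no sense labels
# are used anywhere below, only the contexts themselves.
CONTEXTS = [
    "boy swam from the bank into the cold river water",
    "fishing on the muddy bank of the river",
    "she deposited money at the bank branch",
    "the bank approved the loan and the money transfer",
]
STOP = {"the", "of", "on", "at", "and", "from", "into", "she"}

def vector(text):
    """Bag-of-words vector for a context, minus stopwords."""
    return Counter(w for w in text.split() if w not in STOP)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    return dot / (math.sqrt(sum(v * v for v in a.values()))
                  * math.sqrt(sum(v * v for v in b.values())))

vectors = [vector(c) for c in CONTEXTS]
sim = [[cosine(a, b) for b in vectors] for a in vectors]

# Seed two clusters with the least similar pair of instances, then
# assign every instance to whichever seed it is more similar to.
pairs = [(sim[i][j], i, j) for i in range(len(vectors)) for j in range(i)]
_, s1, s2 = min(pairs)
clusters = [0 if sim[i][s1] >= sim[i][s2] else 1
            for i in range(len(vectors))]
print(clusters)  # the river contexts share one cluster, the finance contexts the other
```

Note that the output is only a partition of the instances; mapping each cluster onto a lexicon sense, as Pedersen and Bruce did, is a separate step requiring external information.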
Hybrid approaches can be classified as neither purely knowledge-based nor purely corpus-based, but draw on both. A good example is Luk's system <ref>A. Luk. Statistical sense disambiguation with relatively small corpora using dictionary definitions. In Proceedings of the 33rd Meetings of the Association for Computational Linguistics (ACL-95), pages 181-188, Cambridge, M.A., 1995.</ref>, which uses the textual definitions of senses from a machine-readable dictionary (LDOCE) to identify relations between senses. A corpus is then used to calculate mutual information scores between these related senses in order to discover which are most useful. This allowed Luk to produce a system which used the information in lexical resources as a way of reducing the amount of text needed in the training corpus. Another example of this approach is the unsupervised algorithm of Yarowsky <ref>D. Yarowsky. Unsupervised word-sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL '95), pages 189-196, Cambridge, MA, 1995.</ref>. This takes a small number of seed definitions of the senses of a word (the seeds could be WordNet synsets or definitions from a lexicon) and uses them to classify the "obvious" cases in a corpus. Decision lists <ref>R. Rivest. Learning decision lists. Machine Learning, 2(3):229-246, 1987</ref> are then used to make generalisations based on the corpus instances classified so far, and these lists are re-applied to the corpus to classify more instances. Learning proceeds in this way until all corpus instances are classified. Yarowsky reports that the system correctly classifies senses 96% of the time.
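The bootstrapping idea behind Yarowsky's algorithm can be sketched as follows. The contexts, seed collocates and one-word rules below are invented simplifications, not the actual decision-list implementation.

```python
from collections import Counter

# Invented contexts for the ambiguous word "plant".
CONTEXTS = [
    "life in the jungle among plant and animal species",
    "plant and animal life in the rainforest",
    "the microscopic plant life of the pond",
    "union workers at the assembly plant",
    "the assembly plant closed and the workers lost their jobs",
    "automation at the car assembly plant",
]
rules = {"living": {"animal"}, "factory": {"workers"}}  # seed collocates
STOP = {"the", "and", "of", "at", "in"}

labels = {}
changed = True
while changed:
    changed = False
    # Label any still-unlabelled context that contains a known collocate.
    for i, ctx in enumerate(CONTEXTS):
        if i in labels:
            continue
        for sense, cues in rules.items():
            if cues & set(ctx.split()):
                labels[i] = sense
                changed = True
                break
    # Harvest new collocates: words frequent in one sense's labelled
    # contexts and absent from the other sense's.
    for sense in rules:
        own = Counter(w for i, s in labels.items() if s == sense
                      for w in CONTEXTS[i].split())
        other = {w for i, s in labels.items() if s != sense
                 for w in CONTEXTS[i].split()}
        for w, n in own.items():
            if (n >= 2 and w not in other and w not in STOP
                    and w not in rules[sense]):
                rules[sense].add(w)
                changed = True

for i, ctx in enumerate(CONTEXTS):
    print(labels[i], "<-", ctx)
```

Here the seeds label only the obvious cases; the harvested collocates ("life", "assembly") then pull in the remaining contexts on the next pass, mirroring how Yarowsky's decision lists grow from a small seed set.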
There are several ongoing debates within the field, including whether the senses offered in existing dictionaries are adequate to distinguish the subtle meanings words take on in context, and how to evaluate the overall performance of a WSD system. For example, does it make sense to report a single overall accuracy figure for a WSD system, or does evaluation require comparing system performance on a word-by-word basis?