Clustering by Committee
CBC (Clustering by Committee) is both a clustering algorithm and a resulting knowledge collection created by Patrick Pantel and Dekang Lin at the University of Alberta. The algorithm is a general-purpose partitioning clustering algorithm. The authors have used it more specifically for automatically clustering documents and for automatically inducing concepts and word senses.
The CBC knowledge collection consists of concepts. Each concept is a cluster of instances, like the three shown below, together with a template of typical grammatical contexts (lexical co-occurrence vectors) extracted from a text corpus:
- (A) multiple sclerosis, diabetes, osteoporosis, cardiovascular disease, Parkinson's, rheumatoid arthritis, heart disease, asthma, cancer, hypertension, lupus, high blood pressure, arthritis, emphysema, epilepsy, cystic fibrosis, leukemia, hemophilia, Alzheimer, myeloma, glaucoma, schizophrenia, ...
- (B) Mike Richter, Tommy Salo, John Vanbiesbrouck, Curtis Joseph, Chris Osgood, Steve Shields, Tom Barrasso, Guy Hebert, Arturs Irbe, Byron Dafoe, Patrick Roy, Bill Ranford, Ed Belfour, Grant Fuhr, Dominik Hasek, Martin Brodeur, Mike Vernon, Ron Tugnutt, Sean Burke, Zach Thornton, Jocelyn Thibault, Kevin Hartman, Felix Potvin, ...
- (C) pink, red, turquoise, blue, purple, green, yellow, beige, orange, taupe, white, lavender, fuchsia, brown, gray, black, mauve, royal blue, violet, chartreuse, teal, gold, burgundy, lilac, crimson, garnet, coral, grey, silver, olive green, cobalt blue, scarlet, tan, amber, ...
Using sets of representative elements, called committees, CBC discovers concept signatures that unambiguously describe the members of a possible concept (e.g. diseases, hockey goalies, and colors). Concept signatures are templates of grammatical relations (lexical co-occurrence vectors) that apply to most instances of the concept. The algorithm first discovers committees that are well scattered in the similarity space. It then assigns each word to its most similar committees, each of which represents a final cluster. After assigning a word to a committee, CBC removes from the word the features (syntactic co-occurrences) that it shares with that committee before assigning the word to another committee. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses.
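The assignment step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the sparse-dictionary representation, the cosine similarity measure, the threshold value, the function names (cosine, assign_senses) and the toy feature weights are all assumptions chosen for illustration, and the committee discovery phase is skipped by supplying the committees by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def assign_senses(word_vec, committees, threshold=0.2):
    """Assign a word to committees, one sense at a time (hypothetical helper).

    After each assignment, the features of the chosen committee are
    removed from the word so that weaker senses can surface.
    """
    remaining = dict(word_vec)     # work on a copy of the word's features
    candidates = dict(committees)  # committees not yet assigned
    assigned = []
    while remaining and candidates:
        best = max(candidates, key=lambda name: cosine(remaining, candidates[name]))
        if cosine(remaining, candidates[best]) < threshold:
            break  # no remaining committee is similar enough
        assigned.append(best)
        for feature in candidates[best]:
            remaining.pop(feature, None)  # drop overlapping features
        del candidates[best]
    return assigned

# Toy data (invented): committee centroids and one ambiguous word.
committees = {
    "fruit": {"eat": 1.0, "ripe": 1.0, "juice": 0.5},
    "company": {"ceo_of": 1.0, "shares_of": 1.0, "announce": 0.5},
    "color": {"paint": 1.0, "bright": 1.0},
}
apple = {"eat": 3.0, "ripe": 2.0, "juice": 1.0, "ceo_of": 0.5, "shares_of": 0.5}

print(assign_senses(apple, committees))  # ['fruit', 'company']
```

In this toy example the word's initial similarity to the "company" committee is below the threshold, because the dominant "fruit" features swamp it. Only after those features are removed does the less frequent sense become similar enough to be assigned, which is the effect the feature-removal step is meant to achieve.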
On the task of recovering the concepts and word senses in WordNet, CBC achieved 61% precision and 51% recall. CBC outputs a flat list of concepts (i.e., there is no hierarchical information).
Acquiring the Resource
Both an implementation of the CBC algorithm and the CBC knowledge collection are available for research purposes by contacting the authors.
Please refer to either of the following publications when using this resource:
- Patrick Pantel. 2003. Clustering by Committee. Ph.D. Dissertation. Department of Computing Science, University of Alberta.
- Patrick Pantel and Dekang Lin. 2002. Discovering Word Senses from Text. In Proceedings of ACM Conference on Knowledge Discovery and Data Mining (KDD-02). pp. 613-619. Edmonton, Canada.