Clustering by Committee

CBC (Clustering by Committee) is both a clustering algorithm and a resulting knowledge collection created by Patrick Pantel and Dekang Lin at the University of Alberta. The algorithm is a general-purpose partitioning clustering algorithm. The authors have used it more specifically for automatically clustering documents and for automatically inducing concepts and word senses.

The CBC knowledge collection consists of concepts: clusters of instances, like the three shown below, each paired with a template of typical grammatical contexts (lexical co-occurrence vectors) extracted from a text corpus:

(A) multiple sclerosis, diabetes, osteoporosis, cardiovascular disease, Parkinson's, rheumatoid arthritis, heart disease, asthma, cancer, hypertension, lupus, high blood pressure, arthritis, emphysema, epilepsy, cystic fibrosis, leukemia, hemophilia, Alzheimer, myeloma, glaucoma, schizophrenia, ...
(B) Mike Richter, Tommy Salo, John Vanbiesbrouck, Curtis Joseph, Chris Osgood, Steve Shields, Tom Barrasso, Guy Hebert, Arturs Irbe, Byron Dafoe, Patrick Roy, Bill Ranford, Ed Belfour, Grant Fuhr, Dominik Hasek, Martin Brodeur, Mike Vernon, Ron Tugnutt, Sean Burke, Zach Thornton, Jocelyn Thibault, Kevin Hartman, Felix Potvin, ...
(C) pink, red, turquoise, blue, purple, green, yellow, beige, orange, taupe, white, lavender, fuchsia, brown, gray, black, mauve, royal blue, violet, chartreuse, teal, gold, burgundy, lilac, crimson, garnet, coral, grey, silver, olive green, cobalt blue, scarlet, tan, amber, ...

Using sets of representative elements called committees, CBC discovers concept signatures that unambiguously describe the members of a possible concept (e.g., diseases, hockey goalies, and colors). Concept signatures are templates of grammatical relations (lexical co-occurrence vectors) that apply to most instances of the concept. The algorithm first discovers committees that are well scattered in the similarity space. It then assigns each word to its most similar committees, each of which represents a final cluster. After assigning a word to a committee, CBC removes from the word the features (syntactic co-occurrences) that overlap with that committee's features before assigning the word to another committee. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses.
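The assignment step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the similarity threshold, the use of cosine similarity, and the toy feature weights are all assumptions made for illustration, and the sketch omits committee discovery as well as any check that a newly chosen committee differs from those already assigned.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_senses(word_vec, committees, threshold=0.2):
    """Assign one word to committees, removing overlapping features each time.

    word_vec:   {feature: weight} for the word (hypothetical representation)
    committees: {name: {feature: weight}} committee feature vectors
    threshold:  hypothetical similarity cutoff below which assignment stops
    """
    remaining = dict(word_vec)      # features not yet explained by a sense
    candidates = dict(committees)   # committees not yet tried
    senses = []
    while remaining and candidates:
        best = max(candidates, key=lambda name: cosine(remaining, candidates[name]))
        if cosine(remaining, candidates[best]) < threshold:
            break
        senses.append(best)
        # Remove the features shared with the chosen committee so that
        # less frequent senses can surface on the next pass.
        for feature in candidates[best]:
            remaining.pop(feature, None)
        del candidates[best]
    return senses

# Toy data with invented features and weights.
committees = {
    "fruit":   {"eat:obj": 1.0, "peel:obj": 1.0, "ripe:mod": 1.0},
    "color":   {"paint:obj": 1.0, "bright:mod": 1.0, "dark:mod": 1.0},
    "disease": {"treat:obj": 1.0, "diagnose:obj": 1.0},
}
orange = {"eat:obj": 3.0, "peel:obj": 2.0, "paint:obj": 1.0, "bright:mod": 1.0}
print(assign_senses(orange, committees))
```

In this toy run the dominant "fruit" features are stripped after the first assignment, which lets the weaker "color" sense score highly on the second pass instead of being masked.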

On the task of recovering the concepts and word senses in WordNet, CBC achieved 61% precision and 51% recall. CBC outputs a flat list of concepts (i.e., there is no hierarchical information).

Acquiring the Resource

Both an implementation of the CBC algorithm and the CBC knowledge collection are available for research purposes by contacting the authors.

Demos

CBC Search Engine

References

Please refer to either of the following publications when using this resource:

Patrick Pantel. 2003. Clustering by Committee. Ph.D. Dissertation. Department of Computing Science, University of Alberta.
Patrick Pantel and Dekang Lin. 2002. Discovering Word Senses from Text. In Proceedings of ACM Conference on Knowledge Discovery and Data Mining (KDD-02). pp. 613-619. Edmonton, Canada.

Authors

Patrick Pantel

Dekang Lin
