Tutorial 5: Mining Unstructured Data

NOTE: THIS TUTORIAL HAS BEEN CANCELLED

Ronen Feldman

The information age has made it easy to store large amounts of data. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, while the amount of data available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes.

Text Mining is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, NLP, IR, and knowledge management. Text Mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and visualization of the results.

In this tutorial we will present the general theory of Text Mining and will demonstrate several systems that use these principles to enable interactive exploration of large textual collections. We will present a general architecture for text mining and will outline the algorithms and data structures behind the systems. Special emphasis will be given to efficient algorithms for very large document collections, tools for visualizing such document collections, the use of intelligent agents to perform text mining on the Internet, and the use of information extraction to better capture the major themes of the documents. The tutorial will cover the state of the art in this rapidly growing area of research. Several real-world applications of text mining will be presented.
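
As a rough illustration of the stages named above (preprocessing, an intermediate representation, and analysis of that representation), the following minimal Python sketch extracts terms from a few toy documents, stores each document as a set of terms, and derives simple term associations from co-occurrence counts. It is not taken from the systems presented in the tutorial. The stop-word list, the sample documents, the function names, and the support and confidence thresholds are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "on", "for", "is"}

# Toy document collection, invented for this sketch.
DOCS = [
    "Acme acquires Widget Corp in merger",
    "Merger talks between Acme and Globex",
    "Globex reports quarterly earnings",
    "Acme merger approved",
]


def extract_terms(text):
    """Preprocessing: lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return {tok for tok in tokens if tok and tok not in STOP_WORDS}


def build_representation(docs):
    """Intermediate representation: one set of terms per document."""
    return [extract_terms(doc) for doc in docs]


def term_distribution(term_sets):
    """Distribution analysis: number of documents that contain each term."""
    counts = Counter()
    for terms in term_sets:
        counts.update(terms)
    return counts


def association_rules(term_sets, min_support=2, min_confidence=0.6):
    """Association generation over term pairs.

    support    = number of documents containing both terms
    confidence = support / number of documents containing the left-hand term
    """
    doc_freq = term_distribution(term_sets)
    pair_freq = Counter()
    for terms in term_sets:
        pair_freq.update(combinations(sorted(terms), 2))

    rules = []
    for (a, b), support in pair_freq.items():
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / doc_freq[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    # Highest confidence first, then highest support, then alphabetical.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


if __name__ == "__main__":
    representation = build_representation(DOCS)
    for lhs, rhs, support, confidence in association_rules(representation):
        print(f"{lhs} => {rhs} (support={support}, confidence={confidence:.2f})")
```

Keeping the per-document term sets as a separate intermediate representation means the same stored structure could feed other analyses, such as clustering or trend analysis, without reprocessing the raw text.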

AUDIENCE

The tutorial should be of interest to practitioners from NLP, IR, Data Mining, Bioinformatics, and Knowledge Management who are interested in this fast-growing research area.

Prerequisites

The tutorial is suitable for a general audience. No special knowledge is needed, as the tutorial is self-contained. The tutorial level is intermediate.

Coverage

The tutorial will cover the state of the art in Text Mining and Information Extraction. The tutorial will be broad in nature, but we will go deeper into several topics. These topics will include the architecture of Text Mining systems and preprocessing techniques such as part-of-speech tagging, zoning, morphological analysis, and shallow parsing.

TUTORIAL OUTLINE

  • Introduction to Text Mining
    • The need for Text Mining
    • What is unique about Text Mining
    • Term Extraction
    • Introduction to Information Extraction
  • A general architecture for Text Mining
    • Browsing
    • Text Analytics
      • Taxonomy Construction and Refinement
      • Comparing Distributions
      • Trend Analysis
      • Isolating interesting patterns
      • Association Generation
    • Text Mining Query Languages
    • Visualizations
  • Information Extraction in Depth
    • Entity, Fact and Event Extraction
    • Preprocessing Techniques
    • Types of IE systems
      • Rule Based Systems
      • Machine Learning Based Systems
        • Bootstrapping Approaches
        • Mutual Bootstrapping
        • Multi-Class Bootstrapping
        • Classic HMM models
        • Bigram models
        • Creating hybrids between ML and RB systems
        • Classification based IE
      • Unsupervised Learning
        • The KnowItAll approach
        • KnowItNow
        • The URES System
    • Anaphora Resolution
    • Environments for Creating IE Systems
    • Evaluation of IE Systems
  • Applications of Text Mining
    • Financial Applications
    • Military Applications
    • Information Extraction for Bioinformatics
    • Competitive Intelligence Applications

Ronen Feldman is a senior lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel, and the Director of the Data Mining Laboratory. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, his M.Sc. in Computer Science from Bar-Ilan University, and his Ph.D. in Computer Science from Cornell University in NY. He was an Adjunct Professor at NYU Stern Business School. He is the founder of ClearForest Corporation, a Boston-based company specializing in the development of text mining tools and applications. He has given more than 30 tutorials on text mining and information extraction and has authored numerous papers on these topics. He has just finished writing his book, "The Text Mining Handbook," to be published by Cambridge University Press in early 2006.