Tutorial 4: From Web Content Mining to Natural Language Processing

Bing Liu


Web mining is a growing research area. It consists of Web usage mining, Web structure mining, and Web content mining. Web usage mining refers to the discovery of user access patterns from Web usage logs. Web structure mining tries to discover useful knowledge from hyperlinks. Web content mining aims to extract/mine useful information or knowledge from Web page contents. This tutorial focuses on Web content mining and its extensive connection with natural language processing (NLP).

In the past few years, there has been a rapid expansion of activity in Web content mining. This is not surprising given the huge amount of valuable information of almost every imaginable type on the Web and the significant economic benefits of such mining. However, due to the heterogeneity and lack of structure of Web data, automated discovery of targeted or unexpected knowledge/information still presents many challenging problems. This tutorial will introduce several such problems, e.g., data/information extraction, Web information integration, opinion mining, and information synthesis, along with some state-of-the-art techniques for dealing with them. These problems all have strong connections with NLP. In the tutorial, I will pay special attention to these connections and discuss how NLP researchers may contribute towards solving the problems. Many real-life examples will also be given to help participants understand the research concepts and see how the technologies may be deployed in real-life applications. The tutorial will thus mix research and industry perspectives, addressing seminal research ideas and looking at the technology from an industry angle.

Tutorial Outline

  1. Introduction
    • Web content mining
    • Opportunities and challenges
  2. Data extraction
    • The problem
    • Wrapper induction
    • Automated extraction
    • Using language clues
  3. Information integration
    • The problem
    • Schema matching as synonym discovery
    • Linguistic-based approaches
    • Phrase-correlation-based approach
    • Instance-based approach
  4. Opinion mining
    • User generated content on the Web
    • Sentiment classification
    • Opinion extraction and summarization
    • Comparative opinions
    • Opinion search and spam
  5. Information synthesis
    • The problem
    • Exploiting information redundancy
    • Using clustering
    • Using existing text organization
    • Using syntactic patterns
    • Integrating information
    • Need for more NLP
  6. Web page pre-processing
  7. Some other topics
  8. Summary

Bing Liu is an associate professor in the Department of Computer Science, University of Illinois at Chicago. He received his PhD in Artificial Intelligence from the University of Edinburgh. His research interests include data mining, Web mining and text mining. He has published extensively in these areas in leading conferences, e.g., KDD, WWW, AAAI, IJCAI, ICML and SIGIR. Recently, he also published a textbook entitled "Web Data Mining". Since 2003, he has been working on Web mining and text mining, in particular data extraction and opinion mining, and has given several invited talks on these topics, including one at the COLING/ACL-06 Workshop on Sentiment and Subjectivity in Text. In terms of professional service, he has served as program chair and vice chair of several conferences, and as an associate editor of IEEE Transactions on Knowledge and Data Engineering. He is currently an associate editor of SIGKDD Explorations and is on the editorial boards of three other journals. Additional information about him can be found at http://www.cs.uic.edu/~liub.


Organized by:

ACL UFAL
