Finite-state Language Processing
Finite-state technology is becoming an
invaluable tool for various levels of language processing. It is the
computational means of choice for describing the phonology, lexicon
and morphology of natural languages, but is used more and more for
other purposes as well, including (shallow) parsing, word-level
translation, named entity recognition, etc.
The tutorial will provide an introduction to the
technology and its many applications in natural language
processing. It starts with the very basics of finite-state devices and
regular expressions and concludes with a sketch of how to design and
implement a large-scale project. Several examples of real applications
illustrate the formal material.
- Finite-state automata (FSA)
- Regular expressions
- Operations on automata
- Applications of FSA in NLP
- Storing lexicons
- Regular relations
- Finite-state transducers (FSTs)
- Properties of FSTs
- Applications of FSTs in NLP
- Morphological analysis
- Part of speech tagging
- Translation dictionaries
- Extended regular expression languages
- Replace rules and composition
- Morphological analysis and generation
- Shallow parsing
- Available tools
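To make the "Storing lexicons" item in the outline above concrete, here is a minimal sketch of a word list stored as a deterministic acyclic finite-state automaton (a trie). It is an illustration only, not material from the tutorial; the function names `build_lexicon_fsa` and `accepts` and the toy word list are assumptions made for this example, and a real system would typically also minimize the automaton.

```python
def build_lexicon_fsa(words):
    """Build a deterministic acyclic automaton (a trie) accepting exactly `words`.

    States are integers; state 0 is the initial state.
    transitions[state] maps an input symbol to the next state.
    """
    transitions = [{}]
    final = set()
    for word in words:
        state = 0
        for ch in word:
            if ch not in transitions[state]:
                transitions.append({})                      # allocate a new state
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        final.add(state)                                    # end of a word
    return transitions, final


def accepts(fsa, word):
    """Return True if the automaton accepts `word`."""
    transitions, final = fsa
    state = 0
    for ch in word:
        state = transitions[state].get(ch)
        if state is None:                                   # no transition: reject
            return False
    return state in final


if __name__ == "__main__":
    fsa = build_lexicon_fsa(["walk", "walks", "walked", "talk"])
    for candidate in ["walks", "wal", "talked"]:
        print(candidate, accepts(fsa, candidate))
```

Shared prefixes such as "walk" are stored once, which is the basic reason automata are attractive for lexicon storage; lookup time depends only on the length of the word, not on the size of the lexicon.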
This tutorial is designed for computer scientists and
linguists alike. Acquaintance with basic formal language theory and
knowledge of some programming language will be useful, but not mandatory.
Shuly Wintner is an assistant professor in the
Department of Computer Science at the University of Haifa, Israel. His
research involves adaptation of computer science techniques and
paradigms to computational linguistics, with an emphasis on formal
grammars and finite-state devices.
What's New in Statistical Machine Translation
Kevin Knight and Philipp Koehn
Accurate translation requires a great deal of knowledge
about the usage and meaning of words, the structure of phrases, the
meaning of sentences, and which real-life situations are
plausible. Recently, there has been a fair amount of research into
extracting translation-relevant knowledge automatically from large
collections of manually translated texts, and over the past few years,
several statistical MT projects have appeared in North America,
Europe, and Asia, and the literature is growing substantially. We will
overview this progress.
- Data for MT.
- Bilingual corpora: what's out there?
- Acquisition and cleaning.
- What does three million words really mean?
- MT Evaluation.
- Manual and automatic.
- Core Models and Decoders
- IBM Models 1-5 and HMM models, training, decoding.
- Word alignment and its evaluation.
- Phrase models.
- Syntax-based translation and language models.
- Specialized Models.
- Named entity MT, numbers and dates, morphology, noun phrase MT.
- Available Resources.
- Tools and data.
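As a reference point for the "IBM Models 1-5 ... training" item in the outline above, the following is a minimal sketch of expectation-maximization training for IBM Model 1 lexical translation probabilities t(f | e). It is not taken from the tutorial; the function name `train_ibm_model1` and the three-sentence toy corpus are assumptions for illustration, and the NULL source word used in the full model is omitted to keep the sketch short.

```python
from collections import defaultdict


def train_ibm_model1(bitext, iterations=10):
    """Estimate t(f | e) with EM.

    bitext: list of (f_tokens, e_tokens) sentence pairs.
    Returns a dict-like object mapping (f, e) to t(f | e).
    """
    f_vocab = {f for fs, _ in bitext for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)            # uniform initialization

    for _ in range(iterations):
        count = defaultdict(float)              # expected count of (f, e) links
        total = defaultdict(float)              # expected count of e being linked
        # E-step: distribute each f over the e words in the same sentence.
        for fs, es in bitext:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts per e word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


if __name__ == "__main__":
    bitext = [
        ("das haus".split(), "the house".split()),
        ("das buch".split(), "the book".split()),
        ("ein buch".split(), "a book".split()),
    ]
    t = train_ibm_model1(bitext)
    for pair in [("das", "the"), ("haus", "the"), ("buch", "book")]:
        print(pair, round(t[pair], 3))
```

Even on this tiny corpus the co-occurrence pattern lets EM concentrate probability on the intuitive word pairs, which is the behavior the higher IBM models build on.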
The target audience for this tutorial is anyone
interested in machine translation of human languages.
Kevin Knight is a Senior Research Scientist at the
USC/Information Sciences Institute and a Research Associate Professor
in the Computer Science Department at USC. He has written a number of
articles on statistical MT, plus a widely-circulated MT workbook
(http://www.isi.edu/natural-language/mt/wkbk.rtf). Dr. Knight has
given several invited talks on machine translation at recent AMTA and
Philipp Koehn completed his Ph.D. in Computer Science at
the University of Southern California in Fall 2003. He has written a
number of articles on topics in statistical machine translation,
including bilingual lexicon induction from monolingual corpora,
word-level translation models, and translation with scarce
resources. He has also worked at AT&T Laboratories on text-to-speech
systems, and at WhizBang! Labs on text categorization.
Semantic Inference for Question Answering
Sanda Harabagiu and Srini Narayanan
The AQUAINT QA program has provided solid evidence that
potential users of QA systems appear to have limited need for factoid
question answering, but rather much more need to have systems that can
deal with complex reasoning about causes, effects, chains of
hypotheses and so on -- capabilities that current systems do not
adequately support. Approaching this goal requires combining
sophisticated systems for knowledge representation and inference with
methods to extract such deep semantic relations from linguistic
input. We believe that recent important advances in knowledge
representation and inference, the widespread availability of
semantically motivated resources such as WordNet and FrameNet, and
successful recent efforts at textual analysis including
predicate-argument extraction point the way to building the next
generation of semantically rich QA systems. This tutorial will serve
as a survey of important recent progress on semantically based QA,
articulating connections and highlighting efforts that have brought
one or more of these techniques to bear on QA system design and
- Methods to extract semantic relations from text.
- Statistical techniques.
- Knowledge-intensive techniques.
- Supervised and unsupervised learning techniques.
- Knowledge representation and inference techniques for QA.
- Logical inference methods.
- Structured Probabilistic Methods.
- Probabilistic Relational Models for inference with uncertainty.
- Models of Event Structure.
- Ontologies and Linguistic resources for QA.
- Linguistic resources.
- Ontologies and resources on the Semantic Web.
This tutorial is designed for computer scientists and
linguists alike. Acquaintance with statistical techniques and
Knowledge Representation will be useful, but not mandatory.
Dr. Sanda Harabagiu is an Associate Professor and the
Erik Jonsson School Research Initiation Chair in the Department of
Computer Science at the University of Texas at Dallas. She earned her Ph.D.
in Computer Engineering from the University of Southern California and a
Research Doctorate from Tor Vergata University in Rome, Italy. Her
research interests are in the area of Question Answering, Information
Extraction, Reference Resolution and Text Summarization.
Dr. Srini Narayanan is a Senior Research Scientist at
the International Computer Science Institute (ICSI), Berkeley, where he
is a co-PI with the NTL (http://www.icsi.berkeley.edu/NTL)
and FrameNet (http://www.icsi.berkeley.edu/~framenet)
projects. He obtained his PhD in Computer Science from the University
of California, Berkeley in 1998. His research interests include
computational semantics and metaphor, probabilistic dynamic models,
and computational neuroscience.
Graphical Models in Speech and Language Research
Graphical models (GMs) are a general statistical
abstraction that can be used to describe a wide variety of problem
domains. Recently, significant research has occurred on their
application to speech and language processing. GMs offer a
mathematically formal but widely flexible means for solving many of
the problems encountered in these fields. Because of their
generality, GMs make it possible to rapidly go from novel idea to
working implementation. In this advanced tutorial, we will survey how
GMs can be used to represent structures and models in speech and
We start with concepts and notation, including an
inspection of different forms of graphical models, and some intriguing
constructs these forms make available. This includes the notion of a
"switching network", where one portion of a network might determine
the existence of another, "sparse dependencies", where many
combinations of variable values are forced to have zero probability,
and "child observations", where influence can flow in the opposite
direction of a directed edge in a graph. We will in general see how
GMs can be viewed as a mathematically formal visual language, offering
a precise set of primitives for specifying statistical systems. We
will continue with an analysis of algorithms for performing
probabilistic inference on graphs, concentrating on both theory (e.g.,
when is inference tractable) and practice (data structures and
implementation). We will give special attention to the challenges
that arise when the underlying domain is temporal.
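For readers who want a concrete anchor for "probabilistic inference" in a temporal model, the sketch below computes the probability of an observation sequence under a hidden Markov model, the simplest dynamic graphical model, using the forward recursion. It is an illustrative assumption, not the tutorial's own material or the API of any GM toolkit; the function name `forward` and the two-state parameters are invented for the example.

```python
def forward(init, trans, emit, observations):
    """Return P(observations) under an HMM via the forward recursion.

    init[s]     : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of emitting symbol o in state s
    observations: non-empty list of symbol indices
    """
    n = len(init)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        alpha = [
            emit[s][obs] * sum(alpha[r] * trans[r][s] for r in range(n))
            for s in range(n)
        ]
    return sum(alpha)


if __name__ == "__main__":
    # Hypothetical two-state, two-symbol model.
    init = [0.6, 0.4]
    trans = [[0.7, 0.3], [0.4, 0.6]]
    emit = [[0.9, 0.1], [0.2, 0.8]]
    print(forward(init, trans, emit, [0, 1, 0]))
```

The recursion sums out one hidden variable per time step, so its cost grows linearly with sequence length rather than exponentially; this is the kind of tractability question the inference part of the tutorial is concerned with.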
Next, we will examine the ways GMs can represent speech
and language. This will include explicit representations of
hierarchical and temporal phenomena such as parameter sharing,
multi-stream models with varying degrees of asynchrony, and classifier
combination. We will see how these can be used to represent speech
evolution in terms of both phonology and articulation. We will also
cover graphical representations of language, including explicit
structures for N-grams, interpolation, skipping, hierarchical classes,
smoothing, back-off, factored representations, and other
forms. Furthermore, we will investigate how to describe statistical
machine translation via novel multi-dynamic graph representations.
Graphs can not only represent many well-known
statistical models; with only minor graph adjustments they can also
represent very different (and potentially novel) systems. We will
observe how deterministic dependencies, switching networks, and child
observations greatly facilitate this phenomenon. Moreover, we will see
how a graph's associated inferential machinery can shield a user from
needing to "reinvent the wheel" each time a new model is to be
investigated.
Lastly, we will briefly survey available GM toolkits and
their features. We will include a comparison of GM technology with its
modern alternatives. Tutorial attendees will thus learn not only how
to use GMs, but also how to decide when and where GM technology is
- Overview and Motivation.
- Different GM types, constructs, and structures.
- Theory and practice of probabilistic inference in Dynamic GMs.
- Explicit representations of temporal structures.
- Graphical models of speech.
- Graphical models of language.
- Graphical models of statistical machine translation.
- GM Toolkits.
- GM technology vs. its alternatives.
This tutorial will assume a basic knowledge of standard
language and speech processing, including knowledge of hidden Markov
models, maximum entropy models, and the many techniques that go into
making such models successful. It will also be assumed that the
audience is comfortable with basic statistical terminology.
Jeff A. Bilmes is an Assistant Professor in the
Department of Electrical Engineering at the University of Washington,
Seattle (adjunct in Linguistics and in Computer Science and
Engineering). He co-founded the Signal, Speech, and Language
Interpretation Laboratory at the University. He received a master's
degree from MIT, and a Ph.D. in Computer Science from the University of
California, Berkeley. Jeff is an author of the graphical models
toolkit (GMTK), and was a leader of the 2001 Johns Hopkins summer
workshop team applying graphical models to speech and language. His
primary research lies in statistical graphical models, speech,
language and time series processing, human-computer interfaces, and
probabilistic machine learning.
Large Scale Spoken Document Retrieval
Pedro J. Moreno and Jean Manuel Van Thong
Search engines like Google or Yahoo have been extremely
successful over the years in facilitating the search and retrieval of
text pages and written documents. However, only recently have these
technologies been extended to spoken documents. While there are
many similarities with standard text search engines, spoken document
retrieval differs enough to call for techniques of its own.
In this tutorial we provide an introduction to the field
of spoken document retrieval with a special emphasis on large audio
collections. We will start with a general introduction to speech
recognition, continue with various approaches to audio indexing,
and then give a global description of the architecture needed
for large scale indexing. We will conclude with several demos of
existing engines and technologies.
- Extracting metadata from raw audio.
- Fundamentals of speech recognition.
- Acoustic modeling.
- Word-based, phone-based.
- Language modeling
- Speech recognition approaches for audio indexing.
- Phonetic search.
- Word spotting approaches.
- Large vocabulary speech recognition.
- Syllable-based speech recognition.
- Limitations and advantages of all approaches.
- The out-of-vocabulary (OOV) problem.
- Text-audio alignment.
- Indexing and searching metadata.
- Searching versus indexing.
- Content segmentation.
- Modification to text indexing, long documents vs. short documents.
- Index fusion approaches.
- Acoustic search versus semantic search.
- Architecture design for large scale indexing.
- The web search model for audio indexing.
- Audio (and video) crawling.
- Audio to text transcription.
- Index construction.
- APIs for querying and index update.
- The user interface design.
- Putting everything together.
- Demos of several systems.
- Conclusions: Where is audio indexing headed?
This tutorial is designed for information retrieval researchers and
computer scientists; no previous knowledge of speech recognition
or information retrieval is assumed.
Pedro J. Moreno is a senior researcher at the Cambridge
Research Lab, which is part of Hewlett-Packard Labs. His main
interests are in the practical applications of machine learning
techniques in several fields such as audio indexing, image
retrieval, text classification and noise robustness. Dr. Moreno has
been involved in the design of HP Labs' audio indexing engine
SpeechBot. Lately his main interests are in the areas of
bioinformatics and bio signal interpretation.
JM Van Thong is a senior researcher at the Cambridge
Research Lab, which is part of Hewlett-Packard Labs. His current
research interests are bioinformatics, media indexing, and information
retrieval systems as well as user interfaces. During his 17 years
spent in research, JM has been involved in several successful projects
including SpeechBot, the first large scale web audio indexing system,
RedBot, a web-based tool for automatic red-eye correction, an
information retrieval system for hand-helds, a real-time streaming
phoneme recognizer for a facial animation package, and planar maps
technology for sketching software.
Statistical Language Models and Information Retrieval
Statistical language models play an important role in
virtually all kinds of tasks involving human language technologies.
In particular, they have been attracting much attention recently in
the information retrieval community due to their theoretical and
empirical advantages over traditional retrieval methods. A great deal
of recent work has shown that statistical language models not only
lead to superior empirical performance, but also facilitate parameter
tuning, open up possibilities for modeling non-traditional retrieval
problems, and in general provide a more principled way of modeling
The purpose of this tutorial is to systematically review
the recent progress in applying statistical language models to
information retrieval with an emphasis on the underlying principles
and framework, empirically effective language models, and language
models developed for non-traditional retrieval tasks. Tutorial
attendees can expect to learn the major principles and methods of
applying statistical language models to information retrieval, the
outstanding problems in this area, as well as obtain comprehensive
pointers to the research literature.
- Information Retrieval (IR)
- Statistical Language Models (SLMs)
- Applications of SLMs to IR
- The Basic Language Modeling Approach
- Query likelihood methods and their justification
- Smoothing of language models
- Improving the basic language modeling approach
- Feedback Language Models
- Different ways of feedback with language models
- Representative feedback models (relevance/query models, translation models)
- Language Models for different retrieval tasks
- Cross-language retrieval
- Distributed information retrieval
- TDT and information filtering
- Semi-structured information retrieval
- Subtopic retrieval
- A General Framework for Applying SLMs to IR
- SLMs vs. traditional methods: Pros & Cons
- Progress so far
- Challenges and future research directions
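To illustrate the "Query likelihood methods" and "Smoothing of language models" items in the outline above, here is a minimal sketch that scores a document by the log-probability of the query under a Dirichlet-smoothed unigram document model. It is an assumption-laden example rather than the tutorial's own formulation: the function name `query_log_likelihood`, the default `mu=2000.0`, the toy documents, and the choice to skip query words unseen in the whole collection are all illustrative.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection_counts, collection_len, mu=2000.0):
    """Score `doc` by log P(query | doc) with Dirichlet prior smoothing.

    P(w | doc) = (count(w, doc) + mu * P(w | collection)) / (len(doc) + mu)
    """
    doc_counts = Counter(doc)
    score = 0.0
    for w in query:
        p_coll = collection_counts[w] / collection_len
        if p_coll == 0.0:
            continue  # assumption: ignore words never seen in the collection
        score += math.log((doc_counts[w] + mu * p_coll) / (len(doc) + mu))
    return score


if __name__ == "__main__":
    docs = {
        "d1": "finite state transducers for morphology".split(),
        "d2": "statistical language models for retrieval".split(),
    }
    collection_counts = Counter(w for d in docs.values() for w in d)
    collection_len = sum(collection_counts.values())
    query = "language models".split()
    ranking = sorted(
        docs,
        key=lambda name: query_log_likelihood(
            query, docs[name], collection_counts, collection_len
        ),
        reverse=True,
    )
    print(ranking)
```

Smoothing with the collection model keeps a document from receiving zero probability merely because it lacks one query word, and the single parameter `mu` controls how strongly the collection statistics are trusted.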
The tutorial should appeal to both people working on
information retrieval with an interest in applying more advanced
language models and those who have a background in statistical
language models and wish to apply them to information
retrieval. Attendees will be assumed to know basic probability and
ChengXiang Zhai is an Assistant Professor of Computer
Science at the University of Illinois at Urbana-Champaign. He
received a Ph.D. in Computer Science from Nanjing University in 1990,
and a Ph.D. in Language and Information Technologies from Carnegie
Mellon University in 2002. He worked at Clairvoyance Corp. as a
Research Scientist and, later, a Senior Research Scientist from 1997
to 2000. His research interests broadly include information retrieval,
natural language processing, machine learning, and bioinformatics. His
most recent work, including his dissertation, is centered on
developing formal retrieval frameworks and applying statistical
language models to text retrieval, especially in directions such as
personalized search and semi-structured information retrieval. He has
served on the program committee for ACM SIGIR 2003, ACM SIGIR 2004,
ACL 2003, and ACM CIKM 2003. He is the IR program co-chair for ACM CIKM
2004. He is a recipient of the 2004 NSF CAREER award.