ARABIC NATURAL LANGUAGE PROCESSING

Nizar Habash

This tutorial provides NLP system developers and researchers with the background needed to work with the Arabic language, which has recently become the focus of an increasing number of projects in computational linguistics. The goal of the tutorial is to introduce the Arabic linguistic phenomena that need to be addressed and to review the state of the art in Arabic processing. Alternative approaches are presented and contrasted for their value in different application contexts (e.g., information retrieval versus machine translation).

The tutorial has four sections. The first discusses Arabic phonology and orthography, focusing on Arabic spelling peculiarities and their effect on Arabic processing; Arabic encoding issues are also addressed. The second presents and explains aspects of Arabic morphology, followed by a survey of different approaches to handling these phenomena. The third surveys Arabic syntactic phenomena, contrasts them with English syntactic phenomena, and discusses syntactic representation in the Penn Arabic Treebank. The fourth presents Arabic dialects and the kinds of problems they pose for Arabic NLP. Links to recent publications and available toolkits and resources are provided for all four sections.

This tutorial is designed for computer scientists and linguists alike. Acquaintance with basic formal language theory and knowledge of a programming language will be useful but is not mandatory.

TUTORIAL OUTLINE

  1. Arabic Orthography
    • Phonology
    • Orthography
    • Encoding Issues
  2. Arabic Morphology
    • Introduction to Arabic Morphology
    • Arabic Morphological Analysis/Generation
  3. Arabic Syntax
    • Arabic Syntactic Phenomena
    • Arabic Parsing Issues
  4. Arabic Dialects
    • Introduction to Arabic Dialects
    • Processing of Arabic Dialects

NIZAR HABASH received his PhD in 2003 from the Computer Science Department, University of Maryland, College Park. He is currently a researcher at the Center for Computational Learning Systems at Columbia University. His research includes work on machine translation, natural language generation, lexical semantics, and morphological analysis, generation, and disambiguation. His work on Arabic ranges from research on Arabic encoding issues to Arabic-English machine translation, and includes computational modeling of Arabic dialects for machine translation and speech recognition as well as Arabic dialect parsing.

