ACL Anthology

Steven Bird

ACL Anthology is a digital archive of research papers in computational
linguistics. Once it is completed, it will contain approximately 50,000
pages of documents spanning a 40-year period. All of this material is
being made freely available to ACL members and non-members alike, in
commemoration of our 40th anniversary. The Anthology will be officially
launched on the opening day of our 40th annual meeting, in Philadelphia on
July 8, 2002.

The impact of the ACL Anthology will be far reaching. It opens our
literature to newcomers and makes our field more accessible. The full
content index makes it easy for us to discover past papers that are
relevant to our research. As authors we benefit from the fact that online
articles are more highly cited, and the fact that our papers can be
discovered and indexed by citation services (e.g. CiteSeer). As teachers
we can assign readings to our students as easily as sending them a URL.
With easy access to early materials we can study the history of ideas in
our field, or analyze the development of new trends. Finally, we can turn
our own NLP technologies loose on our corpus of publications, and develop
innovative new services based on the content.

CURRENT STATUS

ACL Proceedings      1979-1999    introduction by John Nerbonne
CL Journal           1980-2001    introduction by Julia Hirschberg
EACL Proceedings     1983-1999    introduction by Donia Scott
ANLP Proceedings     1983-2000    introduction by Sergei Nirenburg
NAACL Proceedings    2000         introduction by Diane Litman
TINLAP Proceedings   1978, 1987   introduction by Bonnie Webber

Total Size: 22,000 pages, 120 volumes, 3,100 papers
equivalent to a stack 6'6" / 2m tall


ORGANIZATION

Anthology:
* PDFs in the format "PDF Image with Hidden Text", with HTML index pages
* Full content indexed by web search engines (e.g. Google)

Archive:
* 600dpi grayscale TIFFs
* RTF OCR output


THE PROCESS OF BUILDING THE ANTHOLOGY

1. Inventory
(a) Proceedings held in the ACL office - Priscilla Rasmussen
(includes page counts, number of copies, location, date)
(b) Maintaining inventory spreadsheet, track donors - Steven Bird
2. Collecting out-of-print materials
(a) Old ACL proceedings - Mike Rosner
(located several proceedings for late 1970s)
(still looking for any papers or abstracts prior to 1975)
(b) ACL workshop proceedings - Steven Bird
(located donors for about 20 volumes)
(c) COLING proceedings - Nicoletta Calzolari
(located donors for all COLINGs, 1965-)
3. Obtain copyright
(a) Identify copyright holder for each volume - Priscilla Rasmussen
(b) Obtain letter from U Montreal for COLING/ACL-98 - Elliott Macklovitch
(c) Announce intention to republish materials - Steven Bird
(d) Notify Chicago and MIT Press re MT&CL - Steven Bird
4. Scanning
(a) Specification - Steven Bird & Pinehurst Technologies
(b) Collecting, packing and shipping proceedings - Priscilla Rasmussen
(c) Scanning, deskewing, PDF compilation, OCR - Pinehurst Technologies
(d) Quality control - Steven Bird
5. Sponsorship
(a) North American sponsors - Eduard Hovy
(b) European sponsors - John Nerbonne
(c) Publicity, processing payments - Priscilla Rasmussen
6. Website
(a) Upload, indexes, search interface, logos from sponsors - Steven Bird
(b) Write introductory essays - John Nerbonne, Julia Hirschberg, Donia
Scott, Sergei Nirenburg, Diane Litman, Bonnie Webber
(c) Write historical essays - Aravind Joshi, Karen Sparck Jones, Mitch Marcus


ONGOING AND FUTURE WORK

Current scanning work (June 2002): 60 ACL workshops, TINLAP 1975

Future scanning work:
* COLING proceedings (1965-), COLING workshops (July-September 2002)
* Any ACL proceedings prior to 1975
* microfiche journal issues (1974-78)
* Finite String Newsletter (1964-1995)
* Mechanical Translation and Computational Linguistics (1961-68)

Other tasks:
* Integrate proceedings already published on CDROM - David Yarowsky
* BibTeX data - Doug Arnold
* Establish process for integrating future proceedings - Steven Bird
* Other formats: DjVu, XML OCR data - Yann LeCun, Steven Bird
* Clean up metadata - TBD
* Extract abstracts - TBD
* Extract references and crosslink - TBD
* DVD publication - Steven Bird
* Maintain long-term website - TBD
* Specialized metadata for subject classification - TBD
* Specialized metadata for language identification - TBD
* Identify mirror sites - Steven Bird


COSTS

The original estimate was 60c/page, for 300dpi b+w scans. This is the
current industry standard for written documents. However, at the ACL
executive committee meeting in Toulouse, it was agreed that 600dpi would be
preferable if it were not much more expensive. Of the b+w, 16-level and
256-level grayscale options, I have selected 16-level as the best balance
of quality and cost.

The price for 600dpi, 16-level grayscale scanning is 80c/page. This figure
includes deskewing and the creation of TIFF, RTF and PDF files according to
the specification. The XML index files cost around 50c/paper. Delivery on
DVD-ROM costs $175 per disc (each supplied in duplicate for safe storage in
two locations). When the per-paper and per-DVD costs are added in, the
project comes to almost exactly $1/page. If all the materials can be
collected, the entire project will cost around $50k.
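
The per-page figure can be checked with a short calculation. The sketch
below applies the unit prices quoted above to the totals listed under
CURRENT STATUS (22,000 pages, 3,100 papers). The number of discs is not
stated anywhere in this report, so the value of 8 distinct discs is purely
an assumption for illustration, as are the function and variable names.

```python
# Unit prices quoted in the COSTS section.
SCAN_PER_PAGE = 0.80    # 600dpi, 16-level grayscale, incl. deskewing and TIFF/RTF/PDF
INDEX_PER_PAPER = 0.50  # XML index files ("around 50c/paper")
DVD_EACH = 175.00       # per DVD-ROM; every disc is supplied in duplicate


def project_cost(pages, papers, dvd_titles):
    """Return (total_dollars, dollars_per_page).

    dvd_titles is the number of distinct discs; each one is bought twice.
    """
    scanning = pages * SCAN_PER_PAGE
    indexing = papers * INDEX_PER_PAPER
    dvds = dvd_titles * 2 * DVD_EACH
    total = scanning + indexing + dvds
    return total, total / pages


# Page and paper counts come from CURRENT STATUS.
# ASSUMPTION: 8 distinct discs; the report does not give the disc count.
total, per_page = project_cost(pages=22_000, papers=3_100, dvd_titles=8)
print(f"total ${total:,.0f}, ${per_page:.2f}/page")
```

Under these assumptions the scanning charge dominates, and the per-paper
and per-disc charges bring the overall rate to roughly $1/page. Scaling
that rate to the full 50,000 pages gives the estimate of around $50k.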

The scanning company, Pinehurst Technologies Inc, was highly recommended by
Michael Ley (Editor, ACM SIGMOD Anthology) and Bernard Rous (Electronic
Publishing Program Director, ACM), and their work has been of the highest
quality.

As of June 2002, over 90% of the needed funds have been either donated or
pledged. This funding is being used solely for the scanning work. The
rest of the anthology is "sweat-ware".


COPYRIGHT

The ACL and ICCL hold copyright on all the volumes being scanned. However,
in the case of proceedings volumes, this copyright applies only to the
compilation; the copyright of the individual papers remains with the
authors. Any authors who object to having their papers included in the
anthology should contact Steven Bird to have their papers removed.


ANTHOLOGY SUPPORTERS

As of June 2002, the supporters of the anthology project are as follows:

DONORS OF MULTIPLE VOLUMES: Betty Walker, Jane Robinson, Helen Gigley,
Karen Sparck Jones, Jussi Karlgren, Eva Hajicova, Petr Sgall, Antonio
Zampolli, Nicoletta Calzolari, Elliott Macklovitch, Kenneth Church, Eduard
Hovy, Steven Bird.
DONORS OF SINGLE VOLUMES: Elizabeth Andre, Sabine Bergler, Jason Eisner,
Andy Kehler, Guy Lapalme, Alberto Lavelli, Winfried Lenders, Bente
Maegaard, Mark Maybury, Kathleen McCoy, David McDonald, Ruslan Mitkov,
Johanna Moore, John Nerbonne, Yael Netzer, Miles Osborne, Hannes Pirker,
James Pustejovsky, Mike Rosner, Donia Scott, Evelyne Viegas, Bonnie Webber,
Yorick Wilks.
DONORS OF FINANCIAL SUPPORT: Margaret Fleck, Lillian Lee, Pierre Nugues.
INSTITUTIONAL DONORS: Gold level: Advanced Research and Development
Activity; Silver level: German Research Center for Artificial Intelligence,
Information Sciences Institute (USC), Linguistic Data Consortium (UPenn),
MITRE Corporation, Netherlands Organization for Scientific Research, Xerox
Research Centre Europe; Bronze level: Center for Language and Speech
Processing (JHU), City University of Hong Kong, Columbia University,
Information Technology Research Institute (Brighton), Macquarie University,
School of Behavioral and Cognitive Neurosciences (Groningen).
AUTHORS OF ANTHOLOGY INTRODUCTIONS: John Nerbonne, Julia Hirschberg, Donia
Scott, Diane Litman, Mark Johnson, Martin Kay, Sergei Nirenburg, Bonnie
Webber, Ralph Weischedel, Mitch Marcus, Karen Sparck Jones, and Aravind
Joshi.
ASSOCIATE EDITORS: Eduard Hovy, John Nerbonne, Mike Rosner, Nicoletta
Calzolari, David Yarowsky and Doug Arnold.

The ACL Executive Committee expresses its deep gratitude to these
individuals and institutions for their wonderful support.

----