ACL Anthology: Importing New Materials

The bulk of the materials in the Anthology were scanned from paper copy. Since 2000, ACL materials have been 'born digital' and distributed on CD-ROM. Before these materials can be incorporated into the Anthology, their contributors must map them to the required structure and supply the required metadata. This document describes these requirements in detail. Please read the entire document if you are planning to contribute to the Anthology.

ACL Standard CD-ROM Layout: Recently, the ACL has adopted a layout and associated software tools developed by David Yarowsky. Conferences that use this layout can benefit from the tools that automatically map from the CD-ROM layout to the required structure for the Anthology. See Delivery below.

Responsibilities

Conference or publications chairs should work with the chairs of any associated workshops and/or collocated events to decide who will be responsible for importing the proceedings into the ACL Anthology. For economies of scale, this responsibility lies by default with the conference publications chair.

The ACL Anthology Editor is responsible for ensuring that data is in the correct format. However, the responsibility for individual metadata corrections during the ingestion process lies with the contributor. Long-term maintenance and correction of metadata will be borne by the ACL Anthology Editor.

Delivery

Contributors of Anthology materials should tar and gzip the directory, then contact the ACL Anthology Editor (currently Min-Yen Kan) for instructions on SCPing the archive to the Anthology preview upload area.
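The packaging step can be sketched in Python as follows. This is a minimal illustration, not part of the official tooling: the function name make_archive and the example paths are hypothetical, and the upload itself still has to follow the Editor's instructions.

```python
import tarfile
from pathlib import Path


def make_archive(directory, archive_path):
    """Pack `directory` into a gzip-compressed tar archive (hypothetical helper)."""
    src = Path(directory).resolve()
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores only the directory's own name, not its absolute path.
        tar.add(str(src), arcname=src.name)
    return archive_path
```

For example, make_archive("P90", "P90.tar.gz") would produce an archive whose entries all sit under P90/.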

Conference or workshop proceedings chairs using the aclpub publication software package can use the build target anthologize.pl (found under aclpub/bin/) to build the deliverable target in one step. For more information about the aclpub package, see http://www.cis.udel.edu/~carberry/ACL/publications-repository-access.txt.

Upon receipt, the Editor will build a preview site for contributors to check for errors before pushing the final version to the live ACL Anthology site. An announcement will also be made to interested parties via the ACL Anthology Google Group (Note: you must have a Google account to access this group).

Paper numbers

A key notion in file naming is the paper number. The general rule is that papers are numbered consecutively within the bound volume in which they appear. When a proceedings is divided into multiple volumes, paper numbering restarts at 1 with each new volume. When multiple proceedings are bound into a single volume (e.g. N00), they are treated as multiple volumes.

Any front matter is given the number zero. Any back matter is given the number one more than the last paper in the volume. Front and back matter that appears internally to a volume (e.g. in N00) will be counted just like an ordinary paper.
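The numbering rule for a single volume can be sketched as below. This is an illustrative helper only (number_volume and its parameters are assumptions, not part of aclpub). It covers the ordinary case where front matter comes first and back matter comes last. Matter that appears inside a volume should simply be passed in the papers list so that it is counted like an ordinary paper.

```python
def number_volume(papers, front_matter=True, back_matter=False):
    """Return (paper_number, item) pairs for one volume (hypothetical helper)."""
    numbered = []
    if front_matter:
        # Front matter always takes number zero.
        numbered.append((0, "front matter"))
    # Papers are numbered consecutively from 1 within the volume.
    for number, paper in enumerate(papers, start=1):
        numbered.append((number, paper))
    if back_matter:
        # Back matter takes one more than the last paper number.
        numbered.append((len(papers) + 1, "back matter"))
    return numbered
```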

File naming

All PDF files will begin with the set specifier (one character) and the year (two digits), followed by an optional within-year issue number, then a paper number. The file names fit within the 8.3 DOS naming constraint for maximum portability, as recommended by Adobe. PDF filenames are globally unique, to support subsetting (saving an ad hoc collection of papers to a single directory). The numeric fields within filenames are zero-padded to ensure filenames have fixed width. Files are grouped by set and year, as follows:

Codes: A, C, D, E, H, I, L, M, N, P, S, T, X
Set (filename): Proceedings (syy-xnnn)
Example: P90-1001.pdf
Comments: This is the first paper appearing in the first (or only) volume of the P90 proceedings. Most proceedings have just one volume. In rare cases a proceedings volume has a supplement, which should be numbered as a separate volume. In rare cases multiple proceedings volumes are bound into one (N00/A00), and these should be treated as separate volumes. Each conference proceedings may have up to 999 papers; conferences with more papers than this upper limit should consult the ACL Anthology Editor on how to split the proceedings into separate volumes. Proceedings chairs of conferences may choose at their discretion how they would like to partition volumes, although segmentations involving main/full papers, short papers, demonstrations and tutorial abstracts are most common.

Codes: J, F
Set (filename): Journal (Jyy-xnnn)
Example: J90-2001
Comments: This is the first paper in the second issue of J90. For combined issues, like 3/4, use the first number of the sequence (e.g. if a journal year consists of combined issues 1/2 and 3/4, use J90-1 and J90-3 only). The 'F' prefix is for the Finite String, a newsletter (now obsolete) that used to be part of the Journal.

Codes: W
Set (filename): Workshops (Wyy-xxnn)
Example: W90-0201
Comments: This is the first paper in the second workshop in 1990. There is space for up to 100 workshops per year, and up to 99 papers per workshop. If a workshop has more than 99 papers in a year, please consult the ACL Anthology Editor (currently Min-Yen Kan) first. In this case, a separate set letter code may be established for the venue.
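The naming patterns above (syy-xnnn for proceedings and journals, Wyy-xxnn for workshops) can be checked mechanically. The following is a sketch under stated assumptions: AnthologyId, parse_id and format_id are hypothetical names, and the only rule encoded is the one given here, namely that the 'W' set uses a two-digit offset and two-digit paper number while every other set uses one digit plus three.

```python
import re
from typing import NamedTuple


class AnthologyId(NamedTuple):
    set_code: str  # one-letter set specifier, e.g. "P", "J", "W"
    year: int      # two-digit year
    volume: int    # 'x' (volume or issue) or 'xx' (workshop offset)
    paper: int     # 'nnn' or 'nn'


_ID = re.compile(r"([A-Z])(\d{2})-(\d{4})")


def _volume_width(set_code):
    # Workshops use two digits for the offset; all other sets use one.
    return 2 if set_code == "W" else 1


def parse_id(anthology_id):
    """Split an identifier such as 'W90-0201' into its fields."""
    match = _ID.fullmatch(anthology_id)
    if match is None:
        raise ValueError(f"not an Anthology ID: {anthology_id!r}")
    set_code, year, digits = match.groups()
    width = _volume_width(set_code)
    return AnthologyId(set_code, int(year), int(digits[:width]), int(digits[width:]))


def format_id(set_code, year, volume, paper):
    """Build a zero-padded, fixed-width identifier from its fields."""
    vol_width = _volume_width(set_code)
    paper_width = 4 - vol_width
    if not 0 <= volume < 10 ** vol_width:
        raise ValueError("volume or workshop offset out of range")
    if not 0 <= paper < 10 ** paper_width:
        raise ValueError("paper number out of range for this set")
    return f"{set_code}{year % 100:02d}-{volume:0{vol_width}d}{paper:0{paper_width}d}"
```

Appending ".pdf" to the result of format_id gives the file name, so the eight-character identifier plus extension stays within the 8.3 constraint.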

Workshop chairs should contact the ACL Anthology Editor to receive their workshop number offset (the 'xx' portion of the ACL Anthology ID). If your workshop is attached to a conference as a satellite event, please contact the proceedings chair for the main conference to receive the offset ID, as it is easiest to allocate offsets as a whole block. Conversely, if you are the proceedings chair for a conference that has satellite workshops, please contact the ACL Anthology Editor with the final list of titles of the workshops (make certain the workshops will actually be run) so that the editor can allocate a suitable block of offsets to the workshops.

The two-digit year scheme will be good for another 50 years, by which time I hope the DOS 8.3 restriction is known only to digital archaeologists. Occasionally, old identifiers may be superseded by newer identifiers; see the indirection listing for complete details.

Directory structure

The PDF files are grouped into directories by set (at the top level) and then by year. The above examples are stored as follows: P/P90/P90-1001.pdf, J/J90/J90-2001.pdf, W/W90/W90-0201.pdf, F/F71/F71-0010.pdf.
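The mapping from an identifier to its storage location can be sketched as follows. The helper name pdf_path is hypothetical; the layout it encodes (set letter, then set plus year, then file name) is the one shown in the examples above.

```python
from pathlib import PurePosixPath


def pdf_path(anthology_id):
    """Return the relative path of a paper's PDF, e.g. P/P90/P90-1001.pdf."""
    set_code = anthology_id[0]               # top-level directory: the set letter
    set_year = anthology_id.split("-")[0]    # second level: set letter plus year
    return PurePosixPath(set_code, set_year, anthology_id + ".pdf")
```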

For joint conferences that have been held, directory links are created. Thus C98 is just a pointer to P98, and A00 is just a pointer to N00. Which code letter a joint conference's proceedings will appear under is set at the editor's discretion.

Referencing Anthology Papers

Although the directory structure described above is up to date, please note that if you wish to reference a paper (for example in creating BibTeX records for a proceedings if you are a proceedings chair), you should use the canonical format, as the actual path and location of the Anthology may vary over time. The canonical reference form uses indirection to locate the correct copy.

http://www.aclweb.org/anthology/P90-1001

Note that the canonical form does not use .pdf extension (as papers may be saved in different formats in the future), and substitutes anthology/ for the current implemented path (as this may also change over time).

In some cases, particularly with more recent materials published using the aclpub package, you can reference a complete volume of a conference also using a standard identifier.

http://www.aclweb.org/anthology/P08-1

should reference the full paper volume of the ACL proceedings for 2008 (Note: currently this functionality is broken as the ACL Anthology redirection does not handle these instances correctly).

In addition to the paper itself (which can be referenced either through the base canonical form or with a .pdf suffix added), BibTeX (.bib) files are often available for individual papers, and ACL Anthology XML (.xml) files for complete volumes and proceedings.
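A proceedings chair generating BibTeX records might build canonical references with a small helper like the one below. This is a sketch only: canonical_url is a hypothetical name, and the assumption that the .bib and .xml suffixes are appended directly to the canonical form is inferred from the description above, so a given file may not actually exist.

```python
ANTHOLOGY_BASE = "http://www.aclweb.org/anthology/"

# Suffixes mentioned in this document; "" is the bare canonical form.
KNOWN_SUFFIXES = ("", ".pdf", ".bib", ".xml")


def canonical_url(anthology_id, suffix=""):
    """Return the canonical reference for a paper or volume identifier."""
    if suffix not in KNOWN_SUFFIXES:
        raise ValueError(f"unexpected suffix: {suffix!r}")
    # The canonical form never embeds the on-disk directory structure.
    return ANTHOLOGY_BASE + anthology_id + suffix
```

Using the bare form (no suffix) in citations is the safer choice, since the canonical form is the one intended to remain stable over time.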

Metadata

Anthology index pages are generated from an XML file containing information about volumes, paper titles and authors. Each publication has an XML file, stored alongside the scanned images. Proceedings chairs using the aclpub publication package have a build target that will build a richer version of the metadata below. Here is a fragment of P/P03/P03.xml, generated automatically from the conference CD-ROM, which shows the minimal metadata required:

<?xml version="1.0" encoding="UTF-8" ?>
 <volume id="P03">
   <paper id="1000">
        <title>Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics</title>
   </paper>

   <paper id="1001">
        <title>Offline Strategies for Online Question Answering: Answering Questions Before They Are Asked</title>
        <author>Michael Fleischman</author>
        <author>Eduard Hovy</author>
        <author>Abdessamad Echihabi</author>
   </paper>

   <paper id="1002">
        <title>Using Predicate-Argument Structures for Information Extraction</title>
        <author>Mihai Surdeanu</author>
        <author>Sanda Harabagiu</author>
        <author>John Williams</author>
        <author>Paul Aarseth</author>
   </paper>

   <paper id="1003">
        <title>A Noisy-Channel Approach to Question Answering</title>
        <author>Abdessamad Echihabi</author>
        <author>Daniel Marcu</author>
   </paper>

   ...

 </volume>
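Contributors who are not using aclpub could produce a file of this minimal shape with Python's standard library. The sketch below is an assumption-laden illustration, not the aclpub build target: build_volume is a hypothetical helper, and it emits only the volume, paper, title and author elements shown in the fragment above.

```python
import xml.etree.ElementTree as ET


def build_volume(volume_id, papers):
    """Build a <volume> element from (paper_id, title, authors) tuples."""
    volume = ET.Element("volume", id=volume_id)
    for paper_id, title, authors in papers:
        paper = ET.SubElement(volume, "paper", id=paper_id)
        ET.SubElement(paper, "title").text = title
        # Front matter has no authors, so this loop may add nothing.
        for name in authors:
            ET.SubElement(paper, "author").text = name
    return volume


def volume_to_xml(volume):
    """Serialise the element tree to a text string."""
    return ET.tostring(volume, encoding="unicode")
```

A call such as build_volume("P03", [("1003", "A Noisy-Channel Approach to Question Answering", ["Abdessamad Echihabi", "Daniel Marcu"])]) yields a tree matching the corresponding paper entry in the fragment. ElementTree escapes characters such as & in titles automatically.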

Last updated: Wed Jul 15 10:04:23 SGT 2009
