ACL ANTHOLOGY Report, July 2007
Steven Bird

The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all. It includes the Computational Linguistics journal and the proceedings of many conferences and workshops, including ACL, EACL, NAACL, ANLP, TINLAP, COLING, HLT, MUC, and Tipster. Conference proceedings are published in the Anthology around the same time as the conference. CL articles are published in the Anthology one year in arrears (but individual subscribers can access recent issues electronically via the MIT Press website).

The Anthology now contains over 12,500 papers (up from 11,000 papers twelve months ago), along with full-text search. Most of the papers are also indexed by Citeseer and Google Scholar, which helps the citation counts of ACL authors. For example, a Google Scholar search restricted to the Anthology site (http://scholar.google.com/scholar?q=site%3Aacl.ldc.upenn.edu) reported nearly 8,000 results. The ACM Digital Library is creating rich metadata and doing full citation linking for all Anthology materials.

ADDITIONS OVER LAST 12 MONTHS: Proceedings from HLT-NAACL-07, ACL-06, EACL-06, HLT-NAACL-06, ACL-04, COLING-04, HLT-NAACL-04; CL Vol 31 (2005).

CONVERSION OF LEGACY MATERIALS: The 2004 conferences all used idiosyncratic directory layouts, filenames and HTML formats, and were converted manually by a student assistant at Melbourne University.

MAILING LIST: A new mailing list has been set up for announcements concerning new materials added to the Anthology: http://groups.google.com/group/acl-anthology

FUTURE MATERIALS: The ACL publication software generates conference CD-ROMs using the same directory layout and file-naming conventions as the Anthology, streamlining the online publication process. BibTeX files are automatically generated and made available to users. The journal and any SIG workshops not held in conjunction with an ACL meeting will continue to require manual processing.
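
This report does not describe how the publication software produces its BibTeX files. Purely as an illustration of the idea mentioned under FUTURE MATERIALS, the sketch below formats one entry from a paper's metadata. The function name, its parameters, the choice of fields, and the sample identifier "X00-0000" are all assumptions made for this example, not the actual ACL software or a real Anthology record.

```python
import re


def bibtex_entry(paper_id, title, authors, booktitle, year, pages=None):
    """Format one @InProceedings entry as a string (hypothetical helper)."""
    fields = [
        ("author", " and ".join(authors)),  # BibTeX separates authors with "and"
        ("title", title),
        ("booktitle", booktitle),
        ("year", str(year)),
    ]
    if pages:
        # Normalize any run of hyphens to the double hyphen BibTeX expects.
        fields.append(("pages", re.sub(r"-+", "--", pages)))
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@InProceedings{{{paper_id},\n{body}\n}}\n"


if __name__ == "__main__":
    # Placeholder metadata, invented for the example.
    print(bibtex_entry(
        "X00-0000",
        "An Example Paper",
        ["Ann Author", "Bob Writer"],
        "Proceedings of an Example Workshop",
        2007,
        pages="1-8",
    ))
```

Using the paper's identifier as the citation key would keep keys unique across a collection, which is one plausible reason to generate these files centrally rather than leaving them to individual authors.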

DIGITAL OBJECT IDENTIFIERS: DOIs are akin to ISBNs, but apply to individual papers. They are now the standard way to uniquely identify an academic paper, and web services will be available for resolving DOIs to papers (e.g. http://dx.doi.org/). MIT Press assigns DOIs to CL articles, and the ACM is in the process of assigning DOIs to each Anthology item. Before the ACM could do this, we had to join CrossRef and get a "DOI Prefix" (our prefix is 10.3115). The nominal cost for DOI assignment is $1 per article; the ACM will cover the cost for past materials, while the ACL will cover the cost of DOI assignment for Anthology materials from 2006 onwards.

IJCNLP: From 2008 onwards, IJCNLP proceedings will appear in the Anthology.

ONGOING ACTIVITIES

HIGHER-QUALITY BIBLIOGRAPHIC METADATA: The ACM Digital Library is creating high-quality bibliographic metadata for each individual paper, in conjunction with registering each paper with a DOI. It should be possible to extract that metadata and use it to improve the quality of metadata on the Anthology site (e.g. removing OCR errors in the spelling of author names and paper titles).

PUBLICATION INSTRUCTIONS: The instructions for the publication software need to be updated to cover two further tasks: (i) obtaining the workshop identifiers from the Anthology editor, and (ii) uploading the materials to the Anthology by FTP. Conferences and workshops not held in conjunction with a regular ACL meeting are not automatically included in the Anthology. Organizers of such events should consider using the ACL publication software and contacting the Anthology editor to ensure timely incorporation of the proceedings in the Anthology.

TIMING: Conference and workshop organizers have a variety of opinions about exactly when proceedings should appear in the Anthology (e.g. before, during, or after the event). I recommend that the ACL Executive establish a standard practice here.

Journal papers appear 1-2 years after they are published, as decided by the Executive when the Anthology was founded. However, the ACM Digital Library publishes them much sooner (within a few months of publication?), with free access. I presume this practice has not hurt subscriptions, in which case I propose that the Anthology do the same.

IJCNLP: The 2005 conference proceedings were not included in the Anthology because they were published by Springer. Springer agreed to make the proceedings freely available online after one year, but this has not yet happened. Would the IJCNLP-05 organizers be able to supply the materials?

ACM DL: Our ACM Digital Library contact, Bernard Rous, has asked to receive CD-ROMs of ACL conferences as they are published, so that he can initiate the process of assigning DOIs. His address is: Bernard Rous, Electronic Publishing Program Director, ACM, 2 Penn Plaza Suite 701, New York NY 10121-0701

TEXT EXTRACTION: There is an initiative to extract plain text from the ACL Anthology materials, involving Dragomir Radev, Min-Yen Kan and others. Most of the Anthology has been converted (see http://wing.comp.nus.edu.sg/~min/dAnth/acl/). This will facilitate the application of NLP techniques to our own publications.

TOPICAL INDEXING: The existence of persistent URLs makes it easy for individuals and special interest groups to set up annotated bibliographies with pointers to papers in the Anthology. Moreover, the community's own text categorization techniques ought to be applied to its own text collection. The Anthology site should link to any well-curated, comprehensive categorizations of its content, so that members of the CL community can benefit from them. The new ACL Wiki would be a convenient place for members to maintain topical indexes of ACL papers.

LONG-TERM MAINTENANCE: The Anthology "project" is almost concluded: (a) materials from the ACL's hardcopy and microfiche eras are now all digitized; (b) born-digital materials published in ad hoc formats have been manually converted; and (c) the ACL publication software supports publication via the Anthology. The final steps are to (d) update index pages with high-quality bibliographic metadata provided by the ACM; (e) host the primary copy of proceedings on the ACL website; and (f) assign well-documented tasks to a website manager and to the publications chair of ACL conferences, under the oversight of the ACL Secretary or designee. Once these steps have been carried out, the Anthology will be fully incorporated into the ACL's operation.