2008Q3 Reports: Anthology
ACL ANTHOLOGY Report, May 2008 Min-Yen Kan
The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all. It includes the Computational Linguistics journal, and proceedings of many conferences and workshops including: ACL, EACL, NAACL, ANLP, TINLAP, COLING, HLT, MUC, and Tipster. Conference proceedings are published in the anthology around the same time as the conference. CL articles are published in the anthology roughly one year in arrears (but individual subscribers can access recent issues electronically via the MIT Press website).
The anthology now contains over 13,600 papers (up from 12,500 papers twelve months ago) and offers full-text search (provided by Google's Custom Search API). Most of the papers are also indexed by Citeseer and Google Scholar, helping the citation counts of ACL authors. For example, the following Google Scholar search reported nearly 8,000 results: http://scholar.google.com/scholar?q=site%3Aacl.ldc.upenn.edu. The ACM Digital Library is creating rich metadata and doing full citation linking for all anthology materials.
CHANGES IN EDITORS: Steven Bird stepped down after six years of service, during which he started the initial anthology project and tracked down, acquired, converted and corrected almost all of the ACL's earlier publications, ingesting them into the Anthology. Min-Yen Kan was appointed to take over Steven's role and assumed editorship in January 2008.
ADDITIONS OVER LAST 12 MONTHS: I have been ingesting materials that were missing from the Anthology, including EMNLP 01, EMNLP 04, EACL 03 Workshops, MUC 98, SIGdial '03 and SIGdial '04. With these additions we have an almost complete archive of related ACL materials up to the present. I have also linked to the MT Archives, which house back issues of Mechanical Translation and Computational Linguistics. Together, these give us a digital archive of CL-related materials from 1954-2006. Constant, smaller additions of current materials are likely to be the focus now. Towards this goal, I have also updated the Anthology with recent materials from CL Vol 32 ('06) and the Euro Workshop on NLG 07.
MAILING LIST: Membership of the Anthology mailing list (http://groups.google.com/group/acl-anthology) has grown to 94 members. This is an announcement-only list.
HOSTING: The Anthology is now hosted on ACL's own website. The LDC website is no longer authoritative. A web redirect has been set up to re-route traffic appropriately.
FACELIFT: The Anthology's HTML code was revised in February, after a pilot with the mailing list group's members in January. The facelift simplified the HTML code for easier maintenance and factored stylistic rendering out of the HTML code into an Anthology-wide stylesheet.
SIG PAGES: Each SIG now contributes its own Anthology page. These are to be maintained by each SIG exec committee through a configuration file (in YAML format). SIGs will send updates to these configuration files to me for editing and insertion into the live website. Note that this is an interim measure; see ONGOING ACTIVITIES.
FUTURE MATERIALS: Aside from regular ACL meetings, IJCNLP 05 is currently scheduled to be ingested in December, when Springer's window of exclusive copyright expires. IJCNLP 08 is also in the process of being ingested.
DIGITAL OBJECT IDENTIFIERS: DOIs are akin to ISBNs, but apply to individual papers. They are now the standard way to uniquely identify an academic paper, and web services will be available for resolving DOIs to papers (e.g. http://dx.doi.org/). ACM helps us in assigning DOIs to published ACL materials. I am working with them to make DOI assignment more timely, and am investigating whether we can have DOIs assigned to papers as they are published (so that each paper's copyright notice may be able to print its own DOI).
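As a rough illustration of how a DOI maps onto a resolver address, the following minimal Python sketch builds a URL for the resolver mentioned above. The function name doi_to_url and the sample DOI 10.1000/182 are assumptions chosen for illustration; they are not part of the Anthology's or ACM's tooling.

```python
from urllib.parse import quote

# Resolver base URL taken from the example in the report text.
DOI_RESOLVER = "http://dx.doi.org/"


def doi_to_url(doi: str) -> str:
    """Return a resolver URL for a DOI string (hypothetical helper)."""
    doi = doi.strip()
    # Accept the common "doi:" prefix and drop it.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # A DOI is a prefix starting with "10." and a suffix,
    # separated by the first "/".
    prefix, sep, suffix = doi.partition("/")
    if not prefix.startswith("10.") or not sep or not suffix:
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path,
    # keeping "/" because DOI suffixes may contain it.
    return DOI_RESOLVER + quote(doi, safe="/")


if __name__ == "__main__":
    print(doi_to_url("10.1000/182"))
```

Printing the resolved URL alongside a paper's copyright notice would then only require the DOI string itself, which is the scenario the paragraph above describes.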
PUBLICATION INSTRUCTIONS: I am proactively attempting to contact each ACL event's publication chair to ensure that they know the process to have their proceedings ingested into the Anthology. In this way we can try to minimize the lag between publication and appearance in the Anthology (when ACL is the sole publisher).
HIGHER-QUALITY BIBLIOGRAPHIC METADATA: The ACM Digital Library is creating high-quality bibliographic metadata for each individual paper, in conjunction with registering each paper with a DOI. It should be possible to extract that metadata and improve the quality of metadata on the Anthology site (e.g., removing OCR errors in the spelling of author and paper names). I will also be manually verifying and editing records in the Anthology on a regular and systematic basis.
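One way such a metadata clean-up could be approached is sketched below in Python: an Anthology field is compared against a trusted reference value, and the reference is proposed as a correction only when the two strings are close enough to look like the same name with OCR noise. This is an illustrative sketch, not the actual process; the function names, the 0.8 threshold and the sample strings are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest_correction(anthology_value: str, reference_value: str,
                       threshold: float = 0.8):
    """Return the reference value if it looks like a corrected form of
    the Anthology value; otherwise return None (hypothetical helper)."""
    if anthology_value == reference_value:
        return None  # already identical, nothing to fix
    if similarity(anthology_value, reference_value) >= threshold:
        return reference_value  # near match: likely an OCR error
    return None  # too different: leave for manual review


if __name__ == "__main__":
    # "Mln" stands in for a typical OCR confusion of "i" and "l".
    print(suggest_correction("Mln-Yen Kan", "Min-Yen Kan"))
```

Records that fall below the threshold are deliberately left alone, so they would still go through the manual verification described above.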
WIKIFIED EDITING: I plan to bring the metadata of the Anthology into a Wiki form that allows the general public to edit it easily. I plan to start with a pilot set of data and users (the SIG pages) and to expand the program if it is successful and poses limited security problems. I plan to roll this out on a trial basis in late 2008.
INTEGRATION WITH OTHER GRASSROOTS PROJECTS: A number of grassroots projects proposed at ACL 2007 center on the Anthology. I plan to organize and incorporate as much user-contributed data as is feasible. These projects include the Anthology Network, Video Anthology and Linked Anthology proposals. Thus far, part of the Anthology Network and the raw text extraction have dovetailed nicely on a standardized subset of the Anthology.