2018Q3 Reports: ACL Anthology


The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all. As of 2016, we employ a Creative Commons Attribution license for materials published by ACL. This makes our content usable by the general public with attribution to the ACL (although it is not mandatory for any user to inform us of their use of our materials). Dual licensing for a fee is presumably possible (although not exercised currently).

The Anthology now contains over 44,600 papers (up from 42,100 papers in the last report in Q3). As many may know, there are two versions of the Anthology: a legacy version at http://www.aclweb.org/anthology and a current one at http://aclanthology.info . Due to constraints on the Editor's time, and because we have run both Anthologies in parallel for over 2 years as a transition period, we have stopped maintaining the legacy site. Work still remains to port all of the accepted functionality over to the current site (e.g., handling errata, full text search, redirections from the legacy site to the current one, and archiving access statistics). However, the development of new features is heavily constrained by the Editor's available time: as the popularity of NLP/CL publications continues to grow, so does the Editor's workload for all tasks, routine and exceptional. The current Anthology is physically hosted at Universität des Saarlandes, in a well-resourced virtual machine, but may need to find a new home soon because volunteers are ending their term with the hosting group.

Anthology Steering Committee (ASC) and the Anthology Advisory Board (AAB). The ASC was created at the behest of the ACL Anthology Editor to oversee the Anthology, as the responsibility for the entire CL/NLP literature is too large for a single person to manage. The ASC consists of Jing-Shin Chang as ACL Info Officer, Min-Yen Kan as ACL Anthology Editor and Paola Merlo as CL Editor. Attempts to convene the ASC this past year failed due to several missed meetings and difficulty finding a time that suited all three members. With the pressing need for better management, Marti Hearst, ACL President, commissioned an Anthology Advisory Board (consisting of Steven Bird as an informal adviser, Bonnie Dorr, Min-Yen Kan as ACL Anthology Editor, Dragomir Radev as chair, Stuart Shieber, Mark Steedman and Simone Teufel) to find a replacement for the current Editor by the end of 2018. This Board is currently working on the matter.

Anthology Server Home. The Anthology's current home at Universität des Saarlandes may need to change soon, as the two volunteers helping the Editor to maintain it are leaving. After they leave, there may be no technical staff who can manage the Anthology. There were plans to integrate the Anthology with the VPS shared hosting of aclweb.org, but as of now there is no firm date or plan for the migration. This is an issue that needs to be resolved.

Mailing List. The membership of the Anthology mailing list (http://groups.google.com/group/acl-anthology) has grown to 895 members (up from 763 a year ago). This is an announcement-only list, where we notify members of newly released materials listed online.

Auxiliary Ingestion. The Anthology now has ingestion workflows for software, datasets, general attachments, slides and posters that are hosted with the ACL Anthology. One important problem is keeping certain media updated in the publication workflows for conferences: supplemental materials, videos and posters are ingestible and have processes within the Anthology Editor's workflows, but the corresponding processes seem less well formed for individual conferences. We recommend that the Conference chair establish a clear interface between authors, vendors and the Anthology.

Digital Object Identifiers. We assigned DOIs to all ACL materials in 2015 and have also assigned them to all current material, except TACL. With our current practice of assigning DOIs to all materials, our costs are likely to escalate to at least US$ 3K, as we digitally publish at least that many scholarly articles.

ACL Anthology Reference Corpus version 3 (ACL ARC 3). We have received the LDC's blessing to distribute a new reference corpus. This is a priority in the development of the Anthology as a scientific corpus in its own right, but it needs to be delayed until the current Editor can step down and concentrate on this development.

Work Queue. The current state of ingestion and development of the ACL Anthology is publicly available via a link in the ACL Anthology's footer. New in this report is the incorporation of historical ingestion logs as well. https://docs.google.com/spreadsheets/d/166W-eIJX2rzCACbjpQYOaruJda7bTZrY7MBw_oa7B2E/pubhtml

Accomplishments for this year

  1. Negotiating with Google Scholar to index aclanthology.info. While done, this is a temporary fix until the Anthology can find a home within www.aclweb.org as required by Google Scholar.
  2. End of lifetime support for the legacy Anthology.
  3. Incorporation of ACL 2017 materials with DOIs printed on the actual paper proceedings. Unfortunately, more recent conference chairs have not taken up this idea.
  4. Containerization of the Anthology. We need volunteers to create a network of mirrors to the Anthology.
  5. Publication of a short paper in the NLP-OSS workshop with the volunteer group.
  6. Moving the Anthology (-PDFs) from the under-resourced NUS Virtual Machine to Universität des Saarlandes.
  7. Massive CL/NLP BibTeX file. We recreate a single BibTeX file for all Anthology materials after a period of ingestions.
  8. Incorporation of historical ingestion logs in the public Google Doc.
  9. Refinement of outstanding issues in the GitHub open source codebase.
  10. Migration of the GitHub codebase to the central ACL GitHub account.
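
As an illustration of item 7, the sketch below shows one way a single BibTeX file could be rebuilt from per-volume BibTeX sources. It is not the Anthology's actual script: the function names (`split_entries`, `merge_bibtex`), the assumption that every entry starts with `@` at the beginning of a line, and the rule of keeping the first entry seen for a repeated key are all hypothetical choices made for this example.

```python
import re

# Matches the citation key at the start of an entry, e.g. "@inproceedings{key,".
ENTRY_KEY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def split_entries(bibtex_text):
    """Split BibTeX text into entries, assuming each starts with '@' at line start."""
    parts = re.split(r"(?m)^(?=@)", bibtex_text)
    return [p.strip() for p in parts if p.strip().startswith("@")]


def merge_bibtex(volumes):
    """Merge several BibTeX strings into one, keeping the first entry per key."""
    seen = {}  # dicts preserve insertion order, so output follows input order
    for text in volumes:
        for entry in split_entries(text):
            match = ENTRY_KEY.match(entry)
            if match:
                seen.setdefault(match.group(1), entry)
    return "\n\n".join(seen.values()) + "\n"


if __name__ == "__main__":
    # Placeholder entries, not real Anthology records.
    volume_a = "@inproceedings{a1,\n  title={A}\n}\n"
    volume_b = "@inproceedings{a1,\n  title={A dup}\n}\n@article{b2,\n  title={B}\n}\n"
    print(merge_bibtex([volume_a, volume_b]))
```

Rerunning such a merge after each batch of ingestions would regenerate the combined file from scratch, which matches the "recreate" wording above; an incremental update would be an alternative design.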

Plans, Prioritized

  1. We are capturing abstracts (albeit with some noise) from START, but have yet to successfully integrate this into the ACL Anthology's index.
  2. For long-term preservation, to create an XML representation of all of the metadata used to create the Anthology. This is similar in nature to the XML dump of DBLP or Wikipedia. It allows a clean separation of the underlying data in the Anthology from the code used to present it.
  3. Collaboration with START (which may also involve the Conference Officer's work) to integrate user accounts in their system. This would allow START to have authority records for authors such that new paper submissions might start with correct, canonical forms of author names. The ASC is aware of ORCIDs and other name authority systems that might also be useful in this process.
  4. Collaboration with ELRA to allow the categorization of papers against the LRE Map and ISLRNs.
  5. To allow third-party applications to automatically annotate existing papers with new metadata via an API. This would be a production API, allowing third parties to add auto-analyzed materials to the Anthology (e.g., auto-extracted keywords, summaries). This will raise the visibility of the Anthology as an object of study, complementing work on the ACL ARC.
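
To make plan item 2 more concrete, here is a minimal sketch of what an XML metadata dump could look like when the data is kept separate from presentation code. The element names (`anthology`, `paper`, `title`, `author`, `year`), the record fields and the sample record are all assumptions made for illustration; no schema has been decided.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the ID, title and names are placeholders.
papers = [
    {
        "id": "X00-0001",
        "title": "An Example Paper",
        "authors": ["Ada Example", "Bo Sample"],
        "year": "2000",
    },
]


def build_dump(records):
    """Build an XML tree with one <paper> element per metadata record."""
    root = ET.Element("anthology")
    for rec in records:
        paper = ET.SubElement(root, "paper", id=rec["id"])
        ET.SubElement(paper, "title").text = rec["title"]
        for name in rec["authors"]:
            ET.SubElement(paper, "author").text = name
        ET.SubElement(paper, "year").text = rec["year"]
    return root


def to_xml_string(root):
    """Serialize the tree to a Unicode string suitable for writing to a file."""
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xml_string(build_dump(papers)))
```

Any presentation layer, mirror or third-party tool could then read such a dump without depending on the code that renders the site.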