Author List Clean-up Code

From Admin Wiki
Revision as of 14:07, 21 July 2011 by Ahakim (talk | contribs)


A major challenge in automatically building an anthology from publications is correcting author names, because the same author often appears under different name variants in different publications.

For example, the ACL Anthology contains 5 different versions of the author name "Rosé, Carolyn Penstein", as shown below.

Rose, Carolyn P.
Rosé, Carolyn Penstein
Rosé, Carolyn P.
Penstein Rosé, Carolyn P.
Rosé, Carolyn

To resolve this, we have created a semi-automatically cleaned list of all author names in the ACL Anthology. This "master list" contains 13,692 distinct authors. In addition to the master list, we provide code for the following tasks:

1. Finding the canonical version of an author name in the field of computational linguistics, using a number of heuristics, provided that the name exists in the master list (available as part of the package).

2. Automatically changing the different versions of a name to the suggested canonical name, incorporating any manual corrections made by the user.
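
The two tasks above can be illustrated with a minimal Python sketch. It is not the code shipped in the package, and the package's actual heuristics are not described on this page. The sketch assumes names are written as "Last, First Middle" and uses three illustrative heuristics: accent folding, matching on the final surname token, and treating initials as compatible with full given names. All function names, the tiny master list (including the made-up entry "Doe, Jane"), and the `overrides` dictionary for manual corrections are hypothetical.

```python
import unicodedata

# Tiny stand-in for the master list; "Doe, Jane" is a made-up entry.
MASTER_LIST = ["Rosé, Carolyn Penstein", "Doe, Jane"]


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def parse(name):
    """Split a "Last, First Middle" name into (surname, given-name tokens)."""
    last, _, given = name.partition(",")
    last_tokens = strip_accents(last).lower().split()
    given_tokens = strip_accents(given).lower().replace(".", " ").split()
    # Keep only the final surname token so "Penstein Rosé" lines up with "Rosé".
    surname = last_tokens[-1] if last_tokens else ""
    return surname, given_tokens


def given_compatible(a, b):
    """Compare given names position by position; an initial matches any
    name that starts with it. Extra trailing tokens are ignored."""
    return all(
        x == y
        or (len(x) == 1 and y.startswith(x))
        or (len(y) == 1 and x.startswith(y))
        for x, y in zip(a, b)
    )


def find_canonical(name, master_list, overrides=None):
    """Task 1: return the canonical name, or None if there is no unique match."""
    if overrides and name in overrides:
        return overrides[name]  # manual corrections take precedence
    surname, given = parse(name)
    if not surname or not given:
        return None  # too little information to match safely
    candidates = []
    for entry in master_list:
        entry_surname, entry_given = parse(entry)
        if entry_surname == surname and given_compatible(given, entry_given):
            candidates.append(entry)
    return candidates[0] if len(candidates) == 1 else None


def clean_author_list(names, master_list, overrides=None):
    """Task 2: replace each name with its canonical form when one is found."""
    return [find_canonical(n, master_list, overrides) or n for n in names]


if __name__ == "__main__":
    variants = ["Rose, Carolyn P.", "Penstein Rosé, Carolyn P.", "Rosé, Carolyn"]
    manual = {"C. P. Rose": "Rosé, Carolyn Penstein"}  # hypothetical user fix
    print(clean_author_list(variants + ["C. P. Rose"], MASTER_LIST, manual))
```

In this sketch a name is replaced only when exactly one master-list entry matches, so ambiguous or unknown names are left unchanged for manual review. This mirrors the semi-automatic workflow described above, but the specific matching rules are assumptions.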