CL Scholar: The ACL Anthology Knowledge Graph Miner

We present CL Scholar, the ACL Anthology knowledge graph miner, to facilitate high-quality search and exploration of current research progress in the computational linguistics community. In contrast to previous works, periodic crawling, indexing and processing of new incoming articles are completely automated in the current system. CL Scholar utilizes both textual and network information for knowledge graph construction. As an additional novel initiative, CL Scholar supports more than 1200 scholarly natural language queries along with standard keyword-based search on the constructed knowledge graph. It answers binary, statistical and list based natural language queries. The current system is deployed at http://cnerg.iitkgp.ac.in/aclakg. We also provide REST API support along with a bulk download facility. Our code and data are available at https://github.com/CLScholar.


Introduction
ACL Anthology is one of the popular initiatives of the Association for Computational Linguistics (ACL) to curate all publications related to computational linguistics and natural language processing in one common place. At present, it hosts more than 44,000 papers and is actively updated and maintained by Min-Yen Kan. Since its inception, ACL Anthology has functioned as a repository collecting papers from ACL and related organizations in computational linguistics. However, it does not provide any additional statistics about authors, papers, venues, and topics. It also lacks advanced search features such as article ranking by popularity or relevance, natural language query support, author profiles, topical search, etc.

Previous systems built on ACL anthology
Owing to the above limitations, ACL anthology remained an archival repository for quite a long time. Bird et al. (2008) developed the ACL Anthology Reference Corpus (ACL ARC) as a collaborative attempt to provide a standardized testbed reference corpus based on the ACL Anthology. Later, Radev et al. (2009) invested substantial manual effort to construct the ACL Anthology Network Corpus (AAN). AAN consists of a manually curated database of citations, collaborations, and summaries, together with statistics about the network. They utilized two processing tools, PDFBox and ParsCit (Councill et al., 2008), for curation. AAN was continuously updated till 2013 (Radev et al., 2013). Recently, this project has been moved to Yale University as part of the new LILY group.

The computational linguistic knowledge graph
As a similar initiative, in this paper, we demonstrate the development of CL Scholar, which automatically mines ACL anthology and constructs a computational linguistic knowledge graph (hereafter 'CLKG'). The current framework automatically crawls new articles, processes and indexes them, constructs the knowledge graph and generates searchable statistics without involving tedious manual annotations. We leverage the state-of-the-art scientific article processing tool OCR++ (Singh et al., 2016) for robust and automatic information extraction from scientific articles. OCR++ is an open-source framework that can extract the metadata, the structure and the bibliography from scholarly articles.
The constructed CLKG is modeled as a heterogeneous graph (Sun et al., 2009) consisting of four entities: author, paper, venue, and field. We utilize metapaths (Sun and Han, 2012) to implement the query retrieval framework.

Natural language queries
In a first-of-its-kind initiative, we extend the functionalities of CL Scholar to answer natural language queries (hereafter 'NLQ') along with standard keyword-based queries. Currently, it answers binary, statistical and list based NLQ. Overall, we handle more than 1200 variations of NLQ.
Outline: The rest of the paper is organized as follows. Section 2 describes the ACL Anthology dataset. Section 3 details the step by step extraction procedure for CLKG construction. In section 4, we describe CLKG. We describe our framework in section 5. We conclude in section 6 and identify future work.

Dataset
CL Scholar uses metadata and full-text PDF research articles crawled from ACL Anthology. ACL Anthology consists of more than 40,000 research articles published in more than 33 computational linguistic events (venues) including conferences, workshops, and journals. Table 2 presents general statistics of the crawled dataset.
We crawl both metadata information (unique article identifier, article title, authors' names, and venue) as well as full-text PDF articles. Next, we describe in detail several pre-processing steps and knowledge graph construction methodology.

Pre-processing and knowledge graph construction
We process full-text PDFs using the state-of-the-art extraction tool OCR++ (Singh et al., 2016). We extract references, citation contexts, author affiliations and URLs from the full text. OCR++ also provides a mapping from references to citation contexts. Raw information that appears in several variations, such as author names, venue names and affiliations, is assigned unique identifiers using standard indexing approaches. We only consider those reference papers that are present in ACL anthology. This rich textual information, together with the citation relationships, is utilized in the construction of CLKG. Figure 1 presents the CLKG construction from metadata and full-text PDF files crawled from ACL anthology.
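The identifier assignment and reference filtering steps can be illustrated with a minimal Python sketch. The paper does not specify which indexing approach is used, so the normalization rule below (lowercasing and dropping punctuation) and the names `normalize`, `Indexer` and `filter_references` are assumptions made purely for illustration.

```python
import re


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    This normalization rule is an assumption, not the paper's method.
    """
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


class Indexer:
    """Maps every surface variant of a name to one integer identifier."""

    def __init__(self):
        self._ids = {}

    def get_id(self, name: str) -> int:
        key = normalize(name)
        if key not in self._ids:
            # Identifiers are handed out in order of first appearance.
            self._ids[key] = len(self._ids)
        return self._ids[key]


def filter_references(reference_titles, anthology_titles):
    """Keep only the references whose normalized title is in the Anthology."""
    known = {normalize(title) for title in anthology_titles}
    return [ref for ref in reference_titles if normalize(ref) in known]
```

Under this sketch, variants such as "Proc. of ACL" and "proc of acl" receive the same identifier. Real name variations (abbreviations, initials, reordered names) would need a stronger matching rule than the one assumed here.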

Computational linguistic knowledge graph
The computational linguistic knowledge graph (CLKG) is a heterogeneous graph (Sun et al., 2009) consisting of four entities as nodes: author (A), paper (P), venue (V) and field (F).
Each entity is associated with a few properties. For example, the properties of P are publication year, title, abstract, etc. Similarly, the properties of A are name, publication trend, affiliation, etc. We utilize metapaths (Sun and Han, 2012) between entities to express semantic relations. For example, simple metapaths like A→P and V→P represent "author of" and "published at" relations respectively, whereas complex metapaths like V→A→P and F→A→P represent "authors of papers published at" and "authors of papers in" relations respectively. We leverage metapaths to develop CL Scholar (described in the next section).
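As an illustration of how typed nodes and metapaths fit together, the following Python sketch stores edges per pair of entity types and follows a metapath one hop at a time. The class `HeteroGraph`, its methods, the node identifiers and the composite path used in the usage note are hypothetical; this is not the CL Scholar implementation, which the paper describes only at a high level.

```python
from collections import defaultdict


class HeteroGraph:
    """A small typed graph: edges are grouped by (source type, target type)."""

    def __init__(self):
        # edges[(src_type, dst_type)][src_node] -> set of dst nodes
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_edge(self, src_type, src, dst_type, dst):
        # Store both directions so a metapath can be walked either way.
        self.edges[(src_type, dst_type)][src].add(dst)
        self.edges[(dst_type, src_type)][dst].add(src)

    def follow(self, metapath, start):
        """Return the nodes reached from `start` along a list of entity types."""
        frontier = {start}
        for src_type, dst_type in zip(metapath, metapath[1:]):
            step = self.edges[(src_type, dst_type)]
            frontier = {nxt for node in frontier for nxt in step.get(node, ())}
        return frontier
```

With edges A→P and V→P loaded, `follow(["A", "P"], author)` returns the papers of an author, and a composite path such as `["A", "P", "V"]` returns the venues in which that author has published.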

CL Scholar
CL Scholar fetches information from CLKG according to the input query from the user. The current framework is divided into two modules: 1) natural language based query retrieval, and 2) entity specific query retrieval. Figure 3 shows the CL Scholar framework.

Natural language query retrieval
The first module answers natural language queries (NLQ). It consists of two sub-modules: 1) the query classifier, and 2) the NL query processor. The query classifier assigns the user input to one of the three basic types of NLQ using regular expression patterns. The NL query processor then processes the query according to the type determined by the query classifier. Given an input natural language query, we utilize longest subsequence match to identify entity instances. The three types of NLQ are:
1. Binary queries: These represent a set of queries for which the user demands a 'yes' or 'no' type answer. Table 4 lists a few interesting binary queries.
2. Statistical queries: These represent a set of queries for which the knowledge base returns some statistics. Currently, we support three types of statistics: 1) temporal, 2) cumulative, and 3) comparison. Temporal represents year-wise statistics, cumulative represents overall statistics and comparison represents comparative statistics between two or more instances of the same entity type. Table 4 lists a few representative statistical queries.
3. List queries: These represent a set of queries for which the knowledge base returns a list of papers, authors or venues. Table 4 also enumerates a few representative list queries.
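The two sub-modules can be sketched in Python as follows. The regular expressions are hypothetical stand-ins, since the actual patterns are not listed in the paper, and the entity matcher interprets "longest subsequence match" as the longest contiguous run of query tokens that equals a known entity name, which is an assumption.

```python
import re

# Hypothetical patterns; the first one that matches decides the query type.
PATTERNS = [
    ("binary", re.compile(r"^(is|are|was|were|does|do|did|has|have)\b", re.I)),
    ("statistical", re.compile(r"^(how many|number of|count of)\b", re.I)),
    ("list", re.compile(r"^(list|show|which|what|who)\b", re.I)),
]


def classify(query):
    """Return 'binary', 'statistical', 'list', or None if nothing matches."""
    text = query.strip()
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return None


def _tokens(text):
    return re.findall(r"\w+", text.lower())


def find_entity(query, entity_names):
    """Return the known entity matching the longest token span of the query."""
    known = {" ".join(_tokens(name)): name for name in entity_names}
    tokens = _tokens(query)
    # Try the longest spans first so that longer names win over shorter ones.
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            span = " ".join(tokens[start:start + length])
            if span in known:
                return known[span]
    return None
```

Trying longer spans first means that a query mentioning "Computational Linguistics" resolves to that venue rather than to a shorter entity that happens to be named "Linguistics".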

Entity specific query retrieval
CL Scholar also supports entity specific retrieval. As described in section 4, CLKG consists of four entities: paper, author, venue, and field. Currently, our system supports three entity specific retrieval schemes handled by three sub-modules:
1. Paper specific: This sub-module returns paper specific information. Currently, we retrieve and display author names and affiliations, abstract, publication year and venue, cumulative and year-wise citations, the list of references, citing papers and co-cited papers present in ACL anthology, and the list of URLs present in the paper text. We also show the average sentiment score received by the queried paper, computed from its incoming citation contexts. Table 5 shows three representative paper specific queries.
2. Author specific: This sub-module handles author specific queries. Given an author name, the system shows the author's cumulative and year-wise publication and citation counts, collaborator list with the average number of collaborations, current and temporal H-index, and temporal topic distribution. We also list the author's publications in ACL anthology. Table 5 lists three author specific queries with first name, last name and full name respectively.
3. Venue specific: We also answer venue specific queries. For each venue specific query, the system shows cumulative and year-wise  Table 5 shows three representative venue specific queries.
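The H-index reported for authors follows the standard definition: the largest h such that h of the author's papers have at least h citations each. The Python sketch below computes it, together with one plausible reading of the temporal H-index as the H-index obtained from citations received up to a given year. That reading, the input layout and the function names are assumptions, as the paper does not define them.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def temporal_h_index(citations_by_paper, years):
    """H-index per year, counting only citations received up to that year.

    citations_by_paper maps a paper id to {year: citations received that year}.
    """
    result = {}
    for year in years:
        totals = [
            sum(count for y, count in per_year.items() if y <= year)
            for per_year in citations_by_paper.values()
        ]
        result[year] = h_index(totals)
    return result
```

For example, citation counts of 10, 8, 5, 4 and 3 give an H-index of 4, since four papers have at least four citations each but not five papers with at least five.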

Additional insights
We provide two additional insights by analyzing incoming citation contexts. First, we present a summary generated from the incoming citation contexts (Qazvinian and Radev, 2008). Currently, we show five summary sentences for each paper. Second, we compute the sentiment score of each citation context by leveraging a standard sentiment analyzer (Athar and Teufel, 2012). We then aggregate by averaging the sentiment scores of all the incoming citation contexts.
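The aggregation step amounts to an arithmetic mean over per-context scores. A minimal Python sketch, assuming each citation context has already been scored by the sentiment analyzer and that a paper without incoming contexts has no score (both the function name and that convention are assumptions):

```python
from statistics import mean


def paper_sentiment(context_scores):
    """Average the per-context sentiment scores; None if there are no contexts."""
    return mean(context_scores) if context_scores else None
```

The score range depends on the analyzer in use and is not fixed by this sketch.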

Ranking
Currently, we employ popularity based ranking of retrieved results. We utilize current citation count as a measure of popularity. In future, we plan to deploy other ranking schemes like recency, impact, sentiment, relevance, etc.
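Popularity based ranking can be sketched in Python as a sort on citation count. The function name, the input layout and the treatment of papers without a recorded count (treated as zero citations) are assumptions for illustration.

```python
def rank_by_popularity(papers, citation_counts):
    """Order papers by current citation count, most cited first.

    Papers missing from citation_counts are treated as having zero citations.
    Python's sort is stable, so tied papers keep their input order.
    """
    return sorted(papers, key=lambda p: citation_counts.get(p, 0), reverse=True)
```

Other ranking schemes could reuse the same shape by swapping the key function.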

Deployment
CL Scholar is developed using the ReactJS framework. The system also supports REST API requests, which are powered by a NodeJS server with data served from MongoDB. It is currently accessible at our research group page. More information about API usage is available at the API support page. In addition, the entire knowledge graph can be downloaded in a plain text format. Figure 6 shows a snapshot of the CL Scholar landing page.
The current system is still under development. Currently, we assume that spellings are correct for NLQ. We do not support instant query search. We also do not support query recommendations.

Conclusion
In this paper, we propose a fully automatic approach for the development of a computational linguistic knowledge graph from full-text PDF articles available in ACL Anthology. We also develop a first-of-its-kind academic natural language query retrieval system. Currently, our system can answer three different types of natural language queries. In the future, we plan to extend the query set. We also plan to add structural information to the knowledge graph, such as section labeling of citations, figure and table captions, etc. Finally, we plan to conduct an extensive evaluation to compare CL Scholar with state-of-the-art systems.