NLP Scholar: An Interactive Visual Explorer for Natural Language Processing Literature

As part of the NLP Scholar project, we created a single unified dataset of NLP papers and their meta-information (including citation numbers), by extracting and aligning information from the ACL Anthology and Google Scholar. In this paper, we describe several interconnected interactive visualizations (dashboards) that present various aspects of the data. Clicking on an item within a visualization or entering query terms in the search boxes filters the data in all visualizations in the dashboard. This allows users to search for papers in the area of their interest, published within specific time periods, published by specified authors, etc. The interactive visualizations presented here, and the associated dataset of papers mapped to citations, have additional uses as well, including understanding how the field is growing (both overall and across sub-areas) and quantifying the impact of different types of papers on subsequent publications.


Introduction
NLP is a broad interdisciplinary field that draws knowledge from Computer Science, Linguistics, Information Science, Psychology, Social Sciences, and more. Over the years, scientific publications in NLP have grown in number and diversity; we now see papers published on a vast array of research questions and applications in a growing list of venues: in journals such as CL and TACL, in large conferences such as ACL and EMNLP, as well as in a number of small area-focused workshops.
The ACL Anthology (AA) is a digital repository of public domain, free to access, articles on NLP. It includes papers published in the family of ACL conferences as well as in other NLP conferences such as LREC and RANLP. As of June 2019, it provided access to the full text and metadata for close to 50K articles published since 1965. It is the largest single source of scientific literature on NLP. However, the meta-data does not include citation statistics.
Citation statistics are the most commonly used metrics of research impact. They include: number of citations, average citations, h-index, relative citation ratio, and impact factor. Note, however, that the number of citations is not always a reflection of the quality or importance of a piece of work. Furthermore, the citation process can be abused, for example, by egregious self-citations (Ioannidis et al., 2019). Nonetheless, given the immense volume of scientific literature, the relative ease with which one can track citations using services such as Google Scholar (GS), and given the lack of other easily applicable and effective metrics, citation analysis is an imperfect but useful window into research impact.
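To make the simpler metrics listed above concrete, the following minimal sketch computes the number of citations, the average citations, and the h-index for a list of per-paper citation counts. The counts are made up for illustration and the function name is hypothetical; relative citation ratio and impact factor need additional field-level or journal-level data and are not covered here.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Made-up citation counts for five papers (illustration only).
counts = [10, 8, 5, 4, 3]

total_citations = sum(counts)
average_citations = total_citations / len(counts)

print(total_citations, average_citations, h_index(counts))
```

Here four papers have at least four citations each, but there are not five papers with at least five, so the h-index is 4.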
Google Scholar is a free web search engine for academic literature. Through it, users can access the metadata associated with an article such as the number of citations it has received. Google Scholar does not provide information on how many articles are included in its database. However, scientometric researchers estimated that it included about 389 million documents in January 2018 (Gusenbauer, 2019), making it the world's largest source of academic information. Thus, it is not surprising that there is growing interest in the use of Google Scholar information to draw inferences about scholarly research in general (Martín-Martín et al., 2018; Mingers and Leydesdorff, 2015; Orduña-Malea et al., 2014; Khabsa and Giles, 2014; Howland et al., 2009) and on scholarly impact in particular (Bos and Nitza, 2019; Ioannidis et al., 2019; Ravenscroft et al., 2017; Bulaitis, 2017; Yogatama et al., 2011; Priem and Hemminger, 2010).
Services such as Google Scholar and Semantic Scholar cover a wide variety of academic disciplines. While there are benefits to this, the lack of focus on NLP literature has some drawbacks as well: e.g., the potential for too many search results that include many irrelevant papers. For example, if one is interested in NLP papers on emotion and privacy, searching for them on Google Scholar is less efficient than searching for them on a platform dedicated to NLP papers. Further, services such as Google Scholar provide minimal interactive visualizations. NLP Scholar, with its focus on AA data, is not meant to replace these tools, but to act as a complementary tool for dedicated visual search of NLP literature.
ACL 2020 has a special theme asking researchers to reflect on the state of NLP. In the spirit of that theme, and as part of a broader project on analyzing NLP literature, we extracted and aligned information from the ACL Anthology (AA) and Google Scholar to create a dataset of tens of thousands of NLP papers and their citations (Mohammad, 2020c, 2019). In separate work, we have used the data to explore questions such as: how well cited are papers of different types (journal articles, conference papers, demo papers, etc.)? how well cited are papers published in different time spans? how well cited are papers from different areas of research within NLP? etc. (Mohammad, 2020a). We also explored gender gaps in Natural Language Processing research, in terms of authorship and citations (Mohammad, 2020b). In this paper we describe how we built an interactive visual explorer for this unified data, which we refer to as NLP Scholar. Some notable uses of NLP Scholar are listed below:
• Search for relevant related work in various areas within NLP.
• Identify the highly cited articles on an interactive timeline.
• Identify past papers published in a venue of interest (such as ACL or LREC).
• Identify papers from the past (say ten years back) published in a venue of interest (say ACL or LREC) that have made substantial impact through citations.
• Examine changes in number of articles and number of citations in a chosen area of interest over time.
• Identify citation impact of different types of papers, e.g., short papers, shared task papers, demo papers, etc.
Even beyond the dedicated interactive visualizer described here, the underlying data with its alignment between AA and GS has potential uses in:
• Creating a web browser extension that allows users of GS to look up the aligned AA information (the full ACL BibTeX, poster, slides, access to proceedings from the same venue, etc.).
• Similarly, in the reverse direction, allowing access from AA to the GS information on the aligned paper. This could include number of citations, lists of papers citing the paper, etc.
Perhaps most importantly, though, NLP Scholar serves as a visual record of the state of NLP literature in terms of citations. We note again, though, that even though this work seeks to make citation metrics more accessible for ACL Anthology papers, citation metrics are not always accurate reflections of the quality, importance, or impact of individual papers.
All of the data and interactive visualizations associated with this work are freely available through the project homepage.

Background and Related Work
Much of the work in visualizing scientific literature has focused on showing topics of research (Wu et al., 2019; Heimerl et al., 2012; Lee et al., 2005). There is also notable work on visualizing communities through citation networks (Heimerl et al., 2015; Radev et al., 2016).
However, none of these works provides an interactive visualization for users to explore NLP literature and its citations.

Data
We now briefly describe how we extracted information from the ACL Anthology and Google Scholar.
(Further details about the dataset, as well as an analysis of the volume of research in NLP over the years, are available in Mohammad (2020c).)

ACL Anthology Data
The ACL Anthology provides access to its data through its website and a GitHub repository (Gildea et al., 2018). We extracted paper title, names of authors, year of publication, and venue of publication from the repository. As of June 2019, AA had ∼50K entries; however, this includes forewords, schedules, etc. that are not truly research publications. After discarding them we are left with a set of 44,895 papers.

Google Scholar Data
Google Scholar does not provide an API to extract information about the papers. This is likely because of its agreement with publishing companies that have scientific literature behind paywalls (Martín-Martín et al., 2018). We extracted citation information from Google Scholar profiles of authors who published at least three papers in the ACL Anthology. (This is explicitly allowed by GS's robots exclusion standard. This is also how past work has studied Google Scholar (Khabsa and Giles, 2014; Orduña-Malea et al., 2014; Martín-Martín et al., 2018).) This yielded citation information for 1.1 million papers in total. We will refer to this dataset as GS-NLP. Note that GS-NLP includes citation counts not just for NLP papers, but also for non-NLP papers published by the authors.
GS-NLP includes 32,985 of the 44,895 papers in AA (about 74%). We will refer to this subset of the

Building an Interactive Visualization to Explore Scientific Literature
We now describe how we created an interactive visualization, NLP Scholar, that allows one to visually explore the data from the ACL Anthology along with citation information from Google Scholar. We first created a relational database (involving multiple tables) that stores the AA and GS data (§4.1). We then loaded the database in Tableau, an interactive data visualization software, to build the visualizations (§4.2).

NLP Scholar Relational Database
Data from AA and GS is stored in four tables (tsv files): papers, authors, title-unigrams, and title-bigrams. They contain the following information:
papers: Each row corresponds to a unique paper. The columns include: paper title, year of publication, list of authors, venue of publication, number of citations at the time of data collection (June 2019), NLP Scholar paper id, ACL paper id, and some other meta-data associated with the paper. The NLP Scholar paper id is a concatenation of the paper title, year of publication, and first author last name. (This id was also used to align entries across AA and GS.)
authors: Each row corresponds to a paper-author combination. The columns include: NLP Scholar paper id, author first name, and author last name. A paper with three authors contributes three rows to the table (all three have the same paper id, but different author names).
title-unigrams: Each row corresponds to a paper title and unigram combination. The columns include: NLP Scholar paper id and paper title unigram (a word that occurs in the title of the paper). A paper with five unique words in the title contributes five rows to the table (all five have the same paper id, but different words).
title-bigrams: Each row corresponds to a paper title and bigram combination. The columns include: NLP Scholar paper id and paper title bigram (a two-word sequence that occurs in the title of the paper). A paper with four unique bigrams in the title contributes four rows to the table (all four have the same paper id, but different bigrams).
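The table layout described above can be illustrated with a small sketch that turns one paper record into rows for the four tables. This is not the project's code: the function names, the "|" separator, the lowercasing in the paper id, the whitespace tokenization, and the sample record (title, author names, citation count) are all assumptions made for illustration.

```python
def make_paper_id(title, year, first_author_last_name):
    # Assumption: lowercase the parts and join them with "|"; the paper only
    # says the id concatenates title, year, and first author last name.
    return "|".join([title.strip().lower(), str(year),
                     first_author_last_name.strip().lower()])


def unique_in_order(items):
    """Drop repeats while keeping first-occurrence order."""
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen


def build_tables(records):
    """Return rows for the papers, authors, title-unigrams, title-bigrams tables."""
    papers, authors, unigrams, bigrams = [], [], [], []
    for rec in records:
        pid = make_paper_id(rec["title"], rec["year"], rec["authors"][0][1])
        papers.append({"paper_id": pid, "title": rec["title"],
                       "year": rec["year"], "venue": rec["venue"],
                       "citations": rec["citations"]})
        # One row per paper-author combination.
        for first, last in rec["authors"]:
            authors.append({"paper_id": pid, "first_name": first,
                            "last_name": last})
        # Assumption: titles are tokenized on whitespace after lowercasing.
        words = rec["title"].lower().split()
        for word in unique_in_order(words):
            unigrams.append({"paper_id": pid, "unigram": word})
        pairs = [" ".join(pair) for pair in zip(words, words[1:])]
        for bigram in unique_in_order(pairs):
            bigrams.append({"paper_id": pid, "bigram": bigram})
    return papers, authors, unigrams, bigrams


# Hypothetical record, not taken from the dataset.
sample = [{"title": "A Toy Parser for Toy Sentences", "year": 2019,
           "venue": "ACL", "citations": 12,
           "authors": [("Ada", "Alpha"), ("Ben", "Beta"), ("Cy", "Gamma")]}]

papers, authors, unigrams, bigrams = build_tables(sample)
print(len(papers), len(authors), len(unigrams), len(bigrams))
```

The sample title has six words but only five unique ones, so it contributes five rows to title-unigrams, matching the one-row-per-unique-word convention described above.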
Once the tables are loaded in Tableau, the following pairs of tables are each joined (inner join) using the NLP Scholar paper id: papers-authors, papers-title-unigrams, and papers-title-bigrams.
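The joins themselves are performed inside Tableau. As a rough illustration of what an inner join on the paper id produces, here is a minimal sketch in plain Python; the helper name, the column names, and the toy rows are hypothetical.

```python
from collections import defaultdict


def inner_join(left_rows, right_rows, key="paper_id"):
    """Pair every left row with every right row that shares the same key.

    Rows whose key has no match on the other side are dropped, which is
    what distinguishes an inner join from an outer join.
    """
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)
    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            joined.append({**left, **right})
    return joined


# Toy rows (illustration only).
papers = [
    {"paper_id": "p1", "title": "Paper One", "citations": 40},
    {"paper_id": "p2", "title": "Paper Two", "citations": 7},
]
authors = [
    {"paper_id": "p1", "last_name": "Alpha"},
    {"paper_id": "p1", "last_name": "Beta"},
    {"paper_id": "p3", "last_name": "Gamma"},  # no matching paper row
]

papers_authors = inner_join(papers, authors)
for row in papers_authors:
    print(row["paper_id"], row["title"], row["last_name"])
```

Note that the joined table repeats the paper's columns once per author, so any citation total computed on it should count each paper id only once.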

NLP Scholar Interactive Visualization
We developed multiple visualizations to explore various aspects of the data. We group and connect several individual visualizations in dashboards that allow one to explore several aspects of the data together. Clicking on data attributes such as year of publication or venue of publication in one visualization filters the data in all visualizations within a dashboard to show only the relevant data.
Figure 1 shows a screenshot of the main dashboard. At the top are the number of papers: total (A1) and by year of publication (A2). This allows one to see the growth or decline in the number of papers over the years.
Below it, we see the number of citations: total (B1) and by year of publication (B2). For a given year, the bar is partitioned into segments corresponding to individual papers. Each segment (paper) has a height that is proportional to the number of citations it has received and is assigned a colour at random. This allows one to quickly identify high-citation papers. Hovering over individual papers in B2 pops open an information box showing the paper title, authors, year of publication, publication venue, and #citations. Figure 6 in the Appendix shows a blow-up of B2 along with examples of the hover information box. Similarly, hovering over other parts of the dashboard shows corresponding information. (This is especially helpful when parts of the text are truncated or otherwise not visible due to space constraints.)
Further below, we see lists of papers (C) and authors (D); both are ordered by number of citations. Search boxes in the bottom right (E) allow searching for papers that have particular terms in the title or searching for papers by author name. One can also restrict the search to a span of years using the slider.
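The partition of each yearly bar in B2 into per-paper segments can be sketched as a simple grouping step. The dashboard itself is built in Tableau; the code below only illustrates the underlying data shape. The function name and toy rows are hypothetical, and ordering segments from most to least cited is an assumption, not something the paper specifies.

```python
from collections import defaultdict


def citation_bars(papers):
    """Group papers by year; each bar holds one segment per paper.

    A segment's height is the paper's citation count, so the bar's total
    height is the number of citations received by that year's papers.
    """
    by_year = defaultdict(list)
    for paper in papers:
        by_year[paper["year"]].append(paper)
    bars = {}
    for year in sorted(by_year):
        group = by_year[year]
        # Assumption: largest segments first.
        ordered = sorted(group, key=lambda p: p["citations"], reverse=True)
        bars[year] = {
            "total": sum(p["citations"] for p in group),
            "segments": [(p["title"], p["citations"]) for p in ordered],
        }
    return bars


# Toy rows (illustration only).
toy_papers = [
    {"title": "A", "year": 2002, "citations": 100},
    {"title": "B", "year": 2002, "citations": 300},
    {"title": "C", "year": 2003, "citations": 50},
]

bars = citation_bars(toy_papers)
for year, bar in bars.items():
    print(year, bar["total"], bar["segments"])
```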
Four other dashboards are also created that have the same five elements as the main dashboard (A through E), and additionally include a sixth element, F, to provide a focused search facility. This sixth element is a treemap that shows the most common venues and paper types (F1), title unigrams (F2), title bigrams (F3), or language mentions in the title (F4). (We only show one of the four treemaps at a time to prevent overwhelming the user.) The treemaps are shown in Figures 2 to 5, respectively.
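The quantity behind such a treemap is a frequency count over one column of the relevant table. A minimal sketch, assuming rows shaped like the title-unigrams table (one row per paper and unique title word); the function name, column names, and toy rows are hypothetical, and the rendering itself is done by Tableau.

```python
from collections import Counter


def treemap_counts(rows, field, top_k=10):
    """Return the top_k most frequent values of `field` with their counts."""
    counts = Counter(row[field] for row in rows)
    return counts.most_common(top_k)


# Toy title-unigram rows (illustration only).
unigram_rows = [
    {"paper_id": "p1", "unigram": "translation"},
    {"paper_id": "p1", "unigram": "neural"},
    {"paper_id": "p2", "unigram": "translation"},
    {"paper_id": "p3", "unigram": "translation"},
    {"paper_id": "p3", "unigram": "neural"},
    {"paper_id": "p4", "unigram": "parsing"},
]

top_terms = treemap_counts(unigram_rows, "unigram", top_k=2)
print(top_terms)
```

Because each paper contributes a given word at most once, each count is the number of papers whose title contains that word. The same function applies to a bigram or venue column.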

Data Explorations with NLP Scholar
Figure 1 A1 shows that the dataset includes 44,895 papers. A2 shows that the volume of papers published was considerably lower in the early years (1965 to 1989); there was a spurt in the 1990s; and substantial numbers since the year 2000. Also, note that the number of publications is considerably higher in alternate years. This is due to certain biennial conferences. Since 1998 the largest of such conferences has been LREC (in 2018 alone LREC had over 700 main conference papers and additional papers from its 29 workshops). COLING, another biennial conference (also occurring in the even years), has about 45% of the number of main conference papers as LREC.
B1 shows that AA papers have received ∼1.2 million citations (as of June 2019). The timeline graph in B2 shows that, with time, not only has the number of papers grown, but also the number of high-citation papers. We see a marked jump in the 1990s over the previous decades, but the 2000s are the most notable in terms of the high number of citations. The 2010s papers will likely surpass the 2000s papers in the years to come.
The most cited papers list (C) shows influential papers from machine translation, sentiment analysis, word embeddings, syntax, and semantics.
Among the authors (D), observe that Christopher Manning has not only received the most citations, he has also received almost three times as many citations as the next person in the list.
Search: NLP Scholar allows for search in a number of ways. Suppose we are interested in the topic of sentiment analysis. Then we can enter the relevant keywords in the search box: sentiment, valence, emotion, emotions, affect, etc. The visualizations are then filtered to present details of only those papers that have at least one of these keywords in the title. (Future work will allow for search in the abstract and the whole text.) Figure 7 in the Appendix shows the filtered result. The system identified 1,481 papers that each have at least one of the query terms in the title. They have received more than 85K citations. The citations timeline (B2 in Figure 7) shows that there were just a few scattered papers in the early years (1987 to 2000) that received a small number of citations. However, two papers in 2002 received a massive number of citations, and likely led to the substantially increased interest in the field. The number of papers has steadily increased since 2002, with close to 250 papers in 2018, showing that the area continues to enjoy considerable attention.
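The title-keyword filter described above can be approximated in a few lines. This is a sketch, not the Tableau implementation: the function name, the regular-expression tokenization, and the toy papers are assumptions. Terms are matched as exact word forms (no stemming), so emotion and emotions are separate query terms, and an optional year span mimics the slider.

```python
import re


def search_titles(papers, query_terms, year_range=None):
    """Return papers whose title contains at least one query term.

    Matching is on whole lowercased words without stemming. `year_range`
    is an optional inclusive (start, end) pair.
    """
    terms = {term.lower() for term in query_terms}
    hits = []
    for paper in papers:
        if year_range is not None:
            start, end = year_range
            if not (start <= paper["year"] <= end):
                continue
        # Assumption: a title word is a maximal run of word characters.
        words = set(re.findall(r"\w+", paper["title"].lower()))
        if words & terms:
            hits.append(paper)
    return hits


# Toy papers (illustration only).
toy_papers = [
    {"title": "Detecting Emotion in Tweets", "year": 2012, "citations": 30},
    {"title": "Emotions and Affect in Dialogue", "year": 2018, "citations": 5},
    {"title": "A Fast Dependency Parser", "year": 2016, "citations": 80},
    {"title": "Sentiment Lexicons Revisited", "year": 1999, "citations": 12},
]

hits = search_titles(toy_papers, ["emotion", "emotions", "affect"])
print(len(hits), sum(p["citations"] for p in hits))
```

Restricting the query to the single term emotion would return only the first toy paper, which illustrates why morphological variants are entered as additional query terms when the distinction does not matter.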
One can also fine-tune the search as desired. Say we are interested not in the broad area of sentiment analysis, but specifically in the work on emotions and affect. Then we can enter only emotion- and affect-related keywords. A disadvantage of using terms for search is that some terms are ambiguous and can pull in irrelevant articles; also, if a paper is about the topic of interest but its title does not have one of the standard keywords associated with the topic, then it might be left out. That said, if one does come across a paper that has the query term but is not in the topic of interest, one can right-click and exclude that paper from the visualization; and as mentioned before, future work will allow for searches in the abstract and full text as well. We are also currently working on clustering papers using the words in the articles as features.
Below are some more examples of interactions with NLP Scholar (figures are in the Appendix after references):
• Figure 8 shows the state of the visualization when one clicks the year 2016 in A2.
• Figures 9 and 10 show examples of author search by clicking on the authors list (D) (Christopher Manning and Lillian Lee).
• Figures 11 and 12 show the dashboard when one clicks on the Venue and Paper Type treemap (F1): ACL main conference papers and Workshop papers, respectively.
• Figures 13, 14, and 15 show examples of search for the terms parsing, statistical, and neural, respectively (accessed by clicking on the title unigrams treemap (F2)).
• Figures 16, 17, and 18 show the dashboard when one clicks on the Title Bigrams treemap (F3): machine translation, question answering, and word embeddings, respectively.
• Figures 19 and 20 show the dashboard when one clicks on the Languages treemap (F4): Chinese and Swahili, respectively.
Once the system goes live, we hope to collect further usage scenarios from the users at large.
For this work, we chose not to stem the terms in the titles before applying the search. This is because in some search scenarios, it is beneficial to distinguish the different morphological forms of a word. For example, papers with emotions in the titles are more likely to be dealing with multiple emotions than papers with the term emotion. When such distinctions do not need to be made, it is easy for users to include morphological variants as additional query terms.

Conclusions and Future Work
We presented NLP Scholar, an interactive visual explorer for the ACL Anthology. Notably, the tool also has access to citation information from Google Scholar. It includes several interconnected interactive visualizations (dashboards) that allow users to quickly and efficiently search for relevant related work by clicking on items within a visualization or through search boxes. All of the data and interactive visualizations associated with this work are freely available through the project homepage. Future work will provide additional functionalities such as search within abstracts and whole texts, document clustering, and automatically identifying related papers. We see NLP Scholar, with its dedicated visual search capabilities for NLP papers, as a useful complementary tool to existing resources such as Google Scholar. We also note that the approach presented here is not limited to the ACL Anthology or NLP papers; it can also be used to display papers from other sources, such as pre-print archives and anthologies of papers from other fields of study.

Figure 2: A treemap of popular NLP venues and paper types. Darker shades of green: higher volumes of papers.

Figure 3: A treemap of the most common unigrams in paper titles. Darker shades of green: higher frequencies.

Figure 4: A treemap of the most common bigrams in paper titles. Darker shades of green: higher frequencies.

Figure 5: A treemap of the most common language terms in titles. Darker shades of green: higher frequencies.

Figures 6 through 20 (in the pages ahead) show example interactions with NLP Scholar that were discussed in Section 5.

Figure 6: NLP Scholar: Hovering over individual papers in B2 pops open an information box showing the paper title, authors, year of publication, publication venue, and #citations.

Figure 7: NLP Scholar: After entering terms associated with sentiment analysis in the search box.

Figure 8: NLP Scholar: After clicking on the 2016 bar in the #papers by year viz (A2).

Figure 11: NLP Scholar: After clicking on 'ACL' in the venue and paper type treemap (F1).

Figure 12: NLP Scholar: After clicking on 'Workshops' in the venue and paper type treemap (F1).