EMNLP versus ACL: Analyzing NLP research over time

The conferences ACL (Association for Computational Linguistics) and EMNLP (Empirical Methods in Natural Language Processing) rank among the premier venues that track research developments in Natural Language Processing and Computational Linguistics. In this paper, we present a study on the research papers of approximately two decades from these two NLP conferences. We apply keyphrase extraction and corpus analysis tools to the proceedings from these venues and propose probabilistic and vector-based representations of the topics published in a venue for a given year. Next, similarity metrics are studied over pairs of venue representations to capture the progress of the two venues with respect to each other and over time.


Introduction
Scientific findings in a subject-area are typically published in conferences, journals, patents, and books in that domain. These research documents constitute valuable resources from the perspective of data mining applications. For instance, the citation links among research documents are used in computing bibliometric quantities for authors (Alonso et al., 2009) whereas topic models on research corpora are used to distinguish between influential and impactful researchers (Kataria et al., 2011) and to capture temporal topic trends (He et al., 2009).
Despite the potential benefits mentioned above and the free availability of most research proceedings in NLP through the ACL Anthology, the topical and temporal aspects of this corpus are yet to be fully studied in current literature. In this paper, we present our study on research proceedings of approximately two decades from two leading NLP conferences, namely ACL and EMNLP, to complement a previous study on this topic by Hall et al. (2008). To the best of our knowledge, we are the first to characterize the developments in the NLP domain using a comparative study of two of its leading publication venues. Our contributions are summarized below:
1. We represent the NLP research corpus from approximately two decades as a keyphrase-document matrix and apply Latent Dirichlet Allocation (Blei et al., 2003) to extract coherent topics from it (Newman et al., 2010).
2. We propose two novel representations for summarizing the venue proceedings in a given year. (1) The probabilistic representation expresses each venue as a probability distribution over topics, whereas (2) the TP-ICP representation captures topics that are the major focus in the venue for a particular year via Topic Proportion (TP) as well as topic importance as measured with inverse corpus proportion (ICP).
3. We apply Jensen-Shannon divergence and cosine similarity on our proposed venue representations to analyze the venues over time. Specifically, we ask the following questions: What are the popular topics in ACL and EMNLP in a particular year? Is the topical focus in EMNLP different from ACL? How did the topical focus in each venue change over time?
Organization: We describe our novel venue representations and the measures used to compare them in Section 2. The details of our datasets and experiments are presented in Section 3 along with results and observations. We summarize related research in Section 4 before concluding the paper in Section 5.

Methods
Let $Y = \{y_1, y_2, \ldots, y_T\}$ be the consecutive years in which the research proceedings are available from $V$, the set of publication venues under consideration ($V$ = {"ACL", "EMNLP"} in this paper). If $D$ is the set of all documents over the years, each document $d \in D$ is associated with $\{K_d, y, v\}$, where $K_d$ refers to the content of $d$ whereas $v$ and $y$ refer to the venue and year in which $d$ was published.

Venues as Probability Distributions
Let $t_1, t_2, \ldots, t_k$ denote the topics capturing the content of documents in $D$. Using probabilistic topic modeling and dimension reduction tools such as Latent Dirichlet Allocation or pLSA (Hofmann, 1999; Blei et al., 2003), we extract for each $d \in D$ the multinomial distribution $P(t_i \mid d)$, $i = 1, \ldots, k$, over the topics associated with $d$.
The venue-topic probability distribution $P(t_i \mid v_y)$ for a given (venue, year) pair $(v = l, y = m)$ can be computed using $D_{l,m}$, the set of documents published in venue $l$ in the year $m$. This probabilistic representation facilitates a quantitative measure to compare two venues: the divergence between the probability distributions of the two venues. The Kullback-Leibler divergence (KLD) between two (discrete) probability distributions $P$ and $Q$ is given by:
$$\mathrm{KLD}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$
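The venue representation and the divergence measures discussed above can be illustrated with a minimal sketch. It assumes that a venue-year distribution is the uniform average of its documents' topic distributions (the source does not confirm this exact aggregation), and it uses the standard definition of Jensen-Shannon divergence (JSD) as the symmetrized KLD against the midpoint distribution. The function names and the toy numbers are hypothetical.

```python
import math

def venue_topic_distribution(doc_topic_dists):
    """Average per-document topic distributions P(t_i | d) into one
    venue-year distribution. Uniform averaging is an assumption here."""
    n_docs = len(doc_topic_dists)
    k = len(doc_topic_dists[0])
    return [sum(d[i] for d in doc_topic_dists) / n_docs for i in range(k)]

def kld(p, q):
    """KLD(P || Q) with natural log. Terms with p_i = 0 contribute 0.
    Requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: mean KLD of P and Q to their midpoint M.
    M is positive wherever P or Q is, so kld() is always defined here."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

# Toy example with k = 3 topics and two documents per venue (made-up values).
acl = venue_topic_distribution([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
emnlp = venue_topic_distribution([[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]])
print(acl, emnlp, jsd(acl, emnlp))
```

Unlike KLD, JSD is symmetric and bounded above by log 2 (with natural logarithms), which makes it convenient for comparing a pair of venues without choosing a reference side.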

Venues as TP-ICP Vectors
Discrete probability distributions are often represented in computations as normalized vectors. For instance, the $P(t_i \mid v)$ values comprise the components of a $k$-dimensional vector. This topic proportion (TP) vector is similar to the normalized term frequency vector commonly used in Information Retrieval (IR). TP values are fractions indicating the share of a given topic among all topics covered in a particular year. Thus these values are higher for topics that are the major focus in the venue for a particular year. We also extend inverse document frequency, a popular concept that is used to weigh terms in IR, to define Inverse Corpus Proportion or ICP. Our objective in defining ICP is to capture the importance of a topic by diminishing the effect of topics that are common across all years. Let $TP_{v,y}(i)$ represent the proportion of topic $t_i$ in venue $v$ for year $y$.
Given two TP-ICP vectors $P = [p_1, p_2, \ldots, p_k]$ and $Q = [q_1, q_2, \ldots, q_k]$, the similarity between them using the cosine measure is given by:
$$\cos(P, Q) = \frac{\sum_{i=1}^{k} p_i q_i}{\sqrt{\sum_{i=1}^{k} p_i^2}\,\sqrt{\sum_{i=1}^{k} q_i^2}}$$
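A small sketch may help make the TP-ICP weighting and the cosine comparison concrete. The ICP formula used below, log(1 / corpus proportion), is only an assumed analogue of inverse document frequency and is not taken from the source. The function names and the toy vectors are likewise hypothetical.

```python
import math

def tp_icp(tp, corpus_tp):
    """Weight each topic proportion by an ASSUMED inverse corpus proportion,
    ICP(i) = log(1 / corpus_tp[i]), so that topics common across the whole
    corpus are down-weighted. Topics absent from the corpus get weight 0."""
    return [t * math.log(1.0 / c) if c > 0 else 0.0
            for t, c in zip(tp, corpus_tp)]

def cosine(p, q):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_p * norm_q)

# Toy example with k = 3 topics (made-up values).
corpus = [0.6, 0.3, 0.1]  # proportion of each topic over all years
acl_vec = tp_icp([0.5, 0.3, 0.2], corpus)
emnlp_vec = tp_icp([0.4, 0.2, 0.4], corpus)
print(cosine(acl_vec, emnlp_vec))
```

With this weighting, a topic that occupies the same share of a venue's papers as another but is rarer in the corpus as a whole receives the larger TP-ICP value.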

Keyphrases for representing documents
Corpus analysis tools often use bag-of-words models and term frequencies for representing documents (Heinrich, 2005). However, research documents are often well-structured, and contain various sections with author information, citations, and content-related sections such as abstract, related work, and experiments. To best represent the topical content of these documents, we harness the latest work on keyphrase extraction for research documents and represent documents using keyphrases (Hasan and Ng, 2014). We use the ExpandRank algorithm (Wan and Xiao, 2008) to extract the top n-grams for every $d \in D$. ExpandRank effectively combines PageRank values on word graphs with text similarity scores between documents to score n-grams for a document and was shown to outperform other unsupervised keyphrase extraction methods on research documents in the absence of other information such as citations (Gollapalli and Caragea, 2014).

Experiments
Datasets and setup: We crawled the ACLWeb for research papers from EMNLP and ACL from the year 1996 through 2014 using the Java-based crawler, Heritrix. The text from the PDF documents was extracted using the PDFBox software, after which simple rules similar to the ones used in CiteSeer (Li et al., 2006) were employed to extract the "body" of the research document. The numbers of papers for each year at the end of this process are listed in Table 2. It appears that the paper "intake" in each conference has gone up overall during the last decade, although occasionally the increase is due to colocation with related conferences such as IJCNLP and HLT. We construct the keyphrase-document matrix using the top-100 keyphrases of each document extracted with ExpandRank. The LDA implementation provided in Mallet (McCallum, 2002) was used to extract topics from this matrix. The LDA algorithm was run along with hyperparameter optimization (Minka, 2003) for different numbers of topics from 10 to 100 in increments of 10. We use the average corpus likelihood over ten randomly-initialized runs to choose the optimal number of topics that best "explain" the corpus (Heinrich, 2005). As indicated by the left plot in Figure 1, this optimum is obtained when the number of topics is 30.
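The matrix construction and the model selection step described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the paper: matrix entries are assumed to be binary presence indicators, keyphrase scores and log-likelihood values are made-up toy numbers, and the function names are hypothetical. In practice the likelihoods would come from the topic modeling toolkit.

```python
from statistics import mean

def keyphrase_document_matrix(doc_keyphrases, top_n=100):
    """Build a binary document-by-keyphrase matrix from scored keyphrases.
    doc_keyphrases maps a document ID to a list of (phrase, score) pairs.
    Binary entries are an assumption of this sketch."""
    top = {
        doc: [p for p, _ in sorted(scored, key=lambda ps: ps[1], reverse=True)[:top_n]]
        for doc, scored in doc_keyphrases.items()
    }
    vocab = sorted({p for phrases in top.values() for p in phrases})
    index = {p: j for j, p in enumerate(vocab)}
    matrix = {}
    for doc, phrases in top.items():
        row = [0] * len(vocab)
        for p in phrases:
            row[index[p]] = 1
        matrix[doc] = row
    return vocab, matrix

def select_num_topics(loglik_by_k):
    """Pick the number of topics whose runs have the highest mean likelihood."""
    return max(loglik_by_k, key=lambda k: mean(loglik_by_k[k]))

# Toy inputs (hypothetical scores and likelihoods).
docs = {
    "d1": [("topic model", 0.9), ("keyphrase extraction", 0.7), ("parser", 0.1)],
    "d2": [("machine translation", 0.8), ("topic model", 0.4)],
}
vocab, matrix = keyphrase_document_matrix(docs, top_n=2)
runs = {10: [-9.1, -9.3], 20: [-8.7, -8.9], 30: [-8.2, -8.4], 40: [-8.5, -8.6]}
best_k = select_num_topics(runs)
print(vocab, matrix, best_k)
```

Averaging over several randomly initialized runs, as the text describes, reduces the chance that a single lucky or unlucky initialization decides the number of topics.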

Results and Observations
The top phrases that reflect the "theme" captured by a topic are shown in Table 1. As indicated in this table, we are able to extract coherent topics from the corpus using LDA on a document-keyphrase matrix (AlSumait et al., 2009; Newman et al., 2010).
Top research topics in NLP: We select five timepoints {1996, 2000, 2005, 2010, 2014} that split the time period into roughly equal parts and examine the top topics for ACL and EMNLP at these time points. We rank the topics in each conference by their TP-ICP values and list the top 3 topics in the right table of Figure 1. "Semantic relation extraction", "sentiment analysis", and "topic models" are the top research topics in NLP last year (2014) whereas in the year 1996, the topics "noun phrase extraction", "summarization", "corpus modeling", and "speech recognition" dominated the NLP research arena. From the table, it can be seen that "information retrieval" (topicID: 18) ranks among the top topics in EMNLP for all three timepoints during 2000-2010 whereas "natural language generation" (topicID: 9) was consistently addressed during 1996-2005 in ACL.

Figure 1 (right table): top 3 topic IDs by TP-ICP value
Year  ACL         EMNLP
2014  27, 10, 20  27, 10, 19
2010  0, 5, 6     20, 19, 18
2005  9, 0, 19    24, 18, 0
2000  7, 9, 5     18, 23, 20
1996  28, 2, 5

EMNLP versus ACL: We compare the venues using JSD and cosine similarity measures in the middle plot of Figure 1. The plot shows decreasing divergence between the topical distributions over the years and increasing cosine similarity between the TP-ICP vectors for the venues. These trends indicate that over the two decades the two venues ACL and EMNLP seem to have "become like each other" although their topical focus was different during the initial years. Increasingly, both venues seem to publish papers on similar topics. This behavior could be interpreted to mean that the NLP research field is more stable now, with two of its leading conferences addressing problems on similar topics.
Changing topical focus over the years: In the first plot of Figure 2, we show the Jensen-Shannon divergence between the topic distributions of a particular venue for a given year y and (y − 1), the year preceding it. The curve indicates that in the years between 1997-2008, the rate of change from year to year is higher than in the years following 2008.

[Figure 2: the first plot shows the JSD between consecutive years; the remaining plots, including the right (EMNLP) plot, show the JSD between a timepoint and the preceding timepoints in the set {1996, 2000, 2005, 2010, 2014}.]
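The ranking step used in the results above, which lists the top 3 topics of a venue by TP-ICP value, can be sketched in a few lines. The function name and the scores are hypothetical, and ties are broken by the lower topic ID purely as a convention of this sketch.

```python
def top_topics(scores, n=3):
    """Return the IDs of the n highest-scoring topics, best first.
    scores[i] is the TP-ICP value of topic i for one venue and year."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# Toy TP-ICP values for five topics (made-up numbers).
venue_scores = [0.10, 0.50, 0.30, 0.05, 0.40]
print(top_topics(venue_scores))
```

Applying the same function to each (venue, year) pair yields one row cell of a table like the one in Figure 1.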

Related Work
Temporal analysis of corpora is an emerging research topic in text mining. Topic models were particularly investigated for detecting activity patterns in corpora annotated with time information (Huynh et al., 2008; Shen et al., 2009). Evolution of topics and their trends were studied on research corpora from NIPS (Wang and McCallum, 2006) as well as CiteSeer (He et al., 2009).
In contrast with existing approaches that seek to model the detection of new topics and their evolution, we focus on representing different venues pertaining to a research field and examine their development over time by comparing them against each other. In a similar study, Hall et al. (2008) examined the emergence of topics in NLP literature. They proposed "topic entropy" to measure the diversity in three conferences from the ACL Anthology during the years 1978-2006. They also noted that all the venues seem to converge in the topics they cover over the years based on the JSD between their topic distributions.