Friendships, Rivalries, and Trysts: Characterizing Relations between Ideas in Texts

Understanding how ideas relate to each other is a fundamental question in many domains, ranging from intellectual history to public communication. Because ideas are naturally embedded in texts, we propose the first framework to systematically characterize the relations between ideas based on their occurrence in a corpus of documents, independent of how these ideas are represented. Combining two statistics—cooccurrence within documents and prevalence correlation over time—our approach reveals a number of different ways in which ideas can cooperate and compete. For instance, two ideas can closely track each other’s prevalence over time, and yet rarely cooccur, almost like a “cold war” scenario. We observe that pairwise cooccurrence and prevalence correlation exhibit different distributions. We further demonstrate that our approach is able to uncover intriguing relations between ideas through in-depth case studies on news articles and research papers.


Introduction
Ideas exist in the mind, but are made manifest in language, where they compete with each other for the scarce resource of human attention. Milton (1644) used the "marketplace of ideas" metaphor to argue that the truth will win out when ideas freely compete; Dawkins (1976) similarly likened the evolution of ideas to natural selection of genes. We propose a framework to quantitatively characterize competition and cooperation between ideas in texts, independent of how they might be represented.
By "ideas", we mean any discrete conceptual units that can be identified as being present or absent in a document. In this work, we consider representing ideas using keywords and topics obtained in an unsupervised fashion, but our way of characterizing the relations between ideas could be applied to many other types of textual representations, such as frames (Card et al., 2015) and hashtags.
What does it mean for two ideas to compete in texts, quantitatively? Consider, for example, the issue of immigration. There are two strongly competing narratives about the roughly 11 million people who are residing in the United States without permission. One is "illegal aliens", who "steal" jobs and deny opportunities to legal immigrants; the other is "undocumented immigrants", who are already part of the fabric of society and deserve a path to citizenship (Merolla et al., 2013).
Although prior knowledge suggests that these two narratives compete, it is not immediately obvious what measures might reveal this competition in a corpus of writing about immigration. One question is whether or not these two ideas cooccur in the same documents. In the example above, these narratives are used by distinct groups of people with different ideologies. The fact that they don't cooccur is one clue that they may be in competition with each other.
However, cooccurrence is insufficient to express the selection process of ideas, i.e., some ideas fade out over time, while others rise in popularity, analogous to the populations of species in nature. Of the two narratives on immigration, we may expect one to win out at the expense of another as public opinion shifts. Alternatively, we might expect to see these narratives reinforcing each other, as both sides intensify their messaging in response to growing opposition, much like the U.S.S.R. and the U.S. during the cold war. To capture these possibilities, we use prevalence correlation over time.
Figure 1: Relations between ideas in the space of cooccurrence and prevalence correlation (prevalence correlation is shown explicitly and cooccurrence is encoded in row captions). We use topics from LDA (Blei et al., 2003) to represent ideas. Each topic is named with a pair of words that are most strongly associated with the topic in LDA. Subplots show examples of relations between topics found in U.S. newspaper articles on immigration from 1980 to 2016, color coded to match the description in text. The y-axis represents the proportion of news articles in a year (in our corpus) that contain the corresponding topic. All examples are among the top 3 strongest relations in each type except ("immigrant, undocumented", "illegal, alien"), which corresponds to the two competing narratives. We explain the formal definition of strength in §2.
Building on these insights, we propose a framework that combines cooccurrence within documents and prevalence correlation over time. This framework gives rise to four possible types of relation that correspond to the four quadrants in Fig. 1. We explain each type using examples from news articles in U.S. newspapers on immigration from 1980 to 2016. Here, we have used LDA to identify ideas in the form of topics, and we denote each idea with a pair of words most strongly associated with the corresponding topic.
Friendship (correlated over time, likely to cooccur). The "immigrant, undocumented" topic tends to cooccur with "obama, president" and both topics have been rising during the period of our dataset, likely because the "undocumented immigrants" narrative was an important part of Obama's framing of the immigration issue (Haynes et al., 2016).
Head-to-head (anti-correlated over time, unlikely to cooccur). "immigrant, undocumented" and "illegal, alien" are in a head-to-head competition: these two topics rarely cooccur, and "immigrant, undocumented" has been growing in prevalence, while the usage of "illegal, alien" in newspapers has been declining. This observation agrees with a report from Pew Research Center (Guskin, 2013).
Tryst (anti-correlated over time, likely to cooccur). The two off-diagonal examples use topics related to law enforcement. Overall, "immigration, deportation" and "detainee, detention" often cooccur but "detainee, detention" has been declining, while "immigration, deportation" has been rising. This possibly relates to the promises to overhaul the immigration detention system (Kalhan, 2010).
Arms-race (correlated over time, unlikely to cooccur). One of the above law enforcement topics ("immigration, deportation") and a topic on the Republican party ("republican, party") hold an arms-race relation: they are both growing in prevalence over time, but rarely cooccur, perhaps suggesting an underlying common cause.
Note that our terminology describes the relations between ideas in texts, not necessarily between the entities to which the ideas refer. For example, we find that the relation between "Israel" and "Palestine" is "friendship" in news articles on terrorism, based on their prevalence correlation and cooccurrence in that corpus.
We introduce the formal definition of our framework in §2 and apply it to news articles on five issues and research papers from ACL Anthology and NIPS as testbeds. We operationalize ideas using topics (Blei et al., 2003) and keywords (Monroe et al., 2008).
To explore whether the four relation types exist and how strong these relations are, we first examine the marginal and joint distributions of cooccurrence and prevalence correlation (§3). We find that cooccurrence shows a unimodal normal-shaped distribution but prevalence correlation demonstrates more diverse distributions across corpora. As we would expect, there are, in general, more and stronger friendship and head-to-head relations than arms-race and tryst relations.
Second, we demonstrate the effectiveness of our framework through in-depth case studies (§4). We not only validate existing knowledge about some news issues and research areas, but also identify hypotheses that require further investigation. For example, using keywords to represent ideas, a top pair with the tryst relation in news articles on terrorism is "arab" and "islam"; they are likely to cooccur, but "islam" is rising in relative prevalence while "arab" is declining. This suggests a conjecture that the news media have increasingly linked terrorism to a religious group rather than an ethnic group. We also show relations between topics in ACL that center around machine translation.
Our work is a first step towards understanding relations between ideas from text corpora, a complex and important research question. We provide some concluding thoughts in §6.

Computational Framework
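The aim of our computational framework is to explore relations between ideas. We thus assume that the set of relevant ideas has been identified, and those expressed in each document have been tabulated. Our open-source implementation is at https://github.com/Noahs-ARK/idea_relations/. In the following, we introduce our formal definitions and datasets.

\[
\forall x, y \in I, \quad \mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)} \tag{1}
\]

\[
r(x, y) = \frac{\sum_{t=1}^{T} \bigl(P(x \mid t) - \overline{P(x)}\bigr)\bigl(P(y \mid t) - \overline{P(y)}\bigr)}{\sqrt{\sum_{t=1}^{T} \bigl(P(x \mid t) - \overline{P(x)}\bigr)^{2}}\,\sqrt{\sum_{t=1}^{T} \bigl(P(y \mid t) - \overline{P(y)}\bigr)^{2}}} \tag{2}
\]

Here $P(x \mid t)$ is the fraction of documents at timestep $t$ that contain idea $x$, and $\overline{P(x)}$ is its mean over the $T$ timesteps.
Figure 2: Eq. 1 is the empirical pointwise mutual information for two ideas, our measure of cooccurrence of ideas; note that we use add-one smoothing in estimating PMI. Eq. 2 is the Pearson correlation between two ideas' prevalence over time.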

Cooccurrence and Prevalence Correlation
As discussed in the introduction, we focus on two dimensions to quantify relations between ideas: 1. cooccurrence reveals to what extent two ideas tend to occur in the same contexts; 2. similarity between the relative prevalence of ideas over time reveals how two ideas relate in terms of popularity or coverage.
Our input is a collection of documents, each represented by a set of ideas and indexed by time. We denote a static set of ideas as $I$ and a text corpus that consists of these ideas as $C = \{D_1, \ldots, D_T\}$, where $D_t = \{d^t_1, \ldots, d^t_{N_t}\}$ gives the collection of documents at timestep $t$, and each document, $d^t_k$, is represented as a subset of ideas in $I$. Here $T$ is the total number of timesteps, and $N_t$ is the number of documents at timestep $t$. It follows that the total number of documents is $N = \sum_{t=1}^{T} N_t$.
In order to formally capture the two dimensions above, we employ two commonly-used statistics. First, we use empirical pointwise mutual information (PMI) to capture the cooccurrence of ideas within the same document (Church and Hanks, 1990); see Eq. 1 in Fig. 2. Positive PMI indicates that ideas occur together more frequently than would be expected if they were independent, while negative PMI indicates the opposite.
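To make the cooccurrence statistic concrete, the following is a minimal sketch of document-level PMI over all pairs of ideas. The function name `pmi_scores` and the exact placement of the add-one smoothing (one pseudo-count on each raw count) are assumptions made for illustration; the released implementation may smooth differently.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Return {(x, y): PMI} for every pair of ideas, with x < y.

    `documents` is a list of sets of ideas (timesteps are ignored here,
    since cooccurrence is measured within documents over the whole corpus).
    """
    n_docs = len(documents)
    single = Counter()  # number of documents containing each idea
    joint = Counter()   # number of documents containing both ideas
    for ideas in documents:
        single.update(ideas)
        joint.update(combinations(sorted(ideas), 2))

    scores = {}
    for x, y in combinations(sorted(single), 2):
        # Assumed smoothing: add one to each raw count before taking ratios.
        numerator = n_docs * (1 + joint[(x, y)])
        denominator = (1 + single[x]) * (1 + single[y])
        scores[(x, y)] = math.log(numerator / denominator)
    return scores


docs = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
scores = pmi_scores(docs)
```

In this toy corpus, "a" and "b" always appear together and receive positive PMI, while "a" and "c" never share a document and receive negative PMI.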
Second, we compute the correlation between normalized document frequency of ideas to capture the relation between the relative prevalence of ideas across documents over time; see Eq. 2 in Fig. 2. Positive r indicates that two ideas have similar prevalence over time, while negative r suggests two anti-correlated ideas (i.e., when one goes up, the other goes down).
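As an illustration of prevalence correlation, the sketch below computes each idea's normalized document frequency per timestep and then the Pearson correlation between two such series. The helper names `prevalence` and `pearson` are hypothetical, and returning 0.0 when a series is constant is an assumption rather than something the paper specifies.

```python
import math


def prevalence(corpus, idea):
    """Fraction of documents containing `idea` at each timestep.

    `corpus` is a list over timesteps; each entry is a list of idea sets.
    """
    return [sum(idea in doc for doc in docs) / len(docs) for docs in corpus]


def pearson(u, v):
    """Pearson correlation between two equal-length series."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an undefined correlation as zero
    return cov / (norm_u * norm_v)


corpus = [
    [{"y"}, {"y"}, {"x", "y"}, {"y"}],  # timestep 1: x rare, y everywhere
    [{"x"}, {"y"}, {"x"}, {"y"}],       # timestep 2: balanced
    [{"x"}, {"x"}, {"x"}, {"y"}],       # timestep 3: x common, y rare
]
r_xy = pearson(prevalence(corpus, "x"), prevalence(corpus, "y"))
```

Here "x" rises while "y" falls, so `r_xy` is strongly negative.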
The four types of relations in the introduction can now be obtained using PMI and r, which capture cooccurrence and prevalence correlation respectively. We further define the strength of the relation between two ideas as the absolute value of the product of their PMI and r scores:

\[
\forall x, y \in I, \quad \mathrm{strength}(x, y) = \left| \mathrm{PMI}(x, y) \times r(x, y) \right| \tag{3}
\]
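The mapping from the two statistics to a relation type and a strength can be sketched as follows. The function names are hypothetical, and splitting the quadrants exactly at zero (with pairs on an axis left unclassified) is an assumption based on the four-quadrant description.

```python
def relation_type(pmi, r):
    """Name the quadrant that a (PMI, r) pair falls into."""
    if pmi > 0 and r > 0:
        return "friendship"    # likely to cooccur, correlated over time
    if pmi > 0 and r < 0:
        return "tryst"         # likely to cooccur, anti-correlated
    if pmi < 0 and r > 0:
        return "arms-race"     # unlikely to cooccur, correlated
    if pmi < 0 and r < 0:
        return "head-to-head"  # unlikely to cooccur, anti-correlated
    return None                # assumption: pairs on an axis get no type


def strength(pmi, r):
    """Eq. 3: absolute value of the product of PMI and r."""
    return abs(pmi * r)
```

For example, `relation_type(-1.2, -0.5)` names a head-to-head pair, and `strength(-1.2, -0.5)` is approximately 0.6.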

Datasets and Representation of Ideas
We use two types of datasets to validate our framework: news articles and research papers. We choose these two domains because competition between ideas has received significant interest in history of science (Kuhn, 1996) and research on framing (Chong and Druckman, 2007; Entman, 1993; Gitlin, 1980; Lakoff, 2014). Furthermore, interesting differences may exist in these two domains as news evolves with external events and scientific research progresses through innovations.
• News articles. We follow the strategy in Card et al. (2015) to obtain news articles from LexisNexis on five issues: abortion, immigration, same-sex marriage, smoking, and terrorism. We search for relevant articles using LexisNexis subject terms in U.S. newspapers from 1980 to 2016. Each of these corpora contains more than 25,000 articles. Please refer to the supplementary material for details.
• Research papers. We consider full texts of papers from two communities: our own ACL community captured by papers from ACL, NAACL, EMNLP, and TACL from 1980 to 2014 (Radev et al., 2009); and the NIPS community from 1987 to 2016. There are 4.8K papers from the ACL community and 6.6K papers from the NIPS community. The processed datasets are available at https://chenhaot.com/pages/idea-relations.html.
In order to operationalize ideas in a text corpus, we consider two ways to represent ideas.
• Topics. We extract topics from each document by running LDA (Blei et al., 2003) on each corpus C. In all datasets, we set the number of topics to 50. Formally, I is the 50 topics learned from the corpus, and each document is represented as the set of topics that are present with greater than 0.01 probability in the topic distribution for that document.
• Keywords. We identify a list of distinguishing keywords for each corpus by comparing its word frequencies to the background frequencies found in other corpora using the informative Dirichlet prior model in Monroe et al. (2008). We set the number of keywords to 100 for all corpora. For news articles, the background corpus for each issue is comprised of all articles from the other four issues. For research papers, we use NIPS as the background corpus for ACL and vice versa to identify the core concepts of each of these research areas. Formally, I is the 100 top distinguishing keywords in the corpus and each document is represented as the set of keywords within I that are present in the document. Refer to the supplementary material for a list of example keywords in each corpus.
In both procedures, we lemmatize all words and add common bigram phrases to the vocabulary following Mikolov et al. (2013). Note that in our analysis, ideas are only present or absent in a document, and a document can in principle be mapped to any subset of ideas in I. In our experiments, 90% of documents are marked as containing between 7 and 14 ideas using topics, and between 8 and 33 ideas using keywords.
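The two representations both reduce a document to a set of ideas. A minimal sketch of that final step, assuming a per-document topic distribution and a keyword list are already available (fitting LDA and selecting the keywords are not shown, and both function names are hypothetical):

```python
def topics_to_ideas(topic_distribution, threshold=0.01):
    """Indices of topics whose probability exceeds the threshold."""
    return {k for k, p in enumerate(topic_distribution) if p > threshold}


def keywords_to_ideas(tokens, keywords):
    """Keywords from the idea set that occur in the tokenized document."""
    return set(tokens) & set(keywords)


topic_ideas = topics_to_ideas([0.5, 0.005, 0.01, 0.485])
keyword_ideas = keywords_to_ideas(["border", "wall", "border"], {"border", "visa"})
```

The comparison is strict, matching "greater than 0.01" in the text, so a topic with probability exactly 0.01 is treated as absent.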

Characterizing the Space of Relations
To provide an overview of the four relation types in Fig. 1, we first examine the empirical distributions of the two statistics of interest across pairs of ideas. In most exploratory studies, however, we are most interested in pairs that exemplify each type of relation, i.e., the most extreme points in each quadrant. We thus look at these pairs in each corpus to observe how the four types differ in salience across datasets.

Empirical Distribution Properties
To the best of our knowledge, the distributions of pairwise cooccurrence and prevalence correlation have not been examined in previous literature. We thus first investigate the marginal distributions of cooccurrence and prevalence correlation and then their joint distribution. Fig. 3 shows three examples: two from news articles and one from research papers. We will also focus our case studies on these three corpora in §4. The corresponding plots for keywords have been relegated to supplementary material due to space limitations.
Cooccurrence tends to be unimodal but not normal. In all of our datasets, pairwise cooccurrence (PMI) presents a unimodal distribution that somewhat resembles a normal distribution, but it is rarely precisely normal. We cannot reject the hypothesis that it is unimodal for any dataset (using topics or keywords) using the dip test (Hartigan and Hartigan, 1985), though D'Agostino's $K^2$ test (D'Agostino et al., 1990) rejects normality in almost all cases.
Cooccurrence is positively correlated with prevalence correlation. In all of our datasets, cooccurrence is positively correlated with prevalence correlation whether we use topics or keywords to represent ideas, although the Pearson correlation coefficients vary. This suggests that there are more friendship and head-to-head relations than tryst and arms-race relations. Based on the results of kernel density estimation, we also observe that this correlation is often loose, e.g., in ACL topics, cooccurrence spreads widely at each mode of prevalence correlation.

Relative Strength of Extreme Pairs
We are interested in how our framework can identify intriguing relations between ideas. These potentially interesting pairs likely correspond to the extreme points in each quadrant instead of the ones around the origin, where PMI and prevalence correlation are both close to zero. Here we compare the relative strength of extreme pairs in each dataset. We will discuss how these extreme pairs confirm existing knowledge and suggest new hypotheses via case studies in §4.
For each relation type, we average the strengths of the 25 pairs with the strongest relations in that type, with strength defined in Eq. 3. This heuristic (henceforth collective strength) allows us to collectively compare the strengths of the most prominent friendship, tryst, arms-race, and head-to-head relations. The results are not sensitive to the choice of 25.
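Collective strength can be sketched as a top-k average within each quadrant. The function name, the input format (an iterable of (PMI, r) tuples), and the handling of pairs that lie on an axis are assumptions made for illustration.

```python
def collective_strength(scored_pairs, top_k=25):
    """Mean strength of the top_k strongest pairs in each relation type."""
    by_type = {"friendship": [], "tryst": [], "arms-race": [], "head-to-head": []}
    for pmi, r in scored_pairs:
        if pmi == 0 or r == 0:
            continue  # assumption: pairs on an axis belong to no type
        if pmi > 0:
            kind = "friendship" if r > 0 else "tryst"
        else:
            kind = "arms-race" if r > 0 else "head-to-head"
        by_type[kind].append(abs(pmi * r))  # strength as in Eq. 3

    result = {}
    for kind, strengths in by_type.items():
        top = sorted(strengths, reverse=True)[:top_k]
        result[kind] = sum(top) / len(top) if top else 0.0
    return result


pairs = [(1.0, 0.5), (2.0, 0.5), (0.1, 0.1), (-1.0, -0.8), (1.0, -0.2), (-0.5, 0.4)]
summary = collective_strength(pairs, top_k=2)
```

With `top_k=2`, the weak friendship pair (0.1, 0.1) is excluded from the friendship average, which is the point of restricting attention to the extreme pairs.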
Fig. 4 shows the collective strength of the four types in all of our datasets. The most common ordering is: friendship > head-to-head > arms-race > tryst.
The fact that friendship and head-to-head relations are strong is consistent with the positive correlation between cooccurrence and prevalence correlation. In news, friendship is the strongest relation type, but head-to-head is the strongest in ACL topics and NIPS topics. This suggests, unsurprisingly, that there are stronger head-to-head competitions (i.e., one idea takes over another) between ideas in scientific research than in news. We also see that topics show greater strength in our scientific article collections, while keywords dominate in news, especially in friendship. We conjecture that terms in scientific literature are often overloaded (e.g., a tree could be a parse tree or a decision tree), necessitating some abstraction when representing ideas. In contrast, news stories are more self-contained and seek to employ consistent usage.

Exploratory Studies
We present case studies based on strongly related pairs of ideas in the four types of relation. Throughout this section, "rank" refers to the rank of the relation strength between a pair of ideas in its corresponding relation type.

International Relations in Terrorism
Following a decade of declining violence in the 90s, the events of September 11, 2001 precipitated a dramatic increase in concern about terrorism, and a major shift in how it was framed (Kern et al., 2003). As a showcase, we consider a topic which encompasses much of the U.S. government's response to terrorism: "federal, state". We observe two topics engaging in an "arms race" with this one: "afghanistan, taliban" and "pakistan, india". These correspond to two geopolitical regions closely linked to the U.S. government's concern with terrorism, and both were sites of U.S. military action during the period of our dataset. Events abroad and the U.S. government's responses follow the arms-race pattern, each holding increasing attention with the other, likely because they share the same underlying cause.
We also observe two head-to-head rivals to the "federal, state" topic: "iran, libya" and "israel, palestinian". While these topics correspond to regions that are hotly debated in the U.S., their coverage in news tends not to correlate temporally with the U.S. government's responses to terrorism, at least during the time period of our corpus. Discussion of these regions was more prevalent in the 80s and 90s, with declining media coverage since then (Kern et al., 2003).
The relations between these topics are consistent with structural balance theory (Cartwright and Harary, 1956; Heider, 1946), which suggests that the enemy of an enemy is a friend. The "afghanistan, taliban" topic has the strongest friendship relation with the "pakistan, india" topic, i.e., they are likely to cooccur and are positively correlated in prevalence. Similarly, the "iran, libya" topic is a close "friend" with the "israel, palestinian" topic (ranked 8th in friendship).
When using keywords to represent ideas, we observe similar relations between the term homeland security and terms related to the above foreign countries. In addition, we highlight an interesting but unexpected tryst relation between arab and islam (Fig. 6). It is not surprising that these two words tend to cooccur in the same news articles, but the usage of islam in the news is increasing while arab is declining. The increasing prevalence of islam and decreasing prevalence of arab over this time period can also be seen, for example, using Google's n-gram viewer, but it of course provides no information about cooccurrence. This trend has not been previously noted to the best of our knowledge, although an article in the Huffington Post called for news editors to distinguish Muslim from Arab. Our observation suggests a conjecture that the news media have increasingly linked terrorism to a religious group rather than an ethnic group, perhaps in part due to the tie between the events of 9/11 and Afghanistan, which is not an Arab or Arabic-speaking country. We leave it to further investigation to confirm or reject this hypothesis.
To further demonstrate the effectiveness of our approach, we compare a pair's rank using only cooccurrence or prevalence correlation with its rank in our framework. Table 1 shows the results for three pairs above. If we had looked at only cooccurrence or prevalence correlation, we would probably have missed these interesting pairs.

Pair                                                           PMI    Corr
"federal, state", "afghanistan, taliban" (#2 in arms-race)      43      99
"federal, state", "iran, libya" (#2 in head-to-head)            36      56
arab, islam (#2 in tryst)                                      106   1,494

Table 1: Ranks of pairs by using the absolute value of only cooccurrence or prevalence correlation.
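The comparison in Table 1 can be sketched as three rankings of the same pair: by |PMI| alone, by |r| alone, and by strength within the pair's own relation type. The function names and the toy scores below are hypothetical and are not taken from our corpora; treating a zero score as non-positive when assigning quadrants is also an assumption.

```python
def rank_by(scores, target):
    """1-based rank of `target` when keys are sorted by descending score."""
    ordered = sorted(scores, key=lambda pair: scores[pair], reverse=True)
    return ordered.index(target) + 1


def compare_ranks(stats, target):
    """Rank `target` by |PMI| only, |r| only, and strength within its type.

    `stats` maps each pair of ideas to a (PMI, r) tuple.
    """
    def quadrant(value):
        return (value[0] > 0, value[1] > 0)

    abs_pmi = {pair: abs(v[0]) for pair, v in stats.items()}
    abs_r = {pair: abs(v[1]) for pair, v in stats.items()}
    same_type = {
        pair: abs(v[0] * v[1])
        for pair, v in stats.items()
        if quadrant(v) == quadrant(stats[target])
    }
    return {
        "pmi_only": rank_by(abs_pmi, target),
        "corr_only": rank_by(abs_r, target),
        "framework": rank_by(same_type, target),
    }


toy_stats = {
    ("a", "b"): (2.0, 0.1),
    ("c", "d"): (0.1, 0.9),
    ("e", "f"): (1.0, -0.6),
    ("g", "h"): (1.5, 0.2),
    ("i", "j"): (0.8, -0.1),
}
ranks = compare_ranks(toy_stats, ("e", "f"))
```

In the toy data, the pair ("e", "f") is only third by |PMI| and second by |r|, yet it is the strongest tryst, mirroring how a pair can be prominent in the combined view without topping either single statistic.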

Ethnicity Keywords in Immigration
In addition to results on topics in §1, we observe unexpected patterns about ethnicity keywords in immigration news. Our observation starts with a top tryst relation between latino and asian. Although these words are likely to cooccur, their prevalence trajectories differ, with the discussion of Asian immigrants in the 1990s giving way to a focus on the word latino from 2000 onward. Possible theories to explain this observation include that undocumented immigrants are generally perceived as a Latino issue, or that Latino voters are increasingly influential in U.S. elections.
Furthermore, latino holds head-to-head relations with two subgroups of Latin American immigrants: haitian and cuban. In particular, the strength of the relation with haitian is ranked #18 in head-to-head relations. Meanwhile, haitian and cuban have a friendship relation, which is again consistent with structural balance theory. The decreasing prevalence of haitian and cuban perhaps speaks to the shifting geographical focus of recent immigration to the U.S., and issues of the Latino panethnicity. In fact, a majority of Latinos prefer to identify with their national origin relative to the pan-ethnic terms (Taylor et al., 2012). However, we should also note that much of this coverage relates to a set of specific refugee crises, temporarily elevating the political importance of these nations in the U.S. Nevertheless, the underlying social and political reasons behind these head-to-head relations are worth further investigation.
Figure 7: Relations between ethnicity keywords in immigration news (HtH for head-to-head): latino holds a tryst relation with asian and head-to-head relations with two subgroups from Latin America, haitian and cuban. We do not show the relations between asian and haitian, cuban, because their strength is close to 0.

Relations between Topics in ACL
Finally, we analyze relations between topics in the ACL Anthology. It turns out that "machine translation" is at a central position among top ranked relations in all the four types (Fig. 8). It is part of the strongest relation in all four types except tryst (ranked #5).
The full relation graph presents further patterns. Friendship demonstrates transitivity: both "machine translation" and "word alignment" have similar relations with other topics. But such transitivity does not hold for tryst: although the prevalence of "rule, forest methods" is anti-correlated with both "machine translation" and "sentiment analysis", "sentiment analysis" seldom cooccurs with "rule, forest methods" because "sentiment analysis" is seldom built on parsing algorithms. Similarly, "rule, forest methods" and "discourse (coherence)" hold an arms-race relation: they do not tend to cooccur and both decline in relative prevalence as "machine translation" rises.
The prevalence of each of these ideas in comparison to machine translation is shown in Fig. 9, which reveals additional detail.

Related Work
We present two strands of related studies in addition to what we have discussed.
Trends in ideas. Most studies have so far examined the trends of ideas individually (Michel et al., 2011; Hall et al., 2008; Rule et al., 2015). For instance, Hall et al. (2008) present various trends in our own computational linguistics community, including the rise of statistical machine translation. More recently, rhetorical framing has been used to predict these sorts of patterns (Prabhakaran et al., 2016). An exception is that Shi et al. (2010) use prevalence correlation to analyze lag relations between topics in publications and research grants. Anecdotally, Grudin (2009) observes a "head-to-head" relation between artificial intelligence and human-computer interaction in research funding. However, to our knowledge, our work is the first study to systematically characterize relations between ideas.
Representation of ideas. In addition to topics and keywords, studies have also sought to operationalize the "memes" metaphor using quotes and text reuse in the media (Leskovec et al., 2009; Niculae et al., 2015; Smith et al., 2013; Wei et al., 2013). In topic modeling literature, Blei and Lafferty (2006) also point out that topics do not cooccur independently and explicitly model the cooccurrence within documents.

Concluding Discussion
We proposed a method to characterize relations between ideas in texts through the lens of cooccurrence within documents and prevalence correlation over time. For the first time, we observe that the distribution of pairwise cooccurrence is unimodal, while the distribution of pairwise prevalence correlation is not always unimodal, and show that they are positively correlated. This combination suggests four types of relations between ideas, and these four types are all found to varying extents in our experiments. We illustrate our computational method by exploratory studies on news corpora and scientific research papers. We not only confirm existing knowledge but also suggest hypotheses around the usage of arab and islam in terrorism and latino and asian in immigration.
It is important to note that the relations found using our approach depend on the nature of the representation of ideas and the source of texts. For instance, we cannot expect relations found in news articles to reflect shifts in public opinion if news articles do not effectively track public opinion.
Our method is entirely observational. It remains as a further stage of analysis to understand the underlying reasons that lead to these relations between ideas. In scientific research, for example, it could simply be the progress of science, i.e., newer ideas overtake older ones deemed less valuable at a given time; on the other hand, history suggests that it is not always the correct ideas that are most expressed, and many other factors may be important. Similarly, in news coverage, underlying sociological and political situations have significant impact on which ideas are presented, and how.
There are many potential directions to improve our method to account for complex relations between ideas. For instance, we assume that both ideas and relations are statically grounded in keywords or topics. In reality, ideas and relations both evolve over time: a tryst relation might appear as friendship if we focus on a narrower time period. Similarly, new ideas show up and even the same idea may change over time and be represented by different words.

Figure 3: Overall distributions of cooccurrence and prevalence correlation. In the main plot, each point represents a pair of ideas: color density shows the kernel density estimation of the joint distribution (Scott, 2015). The plots along the axes show the marginal distribution of the corresponding dimension. In each plot, we give the Pearson correlation, and all Pearson correlations' p-values are less than $10^{-40}$. In these plots, we use topics to represent ideas.

Figure 4: Collective strength of the four relation types in each dataset (news is the average of the news corpora and research is for ACL and NIPS). Fig. 4a uses topics to represent ideas, while Fig. 4b uses keywords to represent ideas. Each bar presents the average strength of the top 25 pairs in a relation type in the corresponding dataset. Error bars represent standard errors calculated in the usual way, but note that since the top 25 pairs are not random samples, they cannot be interpreted in the usual way.

Figure 5: Fig. 5a shows the relations between the "federal, state" topic and four international topics. Edge colors indicate relation types and the number in an edge label presents the ranking of its strength in the corresponding relation type. Fig. 5b and Fig. 5c represent concrete examples in Fig. 5a: "federal, state" and "afghanistan, taliban" follow similar trends, although "afghanistan, taliban" fluctuates over time due to significant events such as the September 11 attacks in 2001 and the death of Bin Laden in 2011; while "iran, libya" is negatively correlated with "federal, state". In fact, more than 70% of terrorism news in the 80s contained the "iran, libya" topic.

Figure 6: Tryst relation between arab and islam using keywords to represent ideas (#2 in tryst): these two words tend to cooccur but are anti-correlated in prevalence over time. In particular, islam was rarely used in coverage of terrorism in the 1980s.

Figure 8: Top relations between the topics in ACL Anthology. The top 10 words for the "rule, forest methods" topic are rule, grammar, derivation, span, algorithm, forest, parsing, figure, set, string.

Figure 9: Relations between topics in ACL Anthology in the space of cooccurrence and prevalence correlation (prevalence correlation is shown explicitly and cooccurrence is encoded in row captions), color coded to match the text. The y-axis represents the relative proportion of papers in a year that contain the corresponding topic. The top 10 words for the "rule, forest methods" topic are rule, grammar, derivation, span, algorithm, forest, parsing, figure, set, string.
Prevalence correlation exhibits diverse distributions. Pairwise prevalence correlation follows different distributions in news articles compared to research papers: it is unimodal in news articles, but not in ACL or NIPS. The dip test only rejects the unimodality hypothesis in NIPS. None follow normal distributions based on D'Agostino's $K^2$ test.