A Utility Model of Authors in the Scientific Community

Authoring a scientific paper is a complex process involving many decisions. We introduce a probabilistic model of some of the important aspects of that process: that authors have individual preferences; that writing a paper requires trading off among the preferences of authors as well as extrinsic rewards in the form of community response to their papers; and that preferences (of individuals and the community) and tradeoffs vary over time. Variants of our model lead to improved predictive accuracy of citations given texts and texts given authors. Further, our model's posterior suggests an interesting relationship between seniority and author choices.


Introduction
Why do we write? As researchers, we write papers to report new scientific findings, but this is not the whole story. Authoring a paper involves a huge amount of decision-making that may be influenced by factors such as institutional incentives, attention-seeking, and pleasure derived from research on topics that excite us.
We propose that text collections and associated metadata can be analyzed to reveal optimizing behavior by authors. Specifically, we consider the ACL Anthology Network Corpus (Radev et al., 2013), along with author and citation metadata. Our main contribution is a method that infers two kinds of quantities about an author: her associations with interpretable research topics, which might correspond to relative expertise or merely to preferences among topics to write about; and a tradeoff coefficient that estimates the extent to which she writes papers that will be cited versus papers close to her preferences.
The method is based on a probabilistic model that incorporates assumptions about how authors decide what to write, how joint decisions work when papers are coauthored, and how individual and community preferences shift over time. Central to our model is a low-dimensional topic representation shared by authors (in defining preferences), papers (i.e., what they are "about"), and the community as a whole (in responding with citations). This method can be used to make predictions; empirically, we find that: 1. topics discovered by generative models outperform a strong text regression baseline (Yogatama et al., 2011) for citation count prediction; 2. such models do better at that task without modeling author utility as we propose; and 3. the author utility model leads to better predictive accuracy when answering the question, "given a set of authors, what are they likely to write?" This method can also be used for exploration and to generate hypotheses. We provide an intriguing example relating author tradeoffs to age within the research community.

Notation and Representations
In the following, a document $d$ will be represented by a vector $\theta_d \in \mathbb{R}^K$. The dimensions of this vector might correspond to elements of a vocabulary, giving a "bag of words" encoding; in this work they correspond to latent topics. Document $d$ is assumed to elicit from the scientific community an observable response $y_d$, which might correspond to the number of citations (or downloads) of the paper.
Each author $a$ is associated with a vector $\eta_a \in \mathbb{R}^K$, with dimensions indexed the same as documents. Below, we will refer to this vector as $a$'s "preferences," though it is important to remember that they could also capture an author's expertise, and the model makes no attempt to distinguish between them. We use "preferences" because it is a weaker theoretical commitment.

Modeling Utility
Our main assumption about author $a$ is that she is an optimizer: when writing document $d$ she seeks to increase the response $y_d$ while keeping the contents of $d$, $\theta_d$, "close" to her preferences $\eta_a$. We encode her objectives as a utility function to be maximized with respect to $\theta_d$:

$$U(\theta_d) = \kappa_a y_d - \tfrac{1}{2} \lVert \theta_d - (\eta_a + \epsilon_{d,a}) \rVert_2^2 \tag{1}$$

where $\epsilon_{d,a}$ is an author-paper-specific idiosyncratic randomness that is unobserved to us but assumed known to the author. (This is a common assumption in discrete choice models. It is often called a "random utility model.") Notice the tradeoff between maximizing the response $y_d$ and staying close to one's preferences. We capture these competing objectives by formulating the latter as a squared Euclidean distance between $\eta_a$ and $\theta_d$, and encoding the tradeoff between extrinsic (citation-seeking) and intrinsic (preference-satisfying) objectives as the (positive) coefficient $\kappa_a$. If $\kappa_a$ is large, $a$ might be understood as a citation-maximizing agent; if $\kappa_a$ is small, $a$ might appear to care much more about certain kinds of papers ($\eta_a$) than about citation.

This utility function considers only two particular facets of author writing behavior; it does not take into account other factors that may contribute to an author's objective. For this reason, some care is required in interpreting quantities like $\kappa_a$. For example, divergence between a particular $\eta_a$ and $\theta_d$ might suggest that $a$ is open to new topics, not merely hungry for citations. Other motivations, such as reputation (notoriously difficult to measure), funding maintenance, and the preferences of peer referees are not captured in this model. Similarly for preferences $\eta_a$, a large value in this vector might reflect $a$'s skill or the preferences of $a$'s sponsors rather than $a$'s personal interest in the topic.

Next, we model the response $y_d$. We assume that responses are driven largely by topics, with some noise, so that

$$y_d = \beta^\top \theta_d + \xi_d \tag{2}$$

where $\xi_d \sim \mathcal{N}(0, 1)$. Because the community's interest in different topics varies over time, $\beta$ is given temporal dynamics, discussed in §3.4. Under this assumption, the author's expected utility, assuming she is aware of $\beta$ (often called "rational expectations" in discrete choice models), is:

$$\mathbb{E}[U(\theta_d)] = \kappa_a \beta^\top \theta_d - \tfrac{1}{2} \lVert \theta_d - (\eta_a + \epsilon_{d,a}) \rVert_2^2 \tag{3}$$

(This is obtained by plugging the expected value of $y_d$, from Eq. 2, into Eq. 1.) An author's decision will therefore be

$$\hat{\theta}_d = \arg\max_{\theta} \mathbb{E}[U(\theta)] \tag{4}$$

Optimality implies that $\hat{\theta}_d$ solves the first-order equations

$$\kappa_a \beta_j - (\theta_{d,j} - (\eta_{a,j} + \epsilon_{d,a,j})) = 0, \quad \forall\, 1 \le j \le K \tag{5}$$

Eq. 5 highlights the tradeoff the author faces: when $\beta_j > 0$, the author will write more on $\theta_{d,j}$, while straying too far from $\eta_{a,j}$ incurs a penalty.
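As an illustration of the single-author first-order condition (Eq. 5), the sketch below solves it in closed form and checks numerically that the solution maximizes the expected utility. It assumes the expected utility is the reward $\kappa_a \beta^\top \theta_d$ minus half the squared distance between $\theta_d$ and $\eta_a$ plus the idiosyncratic term, which is the quadratic form consistent with Eq. 5. The function names and the numeric values are hypothetical and are not taken from our experiments.

```python
def optimal_topics(eta, eps, beta, kappa):
    """Solve kappa * beta_j - (theta_j - (eta_j + eps_j)) = 0 for every j."""
    return [e + n + kappa * b for e, n, b in zip(eta, eps, beta)]


def expected_utility(theta, eta, eps, beta, kappa):
    """Assumed form: kappa * beta^T theta - 0.5 * ||theta - (eta + eps)||^2."""
    reward = kappa * sum(b * t for b, t in zip(beta, theta))
    cost = 0.5 * sum((t - (e + n)) ** 2 for t, e, n in zip(theta, eta, eps))
    return reward - cost


# Hypothetical author with K = 3 topics (illustrative values only).
eta = [0.5, -1.0, 0.2]    # preferences
eps = [0.0, 0.1, -0.1]    # idiosyncratic term, known to the author
beta = [1.0, -0.5, 0.0]   # community response coefficients
kappa = 2.0               # tradeoff coefficient

theta_hat = optimal_topics(eta, eps, beta, kappa)
best = expected_utility(theta_hat, eta, eps, beta, kappa)
```

With these values the author moves toward topic 1 (positive $\beta_1$), away from topic 2 (negative $\beta_2$), and stays at her own preference on topic 3, where the community is indifferent.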

Modeling Coauthorship
Matters become more complicated when multiple authors write a paper together. Suppose the document $d$ is authored by a set of authors $a_d$. We model the joint expected utility of $a_d$ in writing $\theta_d$ as the average of the group's utility, where the "cost" term is scaled by $c_{d,a}$, denoting the fractional "contribution" of author $a$ to document $d$. Thus, $\sum_{a \in a_d} c_{d,a} = 1$, and we treat $c_d$ as a latent categorical distribution to be inferred. The first-order equation changes accordingly (Eq. 7).

Modeling Document Content
As noted before, there are many possible ways to represent and model document content $\theta_d$. We treat $\theta_d$ as (an encoding of) a mixture of topics. Following considerable past work, a "topic" is defined as a categorical distribution over observable tokens (Blei et al., 2003; Hofmann, 1999). Let $w_d$ be the observed bag of tokens constituting document $d$. We assume each token is drawn from a mixture over topics:

$$p(w_d \mid \theta_d) = \prod_{i=1}^{N_d} \sum_{k=1}^{K} p(z_{d,i} = k \mid \theta_d)\, p(w_{d,i} \mid \phi_k)$$

where $N_d$ is the number of tokens in document $d$, $z_{d,i}$ is the topic assignment for $d$'s $i$th token $w_{d,i}$, and $\phi_1, \ldots, \phi_K$ are topic-term distributions. Note that $\theta_d \in \mathbb{R}^K$; we define $p(z \mid \theta_d)$ as a categorical draw from the softmax-transformed $\theta_d$ (Blei and Lafferty, 2007). Using topic mixtures instead of a bag of words provides us with a low-dimensional interpretable representation that is useful for analyzing authors' behaviors and preferences. Each dimension $j$ of an author's preference is grounded in topic $j$. If we ignore document responses, this component of the model closely resembles the author-topic model (Rosen-Zvi et al., 2004), except that we assume a different prior for the topic mixtures.
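The token mixture described above can be sketched as follows. This is a minimal illustration, not our implementation: it assumes tokens are integer vocabulary indices and that each $\phi_k$ is a list of probabilities, and the function names and toy values are hypothetical.

```python
import math


def softmax(theta):
    """Map an unconstrained vector in R^K to a point on the simplex."""
    peak = max(theta)  # subtract the maximum for numerical stability
    exps = [math.exp(t - peak) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def document_log_likelihood(tokens, theta, phi):
    """Sum over tokens of log sum_k softmax(theta)_k * phi[k][token]."""
    mix = softmax(theta)
    log_lik = 0.0
    for w in tokens:
        log_lik += math.log(sum(m * phi_k[w] for m, phi_k in zip(mix, phi)))
    return log_lik


# Toy setting: K = 2 topics over a vocabulary of 3 types.
phi = [[0.7, 0.2, 0.1],
       [0.1, 0.2, 0.7]]
tokens = [0, 2, 2]
ll_uniform = document_log_likelihood(tokens, [0.0, 0.0], phi)
ll_topic2 = document_log_likelihood([2, 2, 2], [0.0, 2.0], phi)
ll_topic1 = document_log_likelihood([2, 2, 2], [2.0, 0.0], phi)
```

Because $\theta_d$ is unconstrained, a Gaussian distribution over it is well defined, while the softmax keeps the per-token topic probabilities on the simplex.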

Modeling Temporal Dynamics
Individual preferences shift over time, as do those of the research community. We extend our model to allow variation at different timesteps. Let $t \in \{1, \ldots, T\}$ index timesteps (in our experiments, each $t$ is a calendar year). We let $\beta^{(t)}$, $\eta_a^{(t)}$, and $\kappa_a^{(t)}$ denote the community's response coefficients, author $a$'s preferences, and author $a$'s tradeoff coefficient at timestep $t$.
Again, we must take care in interpreting these quantities. Do changes in community interest drive authors to adjust their preferences or expertise? Or do changing author preferences aggregate into community-wide shifts? Or do changes in the economy or funding availability change authors' tradeoffs? Our model cannot differentiate among these different causal patterns. Our method is useful for tracking these changes, but it does not provide an explanation for why they take place.
Modeling the temporal dynamics of a vector-valued random variable can be accomplished using a multivariate Gaussian distribution. Following Yogatama et al. (2011), we assume a zero-mean multivariate Gaussian prior over the sequence of coefficients $\beta_j^{(1)}, \ldots, \beta_j^{(T)}$ for each dimension $j$. Its two hyperparameters $\alpha$ and $\lambda$ capture, respectively, autocorrelation (the tendency of $\beta_j^{(t+1)}$ to be similar to $\beta_j^{(t)}$) and overall variance. This approach to modeling time series allows us to capture temporal dynamics while sharing statistical strength of evidence across all time steps.
We use the notation $\mathcal{T}(\lambda, \alpha) \equiv \mathcal{N}(0, \Lambda(\lambda, \alpha))$ for this multivariate Gaussian distribution, instances of which are used as priors over response coefficients $\beta$, author preferences $\eta_a$, and (transformed) author tradeoffs $\log \kappa_a$.

Full Model
Table 1 summarizes all of the notation. The log-likelihood of our model is: [...]
We adopt a Bayesian approach to parameter estimation. The generative story, including all priors, is as follows. Recall that $\mathcal{T}(\cdot, \cdot)$ denotes the time series prior discussed in §3.4. See also the plate diagram for the graphical model in Fig. 1.
1. For each topic $k \in \{1, \ldots, K\}$:
   (a) Draw response coefficients $\beta_k$ from the time series prior $\mathcal{T}$.
   [...]
   This is known as a logistic normal distribution (Aitchison, 1986).
   (b) Draw $d$'s topic distributions $\theta_d$ (Eq. 9; this distribution is discussed further below).
   [...] note that it collapses out $\xi_d$, which is drawn from a standard normal.
Eq. 9 captures the choice by authors $a_d$ of a distribution over topics $\theta_d$. Assuming that the $\epsilon_{d,a}$ are i.i.d. and Gaussian, Eq. 7 implies that $\theta_d$ is itself Gaussian-distributed.
In §3.1 we described a utility function for each author. The model we are estimating is similar to those estimated in discrete choice econometrics (McFadden, 1974). We assumed that authors are utility maximizing (optimizing) and that their optimal topic distribution satisfies the first-order conditions (Eq. 7). However, we cannot see the idiosyncratic component, $\epsilon_{d,a}$, which is assumed to be Gaussian; as noted, this is known as a random utility model. Together, these assumptions give the structure of the distribution over topics in terms of (estimated) utility, which allows us to naturally incorporate the utility function into our probabilistic model in a familiar way (Sim et al., 2015).

Learning and Inference
Exact inference in our model is intractable, so we resort to an approximate inference technique based on Monte Carlo EM (Wei and Tanner, 1990). During the E-step, we perform Bayesian inference over latent parameters (η, κ, z, θ, c, φ) using a Metropolis-Hastings within Gibbs algorithm (Tierney, 1994), and in the M-step, we compute maximum a posteriori estimates of β by directly optimizing the log-likelihood function. Since we are using conjugate priors for φ, we can integrate it out. We did not perform Bayesian posterior inference over β because the coupling of β would slow mixing of the MCMC chain.
E-step. We sample the continuous latent variables (including each $\eta_a$, $\kappa_a$, and $\theta_d$) blockwise using the Metropolis-Hastings algorithm with a multivariate Gaussian proposal distribution, tuning the diagonal covariance matrix to a target acceptance rate of 15-45% (see appendix §A for sampling equations).
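The blockwise update described above can be sketched as a random-walk Metropolis-Hastings step with a diagonal Gaussian proposal, plus a simple rule that rescales the proposal until the empirical acceptance rate falls in the 15-45% band. This is an illustrative sketch under our own assumptions: the multiplicative tuning rule, the batch size, and the standard normal stand-in target are hypothetical choices, not the sampling equations of appendix §A.

```python
import math
import random


def metropolis_hastings(log_target, init, scales, n_iter, rng):
    """Random-walk MH over one block; `scales` are per-dimension proposal std devs.

    Returns the visited states and the empirical acceptance rate.
    """
    x = list(init)
    log_p = log_target(x)
    samples, accepted = [], 0
    for _ in range(n_iter):
        proposal = [xi + rng.gauss(0.0, s) for xi, s in zip(x, scales)]
        log_p_new = log_target(proposal)
        log_ratio = log_p_new - log_p
        # Symmetric proposal: accept with probability min(1, target ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x, log_p = proposal, log_p_new
            accepted += 1
        samples.append(list(x))
    return samples, accepted / n_iter


def tune_scales(log_target, init, scales, rng, low=0.15, high=0.45,
                rounds=30, batch=300):
    """Shrink or grow the proposal until the acceptance rate is in [low, high]."""
    scales, x, rate = list(scales), list(init), 0.0
    for _ in range(rounds):
        samples, rate = metropolis_hastings(log_target, x, scales, batch, rng)
        x = samples[-1]
        if rate < low:
            scales = [s * 0.7 for s in scales]   # steps too bold: shrink
        elif rate > high:
            scales = [s * 1.4 for s in scales]   # steps too timid: grow
        else:
            break
    return scales, rate


def log_standard_normal(x):
    """Stand-in target (unnormalized log density); the model's conditionals differ."""
    return -0.5 * sum(v * v for v in x)


rng = random.Random(0)
tuned_scales, tuned_rate = tune_scales(
    log_standard_normal, [0.0, 0.0], [50.0, 50.0], rng)
```

In the full sampler, `log_target` would be the conditional log density of the block being updated, given the current values of all other variables.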
For $z$, we integrate out $\phi$ and sample each $z_{d,i}$ directly from its conditional distribution, which depends on the number of times $w$ is associated with topic $k$ and the number of tokens associated with topic $k$.
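A sketch of this collapsed update is given below. It assumes the standard form obtained when a symmetric Dirichlet prior on $\phi$ is integrated out: the probability of topic $k$ is proportional to the softmax weight of $\theta_{d,k}$ times the smoothed ratio of the two counts. The count containers, argument names, and toy values are hypothetical; the exact sampling equation is not reproduced here, and the counts are assumed to exclude the token being resampled.

```python
import math
import random


def topic_conditional(word, theta, topic_word_counts, topic_counts, rho, vocab_size):
    """Assumed form: p(z = k) proportional to
    softmax(theta)_k * (n_kw + rho) / (n_k + V * rho)."""
    peak = max(theta)
    weights = []
    for k, t in enumerate(theta):
        prior = math.exp(t - peak)  # unnormalized softmax weight
        n_kw = topic_word_counts[k].get(word, 0)
        word_term = (n_kw + rho) / (topic_counts[k] + vocab_size * rho)
        weights.append(prior * word_term)
    total = sum(weights)
    return [w / total for w in weights]


def sample_topic(probs, rng):
    """Draw a topic index from a categorical distribution."""
    u, cumulative = rng.random(), 0.0
    for k, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return k
    return len(probs) - 1  # guard against floating-point round-off


# Toy counts for K = 2 topics and V = 4 vocabulary types, with rho = 1/V.
topic_word_counts = [{"parsing": 3}, {"translation": 2}]
topic_counts = [3, 2]
probs = topic_conditional("parsing", [0.0, 0.0], topic_word_counts,
                          topic_counts, rho=0.25, vocab_size=4)
z_new = sample_topic(probs, random.Random(0))
```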
We run the E-step Gibbs sampler to collect 3,500 samples, discarding the first 500 samples for burn-in and only saving samples at every third iteration.
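The burn-in and thinning arithmetic can be made concrete with a small sketch. It reads "3,500 samples" as 3,500 sampler iterations, which is an assumption about the wording; the helper names are hypothetical.

```python
def retained_iterations(total, burn_in, thin):
    """Indices of iterations kept after discarding burn-in and thinning."""
    return list(range(burn_in, total, thin))


def iterations_needed(n_saved, burn_in, thin):
    """Iterations required to end up with exactly `n_saved` thinned samples."""
    return burn_in + thin * (n_saved - 1) + 1


kept = retained_iterations(total=3500, burn_in=500, thin=3)
n_kept = len(kept)
```

Under this reading, 1,000 samples are retained per E-step. The second helper covers the case where a fixed number of saved samples is wanted rather than a fixed number of iterations.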

M-step.
We approximate the expectations of our latent variables using the samples collected during the E-step, and directly optimize $\beta^{(t)}$ using L-BFGS (Liu and Nocedal, 1989), which requires the gradient of the log-likelihood with respect to $\beta$. We ran L-BFGS until convergence and slice sampled the hyperparameters $\lambda^{(\eta)}$, $\alpha^{(\eta)}$, $\lambda^{(\kappa)}$, $\alpha^{(\kappa)}$ (with vague priors) at the end of the M-step. We fixed the symmetric Dirichlet hyperparameter $\rho = 1/V$, and tuned $\lambda^{(\beta)}$, $\alpha^{(\beta)}$ on a held-out development dataset by grid search over $\{0.01, 0.1, 1, 10\}$. During initialization, we randomly set the topic assignments, while the other latent parameters are set to 0. We ran the model for 10 EM iterations.
Inference. During inference, we fix the model parameters and only sample $(\theta, z)$ for each document. As in the E-step, we discard the first 500 samples, and save samples at every third iteration, until we have 500 posterior samples. In our experiments, we found the posterior samples to be reasonably stable after the initial burn-in.

Experiments
Data. The ACL Anthology Network Corpus contains 21,212 papers published in the field of computational linguistics between 1965 and 2013 and written by 17,792 authors. Additionally, the corpus provides metadata such as authors, venue and in-community citation networks. For our experiments, we focused on conference papers published between 1980 and 2010. We tokenized the texts, tagged the tokens using the Stanford POS tagger (Toutanova et al., 2003), and extracted n-grams with tags that follow the simple (but effective) pattern of (Adj|Noun)* Noun (Justeson and Katz, 1995), representing the $d$th document as a bag of phrases ($w_d$). Note that phrases can also be unigrams. We pruned phrases that appear in < 1% or > 95% of the documents, obtaining a vocabulary of $V$ = 6,868 types. The pruned corpus contains 5,498 documents and 2,643,946 phrase tokens written by 5,575 authors. We let responses $y_d = \log(1 + \text{number of incoming citations in 3 years})$. For our experiments, we used 3 different random splits of our data (70% train, 20% test, and 10% development) and averaged quantities of interest. Furthermore, we remove an author from a paper in the development or test set if we have not seen that author in the training data.

Examples of Authors and Topics
Table 2 illustrates ten manually selected topics (out of 64) learned by the author utility model. Each topic is labeled with the top 10 words most likely to be generated conditioned on the topic ($\phi_k$). For each topic, we compute an author's topic preference score (TPS). The TPS scales the author's $\eta$ preferences by the relative number of citations that the author received for the topic. This way, we can account for different $\eta$s over time, and reduce variance due to authors who publish less frequently (footnote 5). For each topic, the five authors with the highest TPS are displayed in the rightmost column of Table 2. These topics were among the roughly one third (out of 64) that seemed to coherently map to research topics within NLP. Some others corresponded to parts of a paper (e.g., explaining notation and formulae, experiments) or to stylistic groups (e.g., "rational words" including rather, fact, clearly, argue, clear, perhaps). Others were not interpretable to us.

Predicting Responses
We compare against two baselines for predicting in-community citations. Yogatama et al. (2011) is a strong baseline for predicting responses; they incorporated n-gram features and metadata features in a generalized linear model with the time series prior discussed in §3.4. 6 We also compare against a version of our model without the author utility component. This equates to replacing Yogatama et al.'s features with LDA topic mixtures, and performing joint learning of the topics and citations; we therefore call it "TimeLDA." Without the time series component, TimeLDA would instantiate supervised LDA (McAuliffe and Blei, 2008). Figure 2 shows the mean absolute error (MAE) for the three models.
With sufficiently many topics ($K \ge 16$), topic representations achieve lower error than surface features. Removing the author utility component from our model leads to better predictive performance. This is unsurprising, since our model forces $\beta$ to explain both the responses (what is evaluated here) and the divergence between author preferences $\eta_a$ and what is actually written. The utility model is nonetheless competitive with the Yogatama et al. baseline.

Figure 2: Mean absolute error (MAE) for predicted citation counts (y-axis) against the number of topics $K$ (x-axis). Errors are in actual citation counts, while the models are trained with log counts. TimeLDA significantly outperforms Yogatama et al. (2011) for $K \ge 64$ (paired t-test, $p < 0.01$), while the differences between Yogatama et al. (2011) and author utility are not significant. The MAE is calculated over 3 random splits of the data with 809, 812, and 811 documents in the test set respectively.

5 The TPS is only a measure of an author's propensity to write papers in a specific topic area and is not meant to be a measure of an author's reputation in a particular research sub-field.
6 For the ACL dataset, Yogatama et al. (2011)'s model predicts whether a paper will receive at least 1 citation within three years, while here, we train it to predict log(1 + #citations) instead.
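Since the models are trained on log counts while errors are reported in actual citation counts, predictions must be mapped back before the MAE is computed. A minimal sketch of that evaluation follows; the function names and the toy numbers are hypothetical and are not results from our experiments.

```python
import math


def to_log_response(citations):
    """Response used for training: log(1 + number of citations)."""
    return math.log1p(citations)


def to_citations(log_response):
    """Invert the training transform to recover a citation count."""
    return math.expm1(log_response)


def mean_absolute_error(predicted_log, true_counts):
    """MAE in actual citation counts for predictions made on the log scale."""
    errors = [abs(to_citations(p) - c)
              for p, c in zip(predicted_log, true_counts)]
    return sum(errors) / len(errors)


# Made-up predictions corresponding to 1, 3, and 7 citations.
true_counts = [0, 3, 10]
predicted_log = [to_log_response(1), to_log_response(3), to_log_response(7)]
mae = mean_absolute_error(predicted_log, true_counts)
```

Here the absolute errors are 1, 0, and 3 citations, giving an MAE of 4/3.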

Predicting Words
To answer the question "Given a set of authors, what are they likely to write?", we use perplexity as a proxy to measure the content predictive ability of our model. Perplexity on a test set is commonly used to quantify the generalization ability of probabilistic models and make comparisons among models over the same observation space. For a document $w_d$ written by authors $a_d$, perplexity is defined as

$$\mathrm{perplexity}(w_d \mid a_d) = \exp\left(-\frac{\log p(w_d \mid a_d)}{N_d}\right)$$

and a lower perplexity indicates better generalization performance. Using $S$ samples from the inference step, we can compute a Monte Carlo estimate of $p(w_d \mid a_d)$, where $\theta^s$ is the $s$th sample of $\theta$, and $\phi^s$ is the topic-word distribution estimated from the $s$th sample of $z$.
We compared against the Author-Topic (AT) model of Rosen-Zvi et al. (2004). The AT model is similar to setting $\kappa_a = 0$ for all authors, $c_d = \frac{1}{|a_d|}$, and using a Dirichlet prior instead of logistic normal on $\eta_a$. Figure 3 presents the perplexity of these models at different values of $K$. We include a version of our author utility model that ignores temporal information ("-time"), i.e., setting $T = 1$ and collapsing all timesteps. We find that perplexity improves with the addition of the utility model as well as the temporal dynamics.
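The perplexity computation used in this section can be sketched as follows. The sketch assumes the document likelihood is estimated by averaging, over posterior samples, the product of per-token mixture probabilities; that averaging form, the function names, and the toy inputs are our assumptions for illustration.

```python
import math


def softmax(theta):
    peak = max(theta)
    exps = [math.exp(t - peak) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def sample_log_likelihood(tokens, theta, phi):
    """log p(w_d | theta^s, phi^s) under the topic mixture for one sample."""
    mix = softmax(theta)
    return sum(math.log(sum(m * phi_k[w] for m, phi_k in zip(mix, phi)))
               for w in tokens)


def perplexity(tokens, theta_samples, phi_samples):
    """exp(-log p(w_d) / N_d), with p(w_d) averaged over the S samples."""
    log_liks = [sample_log_likelihood(tokens, th, ph)
                for th, ph in zip(theta_samples, phi_samples)]
    peak = max(log_liks)  # log-mean-exp for numerical stability
    log_mean = peak + math.log(
        sum(math.exp(l - peak) for l in log_liks) / len(log_liks))
    return math.exp(-log_mean / len(tokens))


# Sanity check: uniform topics over V = 4 types give perplexity V.
uniform_phi = [[0.25] * 4, [0.25] * 4]
ppl_uniform = perplexity([0, 1, 2, 3],
                         [[0.3, -0.2], [1.0, 0.0]],
                         [uniform_phi, uniform_phi])

# A model concentrated on the observed word is far less "perplexed".
peaked_phi = [[0.97, 0.01, 0.01, 0.01]] * 2
ppl_peaked = perplexity([0, 0, 0], [[0.0, 0.0]], [peaked_phi])
```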

Exploration: Tradeoffs and Seniority
Recall that $\kappa_a$ encodes author $a$'s tradeoff between increasing citations (high $\kappa_a$) and writing papers on topics $a$ prefers (low $\kappa_a$). We do not claim that individual $\kappa_a$ values consistently represent authors' tradeoffs between citations and writing about preferred topics. We have noted a number of potentially confounding factors that affect authors' choices, for which our data do not allow us to control.
However, in aggregate, $\kappa_a$ values can be explored in relation to other quantities. Given our model's posterior, one question we can ask is: do an author's tradeoffs tend to change over the course of her career? In Figure 4, we plot the median of $\kappa$ (and 95% credible intervals) for authors at different "ages." Here, "age" is defined as the number of years since an author's first publication in this dataset. A general trend over the long term is observed: researchers appear to move from higher to lower $\kappa_a$. Statistically, there is significant dependence between $\kappa$ of an author and her age; the Spearman's rank correlation coefficient is $\rho = -0.870$ with p-value $< 10^{-5}$. This finding is consistent with the idea that greater seniority brings increased and more stable resources and greater freedom to pursue idiosyncratic interests with less concern about extrinsic payoff. It is also consistent with decreased flexibility or openness to shifting topics over time.
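For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation of the ranks of the two variables. The sketch below computes it from scratch on made-up values; the numbers are purely illustrative and are not the posterior medians plotted in Figure 4.

```python
import math


def ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ages and median tradeoff values (not data from the paper).
ages = [1, 2, 3, 4, 5]
median_kappa = [5.0, 4.0, 4.5, 2.0, 1.0]
rho = spearman(ages, median_kappa)
```

A value near -1 indicates that the tradeoff coefficient decreases almost monotonically with age; the statistic depends only on the ordering of the values, not on their magnitudes.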
To illustrate the importance of our model in making these observations, we also plot the mean number of citations per paper published (across all authors) against their academic age (magenta lines). There is no clear statistical trend between the two variables ($\rho = -0.017$). This suggests that through $\kappa$, our model is able to pick up evidence of authors' optimizing behaviors, which is not possible using simple citation counts.
There is a noticeable effect during years 5-10, in which κ tends to rise by around 40% and then fall back. (Note that the model maintains considerable uncertainty-wider intervals-about this effect.) Recall that, for a researcher trained within the field and whose primary publication venue is in the ACL community, our measure of age corresponds roughly to academic age. Years 5-10 would correspond to the later part of a Ph.D. program and early postgraduate life, when many researchers begin faculty careers. Insofar as it reflects a true effect, this rise and fall suggests a stage during which a researcher focuses more on writing papers that will attract citations. However, more in-depth study based on data that is not merely observational is required to quantify this effect and, if it persists under scrutiny, determine its cause.
The effect in year 24 of mean citations per paper (magenta line) can be attributed to well-cited papers co-authored by senior researchers in the field who published very few papers in their 24th year. Since there are relatively few authors in the dataset at that academic age, there is more variance in mean citation counts.

Related Work
Previous work on modeling author interests mostly focused on characterizing authors by their style (Holmes and Forsyth, 1995, inter alia), through latent topic mixtures of documents they have co-authored (Rosen-Zvi et al., 2004) and their collaboration networks (Johri et al., 2011). Like our paper, the latter two are based on topic models, which have been popular for modeling the content of scientific articles. For instance, Gerrish and Blei (2010) measured scholarly impact using dynamic topic models, while Hall et al. (2008) analyzed the output of topic models to study the "history of ideas." Predicting responses to scientific articles was explored in two shared tasks at KDD Cup 2003 (Brank and Leskovec, 2003; McGovern et al., 2003) and by Yogatama et al. (2011), which served as a baseline for our experiments and whose time series prior we used in our model. Furthermore, there has been considerable research using topic models to predict (or recommend) citations (instead of aggregate counts), such as modeling link probabilities within the LDA framework (Cohn and Hofmann, 2000; Erosheva et al., 2004; Nallapati and Cohen, 2008; Kataria et al., 2010; Zhu et al., 2013) and augmenting topics with discriminative author features (Liu et al., 2009; Tanner and Charniak, 2015).
We modeled both interests of authors and responses to their articles jointly, by assuming authors' text production is an expected utility-maximizing decision. This approach is similar to our earlier work (Sim et al., 2015), where authors are rational agents writing texts to maximize the chance of a favorable decision by a judicial court. In that study, we did not consider the unique preferences of each decision-making agent, nor the extrinsic-intrinsic reward tradeoffs that these agents face when authoring a document.
Our utility model can also be viewed as a form of natural language generator, where we take into account the context of an author (i.e., his preferences, the tradeoff coefficient, and what is popular) to generate his document. This is related to natural language pragmatics, where text is influenced by context. Hovy (1990) approached the problem of generating text under pragmatic circumstances from a planning and goal-orientation perspective, while Vogel et al. (2013) used multiagent decision-theoretic models to show cooperative pragmatic behavior. Vogel et al.'s models suggest an interesting extension of ours for future work: modeling cooperation among co-authors and, perhaps, in the larger scientific discourse.

Conclusions
We presented a model of scientific authorship in which authors trade off between seeking citation by others and staying true to their individual preferences among research topics. We find that topic modeling improves over state-of-the-art text regression models for predicting citation counts, and that the author utility model generalizes better than simpler models when predicting what a particular group of authors will write. Inspecting our model suggests interesting patterns in behavior across a researcher's career.