Friends with Motives: Using Text to Infer Influence on SCOTUS

We present a probabilistic model of the influence of language on the behavior of the U.S. Supreme Court, specifically influence of amicus briefs on Court decisions and opinions. The approach assumes that amici are rational, utility-maximizing agents who try to win votes or affect the language of court opinions. Our model leads to improved predictions of justices' votes and perplexity of opinion language. It is amenable to inspection, allowing us to explore inferences about the persuasiveness of different amici and influenceability of different justices; these are consistent with earlier findings.


Introduction
The Supreme Court of the United States (SCOTUS), the highest court in the American judiciary, makes decisions with far-reaching effects. In a typical case, there are four participating parties: petitioners and respondents who file briefs arguing the merits of their sides of a case ("merits briefs"); third-party entities with an interest (but not a direct stake) in the case, who file amicus curiae¹ briefs to provide further arguments and recommendations on either side; and justices who, after oral arguments and discussions, vote on the case and write "opinions" to explain the Court's decisions. In recent years, amicus briefs have increasingly been employed as a lobbying tool to influence the Court's decision-making process (Franze and Anderson, 2015; Kearney and Merrill, 2000). The content of these briefs reveals explicit attempts to persuade justices and provides a fascinating setting for empirical study of influence through language. As such, we take the perspective of an amicus, proposing a probabilistic model of the various parties to a case that accounts for the amicus' goals.
¹ Amicus curiae is Latin for "friend of the court." Hereafter, we use amicus in singular and amici in plural to refer to these interested third parties. It is common for several amici to coauthor a single brief, which we account for in our model.
Our model of SCOTUS is considerably more comprehensive than past work in political science, which has focused primarily on ideal point models that use votes as evidence. Text has been incorporated more recently as a way of making such models more interpretable, but without changing the fundamental assumptions (Lauderdale and Clark, 2014). Here, we draw on decision theory to posit amici as rational agents. We assume these amici-agents maximize their expected utility by framing their arguments to sway justices towards favorable outcomes.
We build directly on Sim et al. (2015), who used utility functions to explicitly model the goals of amici in a probabilistic setting. Their approach only considered amici in aggregate, inferring nothing about any specific amicus, such as experience or motivation for filing briefs. Here, we enrich their model to allow such analysis and also introduce Court opinions as evidence. By modeling the justices' authoring process as well, we can capture an important aspect of amici's goals: influencing the text of the opinions.
In §3, we demonstrate the effectiveness of our approach on vote prediction and perplexity. Furthermore, we present analyses that reveal the persuasiveness of amici and influenceability of justices that are consistent with past findings.

Generative Models of SCOTUS
Our approach builds on a series of probabilistic models only recently considered in NLP research. To keep the discussion self-contained, we begin with classical models of votes alone and build up toward our novel contributions.

Modeling Votes
Ideal point (IP) models are a mainstay in quantitative political science, often applied to voting records to place voters (lawmakers, justices, etc.) in a continuous space. A justice's "ideal point" is a latent variable positioning the justice in this space. Martin and Quinn (2002) introduced the unidimensional IP model for judicial votes, which posits an IP $\psi_j \in \mathbb{R}$ for each justice $j$. Often the $\psi_j$ values are interpreted as positions along a liberal-conservative ideological spectrum. Each case $i$ is represented by popularity ($a_i$) and polarity ($b_i$) parameters.³ A probabilistic view of the unidimensional IP model is that justice $j$ votes in favor of case $i$'s petitioner (as opposed to the respondent) with probability
$$p(v_{i,j} = \text{petitioner} \mid \psi_j, a_i, b_i) = \sigma(a_i + \psi_j b_i),$$
where $\sigma(x) = \frac{\exp(x)}{1 + \exp(x)}$ is the logistic function. When the popularity parameter $a_i$ is high enough, every justice is more likely to favor the petitioner. The polarity $b_i$ captures the importance of a justice's ideology: polarizing cases (i.e., $|b_i| \gg 0$) push justice $j$ more strongly to the side of the petitioner (if $b_i$ has the same sign as $\psi_j$) or the respondent (otherwise).
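The unidimensional IP model can be sketched in a few lines. The snippet below is a minimal illustration that assumes the standard two-parameter logistic form $\sigma(a_i + \psi_j b_i)$; the function names and the numeric values are hypothetical and are not taken from the paper's data.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function exp(x) / (1 + exp(x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def petitioner_vote_prob(psi_j: float, a_i: float, b_i: float) -> float:
    """P(justice j votes for the petitioner in case i), assuming the
    two-parameter logistic form sigmoid(a_i + psi_j * b_i)."""
    return sigmoid(a_i + psi_j * b_i)


# Hypothetical values: two justices on opposite sides of the spectrum,
# a polarizing case (large |b_i|) with neutral popularity (a_i = 0).
p_same_sign = petitioner_vote_prob(psi_j=1.5, a_i=0.0, b_i=2.0)
p_opposite_sign = petitioner_vote_prob(psi_j=-1.5, a_i=0.0, b_i=2.0)
```

With $a_i = 0$, the justice whose ideal point shares the sign of $b_i$ leans toward the petitioner and the other leans toward the respondent, which is the behavior the paragraph describes.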
Amici IP models. Sim et al. (2015) introduced a multidimensional IP model that incorporated text from merits and amicus briefs as evidence. They inferred dimensions of IP that are grounded in "topical" space, where topics are learned using latent Dirichlet allocation (Blei et al., 2003). In their proposed model, the merits briefs describe the issues and facts of the case, while amicus briefs were hypothesized to "frame" the facts and potentially influence the outcome of the case. For case $i$ and justice $j$, the vote probability is
$$p(v_{i,j} = \text{petitioner} \mid \psi_j, \theta_i, \Delta_i, a_i, b_i, c_i) = \sigma\Big(a_i + \psi_j^\top \Big(b_i \theta_i + \frac{1}{|A_i|} \sum_{k \in A_i} c_i^{s_{i,k}} \Delta_{i,k}\Big)\Big), \tag{1}$$
where $A_i$ is the set of amicus briefs filed on this case, $s_{i,k}$ denotes the side ($\in \{\text{petitioner}, \text{respondent}\}$) supported by the $k$th brief, and $c_i^{p}$ and $c_i^{r}$ are the amicus polarities for briefs on either side. The case IP is influenced by merits briefs (embedded in $\theta_i$) and by the amicus briefs (embedded in $\Delta_{i,k}$), both of which are rescaled independently by the case discrimination parameters to generate the vote probability. The model assumes that briefs on the same side share a single embedding and that individual briefs on one side influence the vote-specific IP equally.
³ This model is also known as a two parameter logistic model in item-response theory (Fox, 2010), where $a_i$ is "difficulty" and $b_i$ is "discrimination."
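New IP model: Persuasive amici. Lynch (2004) and others have argued that some amici are more effective than others, with greater influence on justices. We therefore propose a new model which considers amici as individual actors. Starting from Eq. 1, we consider two additional variables: each amicus $e$'s persuasiveness ($\pi_e > 0$) and each justice $j$'s influenceability ($\chi_j > 0$). The vote probability becomes
$$p(v_{i,j} = \text{petitioner} \mid \psi_j, \chi_j, \theta_i, \Delta_i, a_i, b_i, \bar{\pi}_i) = \sigma\Big(a_i + b_i \psi_j^\top \Big(\theta_i + \frac{\chi_j}{|A_i|} \sum_{k \in A_i} \bar{\pi}_{i,k} \Delta_{i,k}\Big)\Big), \tag{2}$$
where $\bar{\pi}_{i,k} = \frac{1}{|E_{i,k}|} \sum_{e \in E_{i,k}} \pi_e$ is the average of the coauthors' $\pi$ values, with $E_{i,k}$ denoting the set of entities who coauthored the $k$th amicus brief for case $i$.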
Intuitively, a larger value of $\chi_j$ will shift the case IP more towards the contents of the amicus briefs, thus making the justice seem more "influenced" by amici. Likewise, briefs co-authored by groups of amici who are more effective (i.e., larger $\bar{\pi}_{i,k}$) will "frame" the case towards their biases. Unlike Sim et al. (2015), we eschew the amicus polarity parameters ($c_i$) and instead rely on the influenceability and persuasiveness parameters. Furthermore, we note that they performed a post-hoc analysis of amici influence on justices, whereas we do so directly through $\chi_j$.
(e) For each participating justice $j \in \mathcal{J}_i$, draw vote $v_{i,j}$ according to Eq. 2.
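To make the roles of influenceability and persuasiveness concrete, the following sketch computes a vote probability in topic space. It assumes one plausible form of the case IP: the merits embedding plus the $\chi_j$-scaled average of amicus embeddings, each weighted by the mean persuasiveness of its coauthors. The function names, argument layout, and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def persuasive_vote_prob(
    psi_j: Sequence[float],          # justice IP, one value per topic
    chi_j: float,                    # justice influenceability (> 0)
    a_i: float,                      # case popularity
    b_i: float,                      # case polarity
    theta_i: Sequence[float],        # merits-brief topic proportions
    briefs: List[Tuple[Sequence[float], Sequence[float]]],
) -> float:
    """Each brief is (delta_ik, [pi_e for each coauthor e]).

    Assumed case IP: theta_i + chi_j / |A_i| * sum_k mean(pi) * delta_ik.
    """
    case_ip = list(theta_i)
    for delta, coauthor_pis in briefs:
        pi_bar = sum(coauthor_pis) / len(coauthor_pis)
        for t, d in enumerate(delta):
            case_ip[t] += chi_j * pi_bar * d / len(briefs)
    score = sum(p * c for p, c in zip(psi_j, case_ip))
    return sigmoid(a_i + b_i * score)


# Hypothetical two-topic case with a single brief coauthored by two amici.
psi = [1.0, -1.0]
theta = [0.5, 0.5]
brief = ([0.9, 0.1], [2.0, 1.0])
p_ignores_amici = persuasive_vote_prob(psi, 0.0, 0.0, 1.0, theta, [brief])
p_influenced = persuasive_vote_prob(psi, 1.0, 0.0, 1.0, theta, [brief])
```

Setting `chi_j` to zero recovers a merits-only model, so the gap between the two probabilities isolates the contribution attributed to the amicus brief under this assumed form.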

Modeling Opinions
In most SCOTUS cases, a justice is assigned to author a majority opinion, and justices voting in the majority "join" in the opinion. Justices may author additional opinions concurring or dissenting with the majority, and they may choose to join concurring and dissenting opinions written by others. Here, we extend the IP model of votes to generate the opinions of a case; this marks the second major extension beyond the IP model of Sim et al. (2015). SCOTUS justices often incorporate language from merits (Feldman, 2016b; Feldman, 2016a) and amicus (Collins et al., 2015; Ditzler, 2011) briefs into their opinions. While amicus briefs are not usually used directly in legal analyses, the background and technical information they provide are often quoted in opinions. As such, we model each opinion as a mixture of its justice-authors' topic preferences, topic proportions of the merits briefs ($\theta$), and topic proportions of the amicus briefs ($\Delta$). This can also be viewed as an author-topic model (Rosen-Zvi et al., 2004) where justices, litigants, and groups of amici are all effective authors. To accomplish this, we introduce an explicit switching variable $x$ for each word, which selects between the different sources of topics, to capture the mixture proportions.
Since any justice can author additional opinions explaining the rationale behind their votes, we concatenate all opinions supporting the same side of a case into a single document. However, we note that concurring opinions often contain perspectives that differ from the majority opinion, and by concatenating them we may lose some information about individual justices' styles or preferences. Building on the generative model for votes, the generative story for each case $i$'s two opinion documents is:
5. For each justice $j \in \mathcal{J}$, draw topics $\Gamma_j \sim \text{Dirichlet}(\alpha)$.
6. For each case $i \in \mathcal{C}$:
(a) For each side $s \in \{\text{petitioner}, \text{respondent}\}$, draw "author"-mixing proportions $\tau_i^s$ over the participating justices and two further dimensions, where the last two dimensions are for choosing topics from the merits and amicus briefs, respectively.⁷ Intuitively, our model assumes that an opinion will incorporate more language from justices who agree with it.
(b) For each side $s \in \{\text{petitioner}, \text{respondent}\}$ and each word $w_{i,s,n}$ in the opinion for side $s$, draw a switching variable $x$ from the mixing proportions to select a source of topics (a justice, the merits briefs, or the amicus briefs), then draw the word from the topics of that source.
Unlike in the Court, where an opinion is mainly authored by a single justice, all the participating justices contribute to an opinion in our generative story, with different proportions. This approach simplifies the computational model and reflects the closed-door nature of discussions held by justices prior to writing their opinions. Our model assumes that justices debate together, and that the arguments are reflected in the final opinions. In future work, we might extend the model to infer an authoring process that separates an initial author from "joiners."
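The per-word part of this generative story can be illustrated with a small sampler. This is only a sketch of a switching-variable mixture in the style of an author-topic model: the mixing proportions are passed in directly rather than drawn from the paper's prior, and all names and toy distributions are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def sample_opinion_words(
    mix: Dict[str, float],                   # source -> mixing proportion
    source_topics: Dict[str, Sequence[float]],  # source -> topic distribution
    topic_words: List[Dict[str, float]],     # topic -> word distribution
    n_words: int,
    rng: random.Random,
) -> List[Tuple[str, int, str]]:
    """Sample (source, topic, word) triples for one opinion document."""
    sources = list(mix)
    weights = [mix[s] for s in sources]
    out = []
    for _ in range(n_words):
        # Switching variable: which source (a justice, merits, or amicus).
        x = rng.choices(sources, weights=weights)[0]
        # Topic drawn from the chosen source's topic distribution.
        topics = source_topics[x]
        z = rng.choices(range(len(topics)), weights=topics)[0]
        # Word drawn from the chosen topic's word distribution.
        vocab = topic_words[z]
        w = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        out.append((x, z, w))
    return out


# Toy example with two sources and two topics (values are made up).
toy = sample_opinion_words(
    mix={"justice_1": 0.6, "merits": 0.4},
    source_topics={"justice_1": [0.8, 0.2], "merits": [0.3, 0.7]},
    topic_words=[{"speech": 0.7, "expression": 0.3}, {"tax": 1.0}],
    n_words=10,
    rng=random.Random(0),
)
```

Recording the source alongside each word makes it easy to count how much of a sampled opinion is attributed to each justice versus the briefs.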

Amici Utility
Our approach assumes that amici are rational and purposeful decisionmakers who write briefs to influence the outcome of a case; this assumption leads to the design of the distribution over $\Delta$ (generative model step 4(d)i). When writing a brief $\Delta$, an amicus seeks to increase the response to her brief (i.e., votes), while keeping her costs low. We encode her objectives as a utility function, which she aims to maximize with respect to the decision variable $\Delta$:
$$U(\Delta) = R(\Delta) - C(\Delta), \tag{4}$$
where $R(\cdot)$ is the extrinsic response (reward) that an amicus gets from filing brief $\Delta$ and $C(\cdot)$ is the "cost" of filing the brief; dependency on other latent variables is notationally suppressed. When authoring her brief, we assume that the amicus writer has knowledge of the justices (IP and topic preferences), case parameters, and merits, but not the other amici participating in the case. Amici curiae are motivated to position themselves (through their briefs) in such a way as to improve the likelihood that their arguments will persuade SCOTUS justices. This is reflected in the way a justice votes or through the language of the opinions. Hence, we investigate two response functions. First, an amicus supporting side $s$ seeks to win votes for $s$,
$$R_{\text{vote}}(\Delta) = \frac{1}{|\mathcal{J}_i|} \sum_{j \in \mathcal{J}_i} p(v_{i,j} = s), \tag{5}$$
which is the expected number of votes for side $s$ under the model, normalized by the number of participating justices. This follows Sim et al. (2015).
⁷ In cases where there are fewer than nine justices voting, the size of $\tau_i^p$ and $\tau_i^r$ may be smaller.
An alternative is to maximize the (topical) similarity between her brief and the Court's opinion(s) siding with $s$,
$$R_{\text{opinion}}(\Delta) = 1 - H^2(\Delta, \Omega^s), \tag{6}$$
where $H^2(P, Q) = \frac{1}{2} \lVert \sqrt{P} - \sqrt{Q} \rVert_2^2$ is the squared Hellinger (1909) distance between two distributions, and $\Omega^s$ is the expected topic mixture under the model assumptions in §2.2 (which has a closed form). In short, the amicus gains utility by accurately predicting the expected opinion, thereby gaining publicity and demonstrating to members, donors, potential clients, and others that the language of the highly visible SCOTUS opinion was influenced. Both Eqs. 5 and 6 reward amici when justices "agree" with them, for different definitions of agreement.
We assume the cost $C(\Delta) = H^2(\Delta, \theta)$, the squared Hellinger distance between the mixture proportions of the amicus brief and merits briefs. The cost term defines the "budget" set of the amicus: briefs cannot be arbitrary text, as there is disutility or effort required to carefully frame a case, and monetary cost to hiring legal counsel. The key assumption is that framing is costly, while simply matching the merits is cheap (and presumably unnecessary).
Notationally, we use $U_{\text{vote}}$ to refer to models where Eq. 5 is the response function in the utility (Eq. 4) and $U_{\text{opinion}}$ where it is Eq. 6.
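The two ingredients of the utility can be computed directly from topic proportions. The sketch below assumes that utility is reward minus cost, that the opinion reward is one minus the squared Hellinger distance to the expected opinion mixture, and that the vote reward is the mean petitioner-side (or respondent-side) vote probability; the function names and toy vectors are hypothetical.

```python
import math
from typing import Sequence


def hellinger_sq(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Hellinger distance 0.5 * ||sqrt(p) - sqrt(q)||_2^2."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))


def utility_opinion(delta, theta, omega_s) -> float:
    """Assumed form: (1 - H^2(delta, omega_s)) - H^2(delta, theta)."""
    reward = 1.0 - hellinger_sq(delta, omega_s)
    cost = hellinger_sq(delta, theta)
    return reward - cost


def utility_vote(delta, theta, vote_probs_for_side) -> float:
    """Assumed form: mean vote probability for the supported side,
    minus the framing cost H^2(delta, theta)."""
    reward = sum(vote_probs_for_side) / len(vote_probs_for_side)
    return reward - hellinger_sq(delta, theta)


# Toy three-topic example (values are made up).
theta = [0.6, 0.3, 0.1]      # merits briefs
omega = [0.2, 0.5, 0.3]      # expected opinion mixture for side s
copy_merits = utility_opinion(theta, theta, omega)
match_opinion = utility_opinion(omega, theta, omega)
```

Comparing `copy_merits` and `match_opinion` shows the trade-off in the text: matching the merits costs nothing but earns less reward, while matching the expected opinion earns full reward at a framing cost.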
Random utility models. Recall our assumption that amici are purposeful writers whose briefs are optimized for their utility function. In an ideal setting, the $\Delta$ which we observe will be utility maximizing. We simplify computation by assuming that these amici agents' preferences also contain an idiosyncratic random component that is unobserved to us. This is a common assumption in discrete choice models known as a "random utility model" (McFadden, 1974). We view the utility function as a prior on $\Delta$,
$$p(\Delta \mid \ldots) \propto \exp\big(\eta\, U(\Delta)\big),$$
where our functional equations for utility imply $-1 \le U(\cdot) \le 1$ and $\eta$ is a hyperparameter tuned using cross validation. The behavior which we observe (i.e., the amicus' topic mixture proportions) has a likelihood that grows with her utility.
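A discrete analogue may help build intuition for the random utility assumption. The sketch below assumes a logit-style choice rule in which the probability of a candidate brief is proportional to the exponential of $\eta$ times its utility; restricting the amicus to a finite candidate set is purely an illustrative simplification, since $\Delta$ is continuous in the model, and the names are hypothetical.

```python
import math
from typing import List, Sequence


def choice_probabilities(utilities: Sequence[float], eta: float) -> List[float]:
    """Softmax over candidate briefs: p_k proportional to exp(eta * U_k)."""
    m = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(eta * (u - m)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Three hypothetical candidate briefs with utilities in [-1, 1].
probs = choice_probabilities([0.2, -0.5, 0.9], eta=5.0)
uniform = choice_probabilities([0.2, -0.5, 0.9], eta=0.0)
```

Here `eta` controls how strictly the amicus maximizes: with `eta = 0` every candidate is equally likely, and as `eta` grows the probability mass concentrates on the utility-maximizing brief.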

Parameter Estimation
The models we described above can be estimated within a Bayesian framework. We decoupled the estimation of the votes model from the opinions model; we first estimate the parameters for the votes model and hold them fixed while we estimate the new latent variables in the opinions model. In our preliminary experiments, we found that estimating parameters for both votes and opinions jointly led to slow mixing and poor predictive performance. Separating the estimation procedure into two stages allows the model to find better parameters for the votes model, which are then fed into the opinions model as priors through the vote probabilities. We used Metropolis within Gibbs, a hybrid MCMC algorithm, to sample the latent parameters from their posterior distributions (Tierney, 1994).¹⁰ For the Metropolis-Hastings proposal distributions, we used a Gaussian for the case parameters $a$, $b$, and justice IPs $\psi$, log-normal distributions for $\chi$ and $\pi$, and logistic-normal distributions for the variables on the simplex $\theta$, $\Delta$, $\tau$, and $\Gamma$. We tuned the hyperparameters of the proposal distributions at each iteration to achieve a target acceptance rate of 15-45%. We used $T = 128$ topics for our model and initialized topic proportions ($\theta$, $\Delta$) and topic-word distributions ($\phi$) using online LDA (Hoffman et al., 2010).
¹⁰ The details of our sampler and hyperparameter settings can be found in §A and §B of the supplementary materials.
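As an illustration of one building block of such a sampler, the sketch below performs Metropolis-Hastings updates for a single positive parameter (such as $\chi$ or $\pi$) with a log-normal random-walk proposal and adjusts the proposal scale toward a 15-45% acceptance rate. It is not the authors' sampler: the target density, the windowed tuning schedule, and the multiplicative adjustment factors are assumptions made for the example.

```python
import math
import random
from typing import Callable, List, Tuple


def mh_lognormal_step(x: float, log_post: Callable[[float], float],
                      step: float, rng: random.Random) -> Tuple[float, bool]:
    """One MH update for a positive scalar with a log-normal proposal."""
    prop = x * math.exp(step * rng.gauss(0.0, 1.0))
    # The proposal is asymmetric; the Hastings correction is prop / x.
    log_accept = (log_post(prop) - log_post(x)
                  + math.log(prop) - math.log(x))
    if rng.random() < math.exp(min(0.0, log_accept)):
        return prop, True
    return x, False


def tune_step(step: float, accept_rate: float,
              low: float = 0.15, high: float = 0.45) -> float:
    """Shrink the step if too few moves are accepted, grow it if too many."""
    if accept_rate < low:
        return step * 0.9
    if accept_rate > high:
        return step * 1.1
    return step


def run_chain(log_post: Callable[[float], float], x0: float, n_iter: int,
              seed: int = 0, window: int = 50) -> Tuple[List[float], float]:
    rng = random.Random(seed)
    x, step, accepted = x0, 1.0, 0
    samples = []
    for t in range(1, n_iter + 1):
        x, ok = mh_lognormal_step(x, log_post, step, rng)
        accepted += ok
        samples.append(x)
        if t % window == 0:  # assumed schedule: retune every `window` draws
            step = tune_step(step, accepted / window)
            accepted = 0
    return samples, step


# Stand-in target: an unnormalized Gamma(2, 1) log-density, log(x) - x.
samples, final_step = run_chain(lambda x: math.log(x) - x, x0=1.0, n_iter=5000)
```

In a Metropolis-within-Gibbs scheme, an update like `mh_lognormal_step` would be applied to each parameter in turn while the others are held fixed; the Gaussian and logistic-normal proposals mentioned above would replace the log-normal one for unconstrained and simplex-valued variables.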

Experiments
Data. In our experiments, we use SCOTUS cases from 1985 to 2014; votes and metadata are from Spaeth et al. (2013) and brief texts come from Sim et al. (2015). We concatenate each of the 2,643 cases' merits briefs from both parties to form a single document, where the text is used to infer the representation of the case in topical space ($\theta$; i.e., merits briefs are treated as "facts of the case"). Likewise, opinions supporting the same side of the case (i.e., majority and concurring vs. dissents) were concatenated to form a single document. In our dataset, the opinions are explicitly labeled with the justice who authored them (as well as other justices who decide to "join" it).
As the amicus briefs in the dataset were not explicitly labeled with the side that they support, Sim et al. (2015) built a binary classifier with bag-of-n-gram features that took advantage of cues in the brief content that strongly signal the side that the amici support (e.g., "in support of petitioner"). We used their classifier to label the amici's supporting side. Additionally, we created regular expression rules to identify and standardize amicus authors from the header of briefs. We filtered out amici who have participated in fewer than 5 briefs and merged regional chapters of amicus organizations together (e.g., "ACLU of Kansas" and "ACLU of Kentucky" are both labeled "ACLU"). On the other hand, we separately labeled amicus briefs by the U.S. Solicitor General according to the presidential administration when the brief was filed (e.g., an amicus brief filed during Obama's administration is labeled "USSG-Obama"). The top three amici by number of briefs filed are American Civil Liberties Union (463), Utah (376), and National Association of Criminal Defense Lawyers (359). We represent a document as a bag of n-grams with part of speech tags that follow the simple but effective pattern (Adjective|Cardinal|Noun)+ Noun (Justeson and Katz, 1995). We filter phrases appearing fewer than 100 times or in more than 8,500 documents, obtaining a final set of 48,589 phrase types. Table 1 summarizes the details of our corpus.
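The part-of-speech pattern can be applied with an ordinary regular expression once each tag is mapped to a single character. The sketch below assumes Penn Treebank style tags (`JJ*` for adjectives, `CD` for cardinals, `NN*` for nouns) on already-tagged input, and it returns only maximal non-overlapping matches; both choices are assumptions for illustration, and the helper names are hypothetical.

```python
import re
from typing import List, Tuple


def coarse_tag(tag: str) -> str:
    """Map a Penn Treebank tag to A(djective), C(ardinal), N(oun), or x."""
    if tag.startswith("JJ"):
        return "A"
    if tag == "CD":
        return "C"
    if tag.startswith("NN"):
        return "N"
    return "x"


# (Adjective|Cardinal|Noun)+ Noun over the one-character tag string.
PHRASE_PATTERN = re.compile(r"[ACN]+N")


def extract_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return maximal phrases whose tag sequence matches the pattern."""
    codes = "".join(coarse_tag(tag) for _, tag in tagged)
    phrases = []
    for m in PHRASE_PATTERN.finditer(codes):
        words = [word for word, _ in tagged[m.start():m.end()]]
        phrases.append(" ".join(words))
    return phrases


sentence = [
    ("the", "DT"), ("first", "JJ"), ("amendment", "NN"),
    ("protects", "VBZ"), ("free", "JJ"), ("speech", "NN"),
]
phrases = extract_phrases(sentence)
```

Because the pattern requires at least one modifier before the final noun, single nouns such as "court" are not extracted on their own.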
Predicting Votes. We quantify the performance of our vote model using 5-fold cross validation and by predicting future votes from past votes. The utility function in the vote model uses the response function in Eq. 5. Due to the specification of IP models, we need the case parameters of new cases to predict the direction of the votes. Gerrish and Blei (2011) accomplished this by using regression on legislative text to predict the case parameters ($a$, $b$). Here, we follow a similar approach, fitting ridge regression models on the merits brief topic mixtures $\theta$ to predict $a$ and $b$ for each case. On the held-out test cases, we sampled the mixture proportions for the merits and amicus briefs directly using latent Dirichlet allocation with parameters learned while fitting our vote model. With the parameters from our fitted vote model and ridge regression, we can predict the votes of every justice for every case.
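The regression step can be sketched with the closed-form ridge solution. The example below solves $(X^\top X + \lambda I) w = X^\top y$ by Gaussian elimination in pure Python, with rows of $X$ standing in for topic mixtures $\theta$ and $y$ for a fitted case parameter such as $a$; omitting an intercept, the regularization strength, and the toy data are all assumptions for illustration.

```python
from typing import List, Sequence


def ridge_fit(X: Sequence[Sequence[float]], y: Sequence[float],
              lam: float) -> List[float]:
    """Solve (X^T X + lam * I) w = X^T y with partial-pivot elimination."""
    d = len(X[0])
    A = [[sum(row[p] * row[q] for row in X) + (lam if p == q else 0.0)
          for q in range(d)] for p in range(d)]
    b = [sum(row[p] * yi for row, yi in zip(X, y)) for p in range(d)]
    for col in range(d):
        # Swap in the row with the largest pivot for numerical stability.
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, d):
            f = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * d
    for r in range(d - 1, -1, -1):  # back substitution
        tail = sum(A[r][c] * w[c] for c in range(r + 1, d))
        w[r] = (b[r] - tail) / A[r][r]
    return w


def ridge_predict(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy data generated from w = [2, -1] (made-up numbers).
X_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_train = [2.0, -1.0, 1.0]
w_unregularized = ridge_fit(X_train, y_train, lam=0.0)
w_ridge = ridge_fit(X_train, y_train, lam=1.0)
```

A predicted $a$ or $b$ for a held-out case would then be `ridge_predict(w_ridge, theta_new)`, where `theta_new` is that case's inferred topic mixture.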
We compared the performance of our model with two strong baselines: (i) a random forest trained on case-centric metadata coded by Spaeth et al. (2013) to make predictions on how justices would vote (Katz et al., 2014) and (ii) the amici IP model of Sim et al. (2015), which uses amicus briefs and their version of utility; it is a simpler version of our vote model that does not consider the persuasiveness of different amici or the influenceability of different justices. For prediction with the model of Sim et al. (2015), we used the same approach described above to estimate the case parameters $a$ and $b$, and regressed on amicus brief topics ($\Delta$) to estimate the amicus polarities $c^p$ and $c^r$. Table 2 shows performance on vote prediction. We evaluated the models using 5-fold cross validation, as well as on forecasting votes in 2013 and 2014 (trained using data from 1985 to the preceding year). Our model outperformed the baseline models. The improvement in accuracy over Sim et al. (2015) is small, most likely because both models are very similar, the main difference being the parametrization of amicus briefs. In the 2013 test set, the distribution of votes is significantly skewed towards the petitioner (compared to the training data), which resulted in the most frequent class classifier performing much better than everything else. Fig. 1 illustrates our model's estimated ideal points for selected topics.
Predicting Opinions. We also estimated the opinion model using the utility function with the response function in Eq. 6. We use perplexity as a proxy to measure the opinion content predictive ability of our model. Perplexity on a test set is commonly used to quantify the generalization ability of probabilistic models and make comparisons among models over the same observation space. For a case with opinion $\boldsymbol{w}$ supporting side $s$, the perplexity is defined as
$$\exp\left(-\frac{\log p(\boldsymbol{w} \mid s, \ldots)}{N}\right),$$
where $N$ is the number of tokens in the opinion and a lower perplexity indicates better generalization performance. The likelihood term can be approximated using samples from the inference step. Table 3 shows the perplexity of our model on opinions in the test set. As described in §2.4, we learn the vote model in the first stage before estimating the opinion model. Here, we compare our model against variants whose first-stage vote models do not include $U_{\text{vote}}$, to evaluate the sensitivity of our opinion model to the vote model parameters. Additionally, we compared against two baselines trained on just the opinions: one using LDA¹³ and another using the author-topic model (Rosen-Zvi et al., 2004). For the author-topic model, we treat each opinion as being "authored" by the participating justices, a pseudo-author representing the litigants which is shared between opinions in a case, and a unique amicus author for each side. Our model with $U_{\text{opinion}}$ achieves better generalization performance than the simpler baselines, while we do not see significant differences in whether the first stage vote models use $U_{\text{vote}}$. This is not surprising since the vote model's results are similar with or without $U_{\text{vote}}$ and it influences the opinion model indirectly through priors and $U_{\text{opinion}}$. In our model, the latent variable $\Gamma_j$ captures the proportion of topics that justice $j$ is likely to contribute to an opinion. When $j$ has a high probability of voting for a particular side, our informed prior increases the likelihood that $j$'s topics will be selected for words in the opinion. While $\Gamma_j$ serves a similar purpose to $\psi_j$ in characterizing $j$ through her ideological positions, $\psi_j$ relies on votes and gives us a "direction" of $j$'s ideological standing, whereas $\Gamma_j$ is estimated from text produced by the justices and only gives us the "magnitude" of her tendency to author on a particular issue. In Table 4, we identify the top topics in $\Gamma_j$ by considering the deviation from the mean of all justices' $\Gamma$, i.e., $\Gamma_{j,k} - \frac{1}{|\mathcal{J}|} \sum_{j' \in \mathcal{J}} \Gamma_{j',k}$.
Figure 1: Justices' ideal points for selected topics (panel title: "120: marriage, same sex, man"). Justices whose topic IPs are close to each other are more likely to vote in the same direction on cases involving those topics. The IP estimated by our model is consistent with publicly available knowledge regarding justices' ideological stances on these issues.
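Perplexity and the Table 4 style ranking are both short computations. The sketch below assumes that per-token log-probabilities for a held-out opinion are already available (for instance, averaged over posterior samples) and that each justice's $\Gamma_j$ is given as a list of topic proportions; the function names and toy inputs are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp(-(1/N) * sum of per-token log-probabilities)."""
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("perplexity is undefined for an empty document")
    return math.exp(-sum(token_log_probs) / n)


def top_topics_by_deviation(gamma: Dict[str, Sequence[float]],
                            justice: str, k: int) -> List[int]:
    """Rank topics by Gamma[j][t] minus the mean over all justices."""
    n_topics = len(gamma[justice])
    mean = [sum(g[t] for g in gamma.values()) / len(gamma)
            for t in range(n_topics)]
    deviation = [gamma[justice][t] - mean[t] for t in range(n_topics)]
    order = sorted(range(n_topics), key=lambda t: deviation[t], reverse=True)
    return order[:k]


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 8)

# Toy topic preferences for two hypothetical justices.
toy_gamma = {"A": [0.7, 0.2, 0.1], "B": [0.1, 0.2, 0.7]}
top_for_a = top_topics_by_deviation(toy_gamma, "A", k=1)
```

Subtracting the across-justice mean keeps topics that every justice writes about from dominating each justice's list.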
Amici Persuasiveness. The latent variable $\pi_e$ captures the model's belief about the effect of amicus $e$'s briefs on the case IP, which we call "persuasiveness." A large $\pi_e$ indicates that across the dataset, $e$ exerts a larger effect on the case IPs, that is, according to our model, she has a larger impact on the Court's decision than other amici. Fig. 2 is a swarm plot illustrating the distribution of $\pi$ values for different types of amicus writers. Our model infers that governmental offices tend to have larger $\pi$ values than private organizations, especially the U.S. Solicitor General. In fact, Lynch (2004) found through interviews with SCOTUS law clerks that "amicus briefs from the solicitor general are 'head and shoulders' above the rest, and are often considered more carefully than party briefs." Another interesting observation from Fig. 2 is the low $\pi$ values for the ACLU and ABA, despite their being prolific amicus brief filers. While it is tempting to say that amici with low $\pi$ values are ineffective, we find that there is almost no correlation between $\pi$ and the proportion of cases where they were on the winning side.¹⁵ Note that our model does not assume that a "persuasive" amicus tends to win. Instead, an amicus with large $\pi$ will impact the case IP most, and thus explain a justice's vote or opinion (even dissenting) more than the other components in a case.
¹³ We used scikit-learn's LDA module (Pedregosa et al., 2011) which implements the online variational Bayes algorithm (Hoffman et al., 2010).
Table 4: Top topics in $\Gamma_j$ for selected justices (topic ID: top phrases).
John G. Roberts
  32: speech, first amendment, free speech, message, expression
  61: eeoc, title vii, discrimination, woman, civil rights act
  52: sec, fraud, security, investor, section ##b
Ruth B. Ginsburg
  61: eeoc, title vii, discrimination, woman, civil rights act
  80: class, settlement, rule ##, class action, r civ
  96: taxpayer, bank, corporation, fund, irs
Antonin Scalia
  94: 42 USC 1983, qualified immunity, immunity, official, section ####
  57: president, senate, executive, article, framer
  80: class, settlement, rule ##, class action, r civ
Insofar as $\pi$ explains a vote, we must exercise caution; it is possible that the amicus played no role in the decision-making process and the values of $\pi_e$ simply reflect our modeling assumptions and/or artifacts of the data. Without entering the minds of SCOTUS justices, or at least observing their closed-door deliberations, it is difficult to measure the influence of amicus briefs on justices' decisions.
Justice Influenceability. The latent variable $\chi_j$ measures the relative effect of amicus briefs on justice $j$'s vote IP; when $\chi_j$ is large, justice $j$'s vote probability is affected by amicus briefs more. Since $\chi_j$ is shared between all cases that a justice participates in, $\chi_j$ should correspond to how much the justice values amicus briefs. Some justices, such as the late Scalia, are known to be dubious of amicus briefs, preferring to leave the task of reading these briefs to their law clerks, who will pick out any notable briefs for them; we would expect Scalia to have a smaller $\chi$ than other justices. In Table 5, we compare the $\chi$ values of justices with how often they cite an amicus brief in any opinion they wrote (Franze and Anderson, 2015). The $\chi$ values estimated by our model are consistent with our expectations. We note that the $\chi$ values correlate considerably with the general ideological leanings of the justices. This might be a coincidence or an inability of the model's specification to discern between ideological extremeness and influenceability.
econometrics estimates structural utility-based decisions (Berry et al., 1995, inter alia).
¹⁵ The correlation between $\pi$ and the proportion of winning sides is −0.0549. On average, an amicus supports the winning side in 55% of cases. For the ACLU, ABA, CAC, and CWFA, the proportions are 44%, 50%, 47%, and 50% respectively.

Conclusion
We presented a random utility model of the Supreme Court that is more comprehensive than earlier work. We considered an individual amicus' persuasiveness and motivations through two different utility functions. On the vote prediction task, our results are consistent with earlier work, and we can infer and compare the relative effectiveness of an individual amicus. Moreover, our opinions model and opinion utility function achieved better generalization performance than simpler methods.