Gov2Vec: Learning Distributed Representations of Institutions and Their Legal Text

We compare policy differences across institutions by embedding representations of the entire legal corpus of each institution and the vocabulary shared across all corpora into a continuous vector space. We apply our method, Gov2Vec, to Supreme Court opinions, Presidential actions, and official summaries of Congressional bills. The model discerns meaningful differences between government branches. We also learn representations for more fine-grained word sources: individual Presidents and (2-year) Congresses. The similarities between learned representations of Congresses over time and sitting Presidents are negatively correlated with the bill veto rate, and the temporal ordering of Presidents and Congresses was implicitly learned from only text. With the resulting vectors we answer questions such as: how do Obama and the 113th House differ in addressing climate change and how does this vary from environmental or economic perspectives? Our work illustrates vector-arithmetic-based investigations of complex relationships between word sources based on their texts. We are extending this to create a more comprehensive legal semantic map.


Introduction
Methods have been developed to efficiently obtain representations of words in $\mathbb{R}^d$ that capture subtle semantics across the dimensions of the vectors (Collobert and Weston, 2008). For instance, after sufficient training, relationships encoded in difference vectors can be uncovered with vector arithmetic: vec("king") - vec("man") + vec("woman") returns a vector close to vec("queen") (Mikolov et al. 2013a).
* Forthcoming paper in the 2016 Proceedings of Empirical Methods in Natural Language Processing Workshop on Natural Language Processing and Computational Social Science.
Applying this powerful notion of distributed continuous vector space representations of words, we embed representations of institutions and the words from their law and policy documents into a shared semantic space. We can then combine positively and negatively weighted word and government vectors into the same query, enabling complex, targeted and subtle similarity computations. For instance, which government branch is more characterized by "validity and truth," or "long-term government career"?
We apply this method, Gov2Vec, to a unique corpus of Supreme Court opinions, Presidential actions, and official summaries of Congressional bills. The model discerns meaningful differences between House, Senate, President and Court vectors. We also learn more fine-grained institutional representations: individual Presidents and Congresses (2-year terms). The method implicitly learns important latent relationships between these government actors that were not provided during training. For instance, their temporal ordering was learned from only their text. The resulting vectors are used to explore differences between actors with respect to policy topics.

Methods
A common method for learning vector representations of words is to use a neural network to predict a target word with the mean of its context words' vectors, obtain the gradient with back-propagation of the prediction errors, and update vectors in the direction of higher probability of observing the correct target word (Bengio et al. 2003; Mikolov et al. 2013b). After iterating over many word contexts, words with similar meaning are embedded in similar locations in vector space as a by-product of the prediction task (Mikolov et al. 2013b). Le and Mikolov (2014) extend this word2vec method to learn representations of documents. For predictions of target words, a vector unique to the document is concatenated with context word vectors and subsequently updated. Similarly, we embed institutions and their words into a shared vector space by averaging a vector unique to an institution with context word vectors when predicting that institution's words and, with back-propagation and stochastic gradient descent, update representations for institutions and the words (which are shared across all institutions).¹ There are two hyper-parameters for the algorithm that can strongly affect results, but suitable values are unknown. We use a tree of Parzen estimators search algorithm (Bergstra et al. 2013) to sample from parameter space² and save all models estimated. Subsequent analyses are conducted across all models, propagating our uncertainty in hyper-parameters.
¹ We use a binary Huffman tree (Mikolov et al. 2013b) for efficient hierarchical softmax prediction of words, and conduct 25 epochs while linearly decreasing the learning rate from 0.025 to 0.001.
² Vector dimensionality, uniform(100, 200), and maximum distance between the context and target words, uniform(10, 25).
Figure 1 (caption fragment): ... with word and Gov prediction, we set "Gov window size" to 1, e.g. a Congress is used to predict those directly before and after.
Due to stochasticity in training and the uncertainty in the hyper-parameter values, patterns robust across the ensemble are more likely to reflect useful regularities than individual models.
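The word-prediction update described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a full softmax over a toy vocabulary in place of the hierarchical softmax used in the paper, and the array names, sizes, and the `train_step` helper are hypothetical. It shows only the core idea of averaging an institution vector with context word vectors, predicting the target word, and updating all participating vectors by stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration (the paper samples 100-200 dimensions).
n_words, n_sources, dim = 50, 4, 16

word_in = rng.normal(scale=0.1, size=(n_words, dim))     # context word vectors, shared by all sources
gov_vecs = rng.normal(scale=0.1, size=(n_sources, dim))  # one vector per institution
word_out = np.zeros((n_words, dim))                      # output weights used for prediction


def train_step(source_id, context_ids, target_id, lr=0.025):
    """Run one SGD update and return the negative log-likelihood before the update."""
    # Hidden layer: mean of the institution vector and the context word vectors.
    n_inputs = len(context_ids) + 1
    hidden = (gov_vecs[source_id] + word_in[context_ids].sum(axis=0)) / n_inputs

    # Assumption: full softmax as a simple stand-in for hierarchical softmax.
    scores = word_out @ hidden
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    loss = -np.log(probs[target_id])

    # Gradient of the loss with respect to the scores is probs - one_hot(target).
    err = probs.copy()
    err[target_id] -= 1.0
    grad_hidden = word_out.T @ err

    # Update output weights, the institution vector, and the context word vectors.
    word_out[:] -= lr * np.outer(err, hidden)
    gov_vecs[source_id] -= lr * grad_hidden / n_inputs
    np.subtract.at(word_in, context_ids, lr * grad_hidden / n_inputs)
    return loss


# Linearly decrease the learning rate from 0.025 to 0.001 over 25 passes,
# here repeated on a single toy example rather than a real corpus.
losses = []
for lr in np.linspace(0.025, 0.001, 25):
    losses.append(train_step(source_id=0, context_ids=[1, 2, 3], target_id=4, lr=lr))
```

Because word vectors are shared while each institution has its own vector, the institution vector has to absorb whatever the shared context words cannot explain about that institution's word choices.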
Gov2Vec can be applied to more fine-grained categories than entire government branches. In this context, there are often relationships between word sources, e.g. Obama after Bush, that we can incorporate into the learning process. During training, we alternate between updating GovVecs based on their use in the prediction of words in their policy corpus and their use in the prediction of other word sources located nearby in time. We model temporal institutional relationships, but any known relationships between entities, e.g. ranking Congresses by number of Republicans, could also be incorporated into the Structured Gov2Vec training process (Fig. 1).
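As a rough illustration of the source-prediction half of this alternation, the sketch below builds the prediction targets for each word source from a temporal ordering, assuming a window of one position on either side (a Congress predicts the Congresses directly before and after it). The `temporal_neighbors` helper and the example list are hypothetical and only show how such targets might be derived.

```python
def temporal_neighbors(ordered_sources, window=1):
    """Map each source to the sources within `window` positions of it in the ordering."""
    neighbors = {}
    for i, source in enumerate(ordered_sources):
        lo = max(0, i - window)
        hi = min(len(ordered_sources), i + window + 1)
        neighbors[source] = [ordered_sources[j] for j in range(lo, hi) if j != i]
    return neighbors


# Toy ordering of three consecutive Congresses.
congresses = ["111th", "112th", "113th"]
pairs = temporal_neighbors(congresses)
```

Any other known ordering, such as a ranking of Congresses by number of Republicans, could be passed in place of the temporal one.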
After training, we extract $(M + S) \times d_j \times J$ parameters, where $M$ is the number of unique words, $S$ is the number of word sources, and $d_j$ is the vector dimensionality, which varies across the $J$ models (we set $J = 20$). We then investigate the most cosine similar words to particular vector combinations,

$$\arg\max_{v^* \in V_{1:N}} \cos\left(v^*, \frac{1}{W} \sum_{i=1}^{W} w_i \times s_i\right),$$

where $\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$, $w_i$ is one of $W$ WordVecs or GovVecs of interest, $V_{1:N}$ are the $N$ most frequent words in the vocabulary of $M$ words ($N < M$ to exclude rare words during analysis) excluding the $W$ query words, and $s_i$ is 1 or -1 for whether we are positively or negatively weighting $w_i$. We repeat similarity queries over all $J$ models, retain words with $> C$ cosine similarity, and rank the word results based on their frequency and mean cosine similarity across the ensemble. We also measure the similarity of WordVec combinations to each GovVec and the similarities between GovVecs to validate that the process learns useful embeddings that capture expected relationships.
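The ensemble query can be sketched as follows. This is a minimal illustration rather than the authors' code. It assumes each model is a dictionary mapping word and source names to vectors, and that results are ranked first by how many models returned the word and then by mean cosine similarity, which is one plausible reading of the ranking described above. The function names and toy vectors are hypothetical.

```python
from collections import defaultdict

import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def query_model(vectors, candidates, weighted_terms, threshold):
    """Return {word: similarity} for candidates above `threshold` in one model.

    weighted_terms is a list of (name, sign) pairs with sign +1 or -1.
    """
    query = np.mean([sign * vectors[name] for name, sign in weighted_terms], axis=0)
    query_names = {name for name, _ in weighted_terms}
    hits = {}
    for word in candidates:
        if word in query_names:  # the query terms themselves are excluded
            continue
        sim = cosine(vectors[word], query)
        if sim > threshold:
            hits[word] = sim
    return hits


def query_ensemble(models, candidates, weighted_terms, threshold):
    """Rank words by frequency across models, then by mean cosine similarity."""
    sims = defaultdict(list)
    for vectors in models:
        for word, sim in query_model(vectors, candidates, weighted_terms, threshold).items():
            sims[word].append(sim)
    ranked = sorted(sims.items(), key=lambda kv: (len(kv[1]), np.mean(kv[1])), reverse=True)
    return [(word, len(values), float(np.mean(values))) for word, values in ranked]


# Two toy models with different dimensionalities (made-up vectors, not learned ones).
model_a = {"obama": np.array([1.0, 0.0]), "climate": np.array([0.0, 1.0]),
           "greenhouse": np.array([1.0, 1.0]), "tax": np.array([-1.0, -1.0])}
model_b = {"obama": np.array([1.0, 0.0, 0.0]), "climate": np.array([0.0, 1.0, 0.0]),
           "greenhouse": np.array([1.0, 1.0, 0.0]), "tax": np.array([1.0, 1.0, 1.0])}

ranked = query_ensemble([model_a, model_b], ["greenhouse", "tax", "obama"],
                        [("obama", +1), ("climate", +1)], threshold=0.35)
```

Each model is queried in its own vector space, so differing dimensionalities across the ensemble pose no problem; only the resulting word lists and similarities are pooled.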

WordVec-GovVec Similarities
We tested whether our learned vectors captured meaningful differences between branches. Fig. 2 displays similarities between these queries and the branches, which reflect a priori known differences.
Gov2Vec has unique capabilities that summary statistics, e.g. word frequency, lack: it can compute similarities between any source and word as long as the word occurs in at least one source, whereas word counting cannot provide meaningful similarities when a word never occurs in a source's corpus. Most importantly, Gov2Vec can combine positively and negatively weighted vectors in a single similarity query.

GovVec-GovVec Similarities
We learned representations for individual Presidents and Congresses by using vectors for these higher resolution word sources in the word prediction task. To investigate if the representations capture important latent relationships between institutions, we compared the cosine similarities between the Congresses over time (93rd-113th) and the corresponding sitting Presidents (Nixon-Obama) to the bill veto rate. We expected that a lower veto rate would be reflected in more similar vectors, and, indeed, the Congress-President similarity and veto rate are negatively correlated (Spearman's ρ computed on raw …
As a third validation, we learn vectors from only text and project them into two dimensions with principal components analysis. From Fig. 4 it is evident that temporal and institutional relationships were implicitly learned. One dimension (top-to-bottom) almost perfectly rank orders Presidents and Congresses by time, and another dimension (side-to-side) separates the President from Congress.
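The two-dimensional projection can be illustrated with a short NumPy sketch of principal components analysis via the singular value decomposition. The `pca_2d` helper is hypothetical and the input is random data standing in for learned GovVecs, so the output carries no substantive meaning; it only shows the mechanics of the projection.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T


# Random stand-in for 10 GovVecs of dimensionality 50 (not real learned vectors).
rng = np.random.default_rng(0)
toy_govvecs = rng.normal(size=(10, 50))
coords = pca_2d(toy_govvecs)  # one (x, y) pair per source
```

With real GovVecs, plotting `coords` and labeling each point by source would allow a visual check of whether one axis orders the sources by time.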

Fig. 5 (top) asks: how do Obama and the 113th House differ in addressing climate change and how does this vary across environmental and economic contexts? The most frequent word across the ensemble (out of words with > 0.35 similarity to the query) for the Obama-economic quadrant is "unprecedented." "Greenhouse" and "ghg" are more frequent across models and have a higher mean similarity for Obama-Environmental than 113th House-Environmental. Fig. 5 (bottom) asks: how does the House address war from "oil" and "terror" perspectives and how does this change after the 2001 terrorist attack? Compared to the 106th, both the oil and terrorism panels in the 107th (when 9-11 occurred) have words more cosine similar to the query (further to the right), suggesting that the 107th House was closer to the topic of war, and the content changes to primarily strong verbs such as instructs, directs, requires, urges, and empowers.

Additional Related Work
Political scientists model text to understand political processes (Grimmer 2010; Roberts et al. 2014); however, most of this work focuses on variants of topic models (Blei et al. 2003). Djuric et al. (2015) apply a learning procedure similar to Structured Gov2Vec to streaming documents to learn representations of documents that are similar to those nearby in time. Structured Gov2Vec applies this joint hierarchical learning process (using entities to predict words and other entities) to non-textual entities. Kim et al. (2014) and Kulkarni et al. (2015) train neural language models for each year of a time-ordered corpus to detect changes in words. Instead of learning models for distinct times, we learn a global model with embeddings for time-dependent entities that can be included in queries to analyze change. Kiros et al. (2014) learn embeddings for text attributes by treating them as gating units to a word embedding tensor. Their process is more computationally intensive than ours.

Conclusions and Future Work
We learned vector representations of text meta-data on a novel data set of legal texts that includes case, statutory, and administrative law. The representations effectively encoded important relationships between institutional actors that were not explicitly provided during training. Finally, we demonstrated fine-grained investigations of policy differences between actors based on vector arithmetic. More generally, the method can be applied to measuring similarity between any entities producing text, and used for recommendations, e.g. what is the closest think tank vector to the non-profit vector representation of the Sierra Club? Methodologically, our next goal is to explore where training on non-textual relations, i.e. Structured Gov2Vec, is beneficial. It seems to help stabilize representations when exploiting temporal relations, but political relations may prove to be even more useful. Substantively, our goal is to learn a large collection of vectors representing government actors at different resolutions and within different contexts to address a range of targeted policy queries. Once we learn these representations, researchers could efficiently search for differences in law and policy across time, government branch, and political party.