Conditional Word Embedding and Hypothesis Testing via Bayes-by-Backprop

Conventional word embedding models do not leverage information from document meta-data, and they do not model uncertainty. We address these concerns with a model that incorporates document covariates to estimate conditional word embedding distributions. Our model allows for (a) hypothesis tests about the meanings of terms, (b) assessments as to whether a word is near or far from another conditioned on different covariate values, and (c) assessments as to whether estimated differences are statistically significant.


Introduction
Whether a word's meaning varies across contexts has become a major focus of NLP, linguistics, and social science research in recent years. For example, since the early 20th century, the word "gay" has evolved from describing an emotion to being more aligned with sexual orientation (Hamilton et al., 2016b). Popular word embedding techniques (e.g., Mikolov et al., 2013a; Pennington et al., 2014) have proven useful for analyzing language evolution. But to use these models for such research, scholars often divide a corpus into distinct training sets (e.g., train independent language models on different decades of text) and compare model output across specifications in an ad hoc way (Garg et al., 2018). Such splitting inhibits many within- and across-word comparisons, since embeddings are only comparable within a given model. Additionally, most methods ignore the variance of words, mechanically treating words equally regardless of the volatility, or uncertainty, in their meanings. If one inspects semantics with only point estimates of embeddings, it is hard to tell whether the embeddings represent meaningful traits or simply noise in the data.
We address these concerns in three ways. First, we estimate a vector for each distinct value of the document covariates, using a multilayer perceptron (MLP) with a non-linear activation function. Second, we parametrize the covariance matrix of each embedding vector explicitly in the model, adopting the Bayes-by-Backprop algorithm (Blundell et al., 2015). Third, we use Hotelling's T² statistic (Hotelling, 1931) to assess whether estimated differences in word vectors are statistically distinguishable under a χ² null distribution (Ito, 1956). To our knowledge, no prior work evaluates word embeddings with this statistical framework.

Related Work
Drift Analysis using Word Embeddings There are several ways to measure drift in word meanings. Hamilton et al. (2016c) propose using cosine similarities of words in different contexts to detect changes. Hamilton et al. (2016b) provide an alternative measure based on the distance of words from their nearest neighbors. Rudolph and Blei (2018) analyze the absolute drift of words using Euclidean distance in (two discrete) slices of data. All of these methods compute word distances based only on the point (i.e., mean) estimates of the word embeddings. Rudolph and Blei (2018) estimate dynamic Bernoulli embeddings (DBE), extending the exponential family embedding (Rudolph et al., 2016) generalization of Mikolov et al. (2013a), to learn conditional word embeddings over time. Their amortized approach builds a separate neural network that transforms a global word vector into a covariate-specific vector, and is closely related to our approach in this paper. However, a noticeable omission in their model is that they do not explicitly model parameter covariance or uncertainty.

Word Embedding with Uncertainty Vilnis and McCallum (2017) proposed an energy-based learning framework in which each word is represented as a multivariate Gaussian distribution with a diagonal covariance. The energy function is defined by the divergence (e.g., KL) between two Gaussian embeddings, and the margin ranking loss (Weston et al., 2011) is minimized. A related model is the Bayesian skip-gram of Brazinskas et al. (2017), which posits a generative model in which words are associated with multivariate Gaussian latent variables that generate context words. The parameters of the prior distributions over these latent variables are estimated by maximizing the variational lower bound, and act as word embeddings.
These works replace mean estimates of embeddings with Gaussian distributions, similar to our proposal here. However, they arrive at this differently: Vilnis and McCallum (2017) from energy-based learning (LeCun et al., 2006), and Brazinskas et al. (2017) from generative modeling.
We provide yet another angle: via (approximate) Bayesian neural networks.

Conditional Word Embedding
Adopting Bayes-by-Backprop for Estimation Given a tuple of a word v, a covariate x and a context word v_c, we define the conditional log-probability as

$$\log p(v_c \mid v, x) = \theta_{v|x}^\top \theta^c_{v_c} - \log \sum_{v' \in V} \exp\left(\theta_{v|x}^\top \theta^c_{v'}\right),$$

where θ_{v|x} and θ^c_{v_c} are the conditional word embedding of v given x and the context embedding of v_c, respectively, and V is the vocabulary of all unique words. To avoid the expensive computation of the partition function, we use negative sampling (Mikolov et al., 2013b), which stochastically approximates the log-probability above by

$$\log p(v_c \mid v, x) \approx \log \sigma\left(\theta_{v|x}^\top \theta^c_{v_c}\right) + \sum_{m=1}^{M} \log \sigma\left(-\theta_{v|x}^\top \theta^c_{v_c^m}\right), \tag{1}$$

where v_c^m ∈ V is the m-th negative sample drawn from a unigram distribution estimated from D.
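As a concrete illustration, the negative-sampling objective can be sketched in a few lines of numpy; function names are ours, not from the paper's code, and the θ vectors are plain arrays:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_logprob(theta_vx, theta_c_pos, theta_c_negs):
    """Negative-sampling approximation of log p(v_c | v, x):
    log sig(theta_vx . theta_pos) + sum_m log sig(-theta_vx . theta_neg_m)."""
    pos = np.log(sigmoid(theta_vx @ theta_c_pos))                    # observed context word
    neg = sum(np.log(sigmoid(-theta_vx @ t)) for t in theta_c_negs)  # negative samples
    return pos + neg
```

The positive term pulls the conditional word vector towards its observed context word, while each negative term pushes it away from a sampled non-context word.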
We define a prior distribution over each parameter θ to be a scaled mixture of two Gaussians, as suggested by Blundell et al. (2015):

$$p(\theta) = u\,\mathcal{N}\!\left(\theta; 0, \sigma_1^2\right) + (1 - u)\,\mathcal{N}\!\left(\theta; 0, \sigma_2^2\right), \tag{2}$$

where σ_1, σ_2 and u are hyperparameters. As exactly marginalizing out the parameters θ_· and θ^c_· is not scalable, we maximize the variational lower bound of the marginal probability. To do so, we introduce a variational posterior q(θ|φ) parametrized by its own parameter set φ. The resulting (negative) variational lower bound is approximated by

$$\mathcal{F}(\mathcal{D}, \phi) \approx \sum_{m=1}^{M} \left[ \log q\left(\theta^{(m)} \mid \phi\right) - \log p\left(\theta^{(m)}\right) - \log p\left(\mathcal{D} \mid \theta^{(m)}\right) \right],$$

where θ^(m) is the m-th sample from the variational posterior q (Blundell et al., 2015), drawn via the Gaussian reparametrization of Kingma and Welling (2013).
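The scale-mixture prior in Eq. (2) can likewise be written out directly; the hyperparameter defaults below are illustrative, not the paper's settings:

```python
import numpy as np

def log_gauss(theta, sigma):
    # Elementwise log N(theta; 0, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma**2) - theta**2 / (2 * sigma**2)

def log_mixture_prior(theta, u=0.5, sigma1=1.0, sigma2=0.1):
    """log p(theta) for the scale mixture u*N(0, sigma1^2) + (1-u)*N(0, sigma2^2),
    evaluated elementwise and summed over all entries of theta."""
    comp = u * np.exp(log_gauss(theta, sigma1)) + (1 - u) * np.exp(log_gauss(theta, sigma2))
    return np.sum(np.log(comp))
```

With a small σ_2, the second mixture component concentrates mass near zero, acting much like a sparsity-inducing prior on the embedding parameters.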
We formulate the variational posterior as a multivariate Gaussian with diagonal covariance. We use stochastic gradient descent (SGD) to minimize F with respect to the variational parameters φ. At each SGD step, we compute the gradient of the following per-example cost given a tuple (v, v_c, x):

$$f(v, v_c, x) = \frac{1}{|\mathcal{D}|}\left[\log q\left(\hat\theta \mid \phi\right) - \log p\left(\hat\theta\right)\right] - \log p\left(v_c \mid v, x; \hat\theta\right),$$

where θ̂ is a single sample from the approximate posterior, and log p(v_c|v, x) and log p(θ) are from Eqs. (1)-(2). By minimizing F, we estimate the (approximate) posterior distribution of each conditional word embedding θ_{v|x} rather than a point estimate. See Sec. A of the supplementary material for the detailed steps for computing the per-example cost.
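A minimal sketch of the reparametrized posterior sample, its log-density, and the per-example cost that combines them; names and shapes are illustrative:

```python
import numpy as np

def softplus(rho):
    # sigma = log(1 + exp(rho)) keeps the standard deviation positive
    return np.log1p(np.exp(rho))

def sample_theta(mu, rho, rng):
    """Gaussian reparametrization: theta = mu + softplus(rho) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + softplus(rho) * eps

def log_q(theta, mu, rho):
    # Log-density of the diagonal-Gaussian variational posterior q(theta | phi)
    sigma = softplus(rho)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (theta - mu)**2 / (2 * sigma**2))

def per_example_cost(theta, mu, rho, log_prior, log_lik, n_examples):
    """Per-example cost: the KL-like term is spread over the |D| examples,
    minus the negative-sampling log-likelihood of this example."""
    return (log_q(theta, mu, rho) - log_prior) / n_examples - log_lik
```

Because the noise ε is drawn outside the deterministic transformation, the gradient of this cost w.r.t. (μ, ρ) can flow through the sampled θ, which is the key trick behind Bayes-by-Backprop.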

Parametrized Conditional Word Embedding
An issue with the approach described so far is that the number of parameters grows linearly in both the size of the vocabulary and the number of covariate partitions, i.e., O(|V| × |C|), where C is the set of all partitions. This effectively excludes any potential sharing of structure underlying words across different covariate values and decreases the number of examples per parameter. To avoid this issue, we use a single parametrized function to compute the variational parameters φ of each conditional word embedding θ_{v|x}. For each covariate-word pair v|x, there are two variational parameters, μ_{v|x} and σ_{v|x}. We use an MLP without any hidden layer and with a tanh output layer, i.e., an affine transformation followed by pointwise tanh, that takes as input both a global word vector μ^{(v)}_v and a covariate vector μ^{(x)}_x and outputs

$$\mu_{v|x} = \tanh\!\left(W \left[\mu^{(v)}_v; \mu^{(x)}_x\right] + b\right),$$

where ψ = {W, b} are the parameters of this mean-transformation network. The diagonal covariance σ_{v|x} is parametrized as σ_{v|x} = log(1 + exp(ρ_v)), where ρ_v is a parameter shared across all covariate configurations. We then minimize F w.r.t. the parameters ψ, μ^{(v)}, μ^{(x)} and ρ. This approach of parametrized conditional word embeddings significantly reduces the number of parameters from O(|V| × |C|) to O(|V| + |C|), while maintaining posterior uncertainty of the estimated conditional word embedding θ_{v|x}.
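A sketch of this parametrized mapping, assuming ψ consists of a single weight matrix W and bias b (the paper only specifies an affine map followed by tanh):

```python
import numpy as np

def conditional_mean(mu_word, mu_cov, W, b):
    """mu_{v|x} = tanh(W [mu_word; mu_cov] + b): an affine map of the concatenated
    global word vector and covariate vector, followed by pointwise tanh."""
    z = np.concatenate([mu_word, mu_cov])
    return np.tanh(W @ z + b)

def conditional_sigma(rho_word):
    """sigma_{v|x} = log(1 + exp(rho_v)): shared across all covariate values of v."""
    return np.log1p(np.exp(rho_word))
```

Note that W and b are shared across all word-covariate pairs, which is where the reduction from O(|V| × |C|) to O(|V| + |C|) parameters comes from.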

Divergences for Word Embeddings
As we estimate the approximate posterior uncertainty of conditional word vectors, we can estimate richer relations between vectors (e.g., KL divergence) in addition to more common comparisons (e.g., cosine or Euclidean distance). Moreover, we can explicitly test whether two vectors are (un)likely to have the same mean in the population. Below, we introduce how Hotelling's T² may be used for word-drift or across-word hypothesis testing.
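For diagonal-covariance Gaussians, the KL divergence used throughout our comparisons has a simple closed form, sketched here in numpy (the function name is ours):

```python
import numpy as np

def kl_diag_gauss(mu_i, s2_i, mu_j, s2_j):
    """KL( N(mu_i, diag(s2_i)) || N(mu_j, diag(s2_j)) ) in closed form.
    s2_i and s2_j are vectors of per-dimension variances."""
    return 0.5 * np.sum(np.log(s2_j / s2_i) + (s2_i + (mu_i - mu_j)**2) / s2_j - 1.0)
```

Unlike cosine or Euclidean distance, this measure grows when either the means separate or the variances disagree, and it is asymmetric, which is why a symmetrized version is used when building pairwise comparison matrices.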
Hotelling's T² Statistic We use the estimated posterior mean vector μ_{v|x} and the diagonal covariance vector σ_{v|x} of two word-covariate pairs v|x_i and v|x_j to compute the T² statistic, as if they were estimates from two sets of samples:

$$T^2 = \frac{n_i n_j}{n_i + n_j}\left(\mu_{v|x_i} - \mu_{v|x_j}\right)^\top \mathrm{diag}(s)^{-1} \left(\mu_{v|x_i} - \mu_{v|x_j}\right).$$

The pooled (diagonal) covariance s of a word pair is computed as

$$s = \frac{(n_i - 1)\,\sigma_i^2 + (n_j - 1)\,\sigma_j^2}{n_i + n_j - 2},$$

where n_i and n_j are the numbers of occurrences of v|x_i and v|x_j in D, respectively.¹ Unlike other divergence measures, this T² statistic explicitly takes into account the frequencies of the word-covariate pairs. Under general conditions, e.g., when D is large, the sampling distribution of T² converges to a χ²_d distribution (Ito, 1956) with d equal to the embedding dimensionality. This allows us to statistically test null hypotheses such as Diff(v_i|x, v_j|x) = 0 and Diff(v|x_i, v|x_j) = 0.

Model and Learning For each word in the corpus, we consider six surrounding words as its context. The size of the embedding is set to 100. We use six negative samples to compute Eq. (1). We use Adagrad (Duchi et al., 2011) with an initial learning rate of 0.05.² For other hyperparameters, see the supplementary material. We refer to our approach as BBP. For comparison, we also train analogous DBE embeddings using code from the authors.
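The T² statistic described above and its χ² tail probability can be computed directly; the closed-form survival function below assumes an even dimensionality d, which holds for the d = 100 embeddings used here (helper names are ours):

```python
import math
import numpy as np

def hotelling_t2_diag(mu_i, mu_j, s2_i, s2_j, n_i, n_j):
    """Two-sample Hotelling T^2 with a pooled diagonal covariance, treating the
    posterior (mean, variance) pairs as if they were sample estimates."""
    s2 = ((n_i - 1) * s2_i + (n_j - 1) * s2_j) / (n_i + n_j - 2)  # pooled variances
    diff = mu_i - mu_j
    return (n_i * n_j) / (n_i + n_j) * np.sum(diff**2 / s2)

def chi2_sf_even(x, d):
    """P(chi^2_d > x) in closed form, valid only for even d."""
    return math.exp(-x / 2) * sum((x / 2)**k / math.factorial(k) for k in range(d // 2))
```

A drift test Diff(v|x_i, v|x_j) = 0 then amounts to computing T² for the two conditional embeddings of v and rejecting the null when chi2_sf_even(t2, d) falls below the chosen α.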

Result and Analysis
Impact of Covariates To demonstrate how document covariates influence conditional word embeddings, we compare the vector for "currency" against those for "sterling" and "pound" according to the KL divergence in each decade, shown in Fig. 1. In each time period we report the rank of each word w.r.t. "currency". We observe that pivotal points for both "sterling" and "pound" occur in the 1970s, which coincides with the moment the UK began to abandon the 'sterling area' (Part III in Schenk, 2010). As such, this financial policy appears to have encouraged semantic drift of the word "pound" towards "currency". See Sec. D of the supplementary material for more details.
We also show a few more examples in Figure 2 and Figure 3 from the Dictionary Induction section below.
Figure 2: The ranks between "market" and "money" across the decades according to KL divergence.
Figure 3: The ranks between "benefit" and "children" across the decades according to KL divergence.

Dictionary Induction As a quantitative comparison between the proposed approach and the DBE, we take a dictionary of (British) political terms by Laver and Garry (2000) and look at the
average pair-wise, directional rank in each category ("pro-state", "con-state" and "neutral-state"). We only consider the 2,000 most frequent words in the vocabulary and embeddings with the covariate (decade) set to the 2000s. We observe that the proposed model using KL divergence has significantly smaller average pair-wise ranks in "pro-state" (4052 vs. 5047) and "con-state" (2578 vs. 3758), while performing slightly worse than DBE in the "neutral-state" category (5414 vs. 5031), suggesting that the proposed approach clusters words from the same semantic group into closer neighborhoods than DBE. Furthermore, we pick the 5 most frequent words from "pro-state" and "con-state" and show their average pair-wise rankings and percentiles in Table 1. Out of 25K words, our proposed model ranks most of the chosen words within the top 10th percentile.
Statistical Word Drift Analysis Our BBP approach permits meaningful downstream hypothesis tests of word drift, i.e., Diff(v|x_i, v|x_j) = 0, and across-word similarity, i.e., Diff(v_i, v_j) = 0. Among the 2,000 most frequent words in our sample, with a p-value threshold of α = 0.1, only eleven words were deemed to have had significant drift, including "council", "labour", "european" and "defence". Sec. E of the supplementary material includes the full lists from this drift analysis.
In Table 1, we show results from five illustrative tests, drawn from the top-100 word drifts estimated by the DBE model. We report words' drift ranks in DBE against their corresponding L2 distance, cosine similarity, KL divergence and Hotelling T² using the embeddings estimated by our BBP model. Based on the distance metrics that ignore the covariance matrix, these words do not appear to change much over time: their cosine similarities are fairly large and their L2 distances are relatively small, with little variation across the five words. This suggests their mean vectors are projected into close space between the 1940s and the 2000s. However, once their uncertainty is taken into account, we observe greater variation in both the KL divergence and the T² statistic. For example, "council" has only the eighth-largest drift in DBE by L2, but shows the largest T² statistic among the five words and is statistically significant at α = 0.01. Likewise, the largest DBE drift ("uk") is insignificant once the covariance structure is taken into account.

Cosine Similarity vs. KL Divergence In contrast to cosine distance, our proposed method allows computation of the KL divergence (KLD) between two vectors that takes into account their covariance. Figure 2 presents semantic graphs estimated in the spirit of Hamilton et al. (2016a). The set of words is given by the union of the 10 nearest neighbors, measured by cosine similarity and KLD, of the five seed words "currency", "british", "health", "trade" and "labour". This results in 130 unique words including the seed words, for which we compute the pairwise KLD matrix W_KL and the pairwise cosine similarity matrix W_cos. We symmetrize W_KL as W_KL ← (W_KL + W_KL^⊤)/2. Both W_KL and W_cos have dimensions 130 × 130. Edge weights in Figures 2.A and 2.B are computed by taking a sigmoid transformation of normalized entries of W_KL, i.e., σ(normalize(w^KL_ij)). Edge weights in 2.C and 2.D are computed as arccos(w^cos_ij), following Hamilton et al. (2016a).
Edges with weights below the 90th percentile are dropped for visual clarity. Note that with the same number of edges eliminated, the KLD graphs appear more clustered around the seed words, implying that incorporating the covariance matrix creates useful segregation of words within local contexts; graphs constructed via cosine similarity disperse edge weights in a more diffuse manner.
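The graph-construction steps above (symmetrizing W_KL, sigmoid of normalized entries, percentile pruning) might be sketched as follows; the helper names are ours:

```python
import numpy as np

def symmetrize(W):
    # KLD is asymmetric, so average W with its transpose
    return (W + W.T) / 2.0

def kl_edge_weights(W_kl):
    """Edge weights sigma(normalize(w_ij)): z-score the symmetrized KLD matrix,
    then squash through a logistic sigmoid (assumes W_kl is not constant)."""
    W = symmetrize(W_kl)
    z = (W - W.mean()) / W.std()
    return 1.0 / (1.0 + np.exp(-z))

def prune_below_percentile(W, pct=90):
    """Zero out edges below the given percentile of weights, for visual clarity."""
    thresh = np.percentile(W, pct)
    return np.where(W >= thresh, W, 0.0)
```

Pruning at a fixed percentile guarantees the KLD and cosine graphs keep the same number of edges, making their visual density directly comparable.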

T²-based Significance In the context of uncertainty-aware word embeddings, we can use the T² statistic to filter additional words out of a nearest-neighbor set. For instance, in Figures 2.B and 2.D, we drop edges for word pairs that fall below the 90th percentile of computed T² statistics. Filtering with Hotelling's T² results in sparser semantic graphs.

Conclusion
We proposed an uncertainty-aware conditional word embedding model that combines two ideas: (1) variational Bayesian learning for estimating parameter uncertainty, and (2) structured embeddings conditioned on covariates. This provides a principled way to conduct hypothesis tests of word vectors in various forms. We evaluated various aspects of the proposed approach on U.K. Parliament speech records from 1935-2012. We believe the proposed approach will serve as a more rigorous tool in social science and other domains.

Acknowledgments
KC thanks eBay, TenCent, NVIDIA and CIFAR for their support. RH thanks the MINDS research group at the Information Sciences Institute of the University of Southern California for its support.