Are"Undocumented Workers"the Same as"Illegal Aliens"? Disentangling Denotation and Connotation in Vector Spaces

In politics, neologisms are frequently invented for partisan objectives. For example, "undocumented workers" and "illegal aliens" refer to the same group of people (i.e., they have the same denotation), but they carry clearly different connotations. Examples like these have traditionally posed a challenge to reference-based semantic theories and led to increasing acceptance of alternative theories (e.g., Two-Factor Semantics) among philosophers and cognitive scientists. In NLP, however, popular pretrained models encode both denotation and connotation as one entangled representation. In this study, we propose an adversarial neural network that decomposes a pretrained representation into independent denotation and connotation representations. For intrinsic interpretability, we show that words with the same denotation but different connotations (e.g., "immigrants" vs. "aliens", "estate tax" vs. "death tax") move closer to each other in denotation space while moving further apart in connotation space. For extrinsic application, we train an information retrieval system with our disentangled representations and show that the denotation vectors improve the viewpoint diversity of document rankings.


Introduction
Language carries information through both denotation and connotation. For example, a reporter writing an article about the leftmost wing of the Democratic party can choose to refer to the group as "progressives" or as "radicals". The word choice does not change the individuals referred to, but it does communicate significantly different sentiments about the policy positions discussed. This type of linguistic nuance presents a significant challenge for natural language processing systems, most of which fundamentally assume words to have similar meanings if they are surrounded by similar word contexts. Such an assumption risks confusing differences in connotation for differences in denotation, or vice versa. For example, using a common skip-gram model (Mikolov et al., 2013) trained on a news corpus (described in §3.2), Figure 1 shows the nearest neighbors of "government-run healthcare" and "economic stimulus". The resulting t-SNE clusters are influenced as much by policy denotation (shapes) as they are by partisan connotation (colors). Using these entangled representations in applications such as information retrieval could have pernicious consequences, such as reinforcing ideological echo chambers and political polarization. For example, a right-leaning query like "taxpayer-funded healthcare" could make one equally (if not more) likely to see articles about "totalitarian" and "horror stories" than about "affordable healthcare".

Figure 1: Nearest neighbors of government-run healthcare (triangles) and economic stimulus (circles). Note that words cluster as strongly by policy denotation (shapes) as by partisan connotation (colors); namely, pretrained representations conflate denotation with connotation. Plotted by t-SNE with perplexity = 10.
To address this, we propose classifier probes that measure the denotation and connotation information in a given pretrained representation, and we arrange the probe losses in an adversarial setup in order to decompose the entangled pretrained meaning into distinct denotation and connotation representations (§4). We evaluate our model intrinsically and show that the decomposed representations effectively disentangle these two dimensions of semantics (§5). We then apply the decomposed vectors to an information retrieval task and demonstrate that our method improves the viewpoint diversity of the retrieved documents (§6). All data, code, preprocessing procedures, and hyperparameters are included in the appendix and our GitHub repository.

Philosophical Motivation
Consider the following two sentences: "Undocumented workers are undocumented workers" vs. "Undocumented workers are illegal aliens". Frege (1892) famously used sentence pairs like these, which have the same truth conditions but clearly different meanings, to argue that meaning is composed of two components: "reference", which is some set of entities or state of affairs, and "sense", which accounts for how the reference is presented, encompassing a wide range of aspects such as speaker belief and social convention. In contemporary philosophy of language, the sense and reference argument has evolved into debates over semantic externalism vs. internalism and referential vs. conceptual role semantics. Externalists and referentialists continue the truth-conditional tradition and emphasize meaning as some entity to which one is causally linked, invariant of one's psychological encoding of the referent (Putnam, 1975; Kripke, 1972). On the other hand, conceptual role semanticists emphasize meaning as the inferences one can draw from a lexical concept, deemphasizing the exact entities which the concept includes (Greenberg and Harman, 2005). Naturally, a popular position takes the Cartesian product of both schools of meaning (Block, 1986; Carey, 2009). This view is known as Two-Factor Semantics, and it forms the inspiration for our work. To avoid confusion with definitions from the existing literature, we use the terms "denotation" and "connotation" rather than "reference" and "concept" when discussing our models in this paper.

Data
We assume that it is possible to disentangle the two factors of semantics by grounding language to different components of the non-linguistic context. In particular, our approach assumes access to a set of training sentences, each of which grounds to a denotation d (which approximates reference) or a connotation c (which approximates conceptual inferences). We require at least one of d or c to be observed, but we do not require both (elaborated in §4.3). In this work, d and c are discrete symbols. However, our model could be extended to settings in which d and c are feature vectors.
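To make the assumed supervision concrete, here is a minimal sketch of the training data schema; the example sentences and labels are hypothetical illustrations, not drawn from the released corpora:

```python
# A minimal sketch of the assumed training data: each sentence grounds to a
# denotation label d, a connotation label c, or both; at least one of the
# two must be observed.
from typing import NamedTuple, Optional

class Example(NamedTuple):
    sentence: str
    denotation: Optional[str]   # e.g., a bill title or policy topic
    connotation: Optional[str]  # e.g., the speaker's party

train = [
    Example("We must repeal the death tax.", "estate tax", "R"),
    Example("The estate tax only affects the wealthiest families.", "estate tax", "D"),
    Example("This stimulus will create jobs.", "economic stimulus", None),  # c unobserved
]
```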
While we are interested in learning lexical-level denotation and connotation, we train on sentence- and document-level speaker and reference labels. We argue that this emulates a more realistic form of supervision. For example, we often have metadata about a politician (e.g., party and home state) when reading or listening to what they say, and we are able to aggregate this information to make lexical-level judgments about denotation and connotation.
We experiment on two corpora: the Congressional Record (CR) and the Partisan News Corpus (PN), which differ in linguistic style, partisanship distribution (Figure 2), and the available labels for grounding denotation and connotation.

Figure 2: Vector spaces that result from training vanilla word2vec on the Congressional Record (left) and Partisan News (right). We evaluate on both corpora, but note that the Partisan News corpus better exemplifies the problem we target, where words cluster strongly according to ideological stance.

Congressional Record
The Congressional Record (CR) is the official transcript of the floor speeches and debates of the U.S. Congress; we use speeches from 1981 through 2011 (2011 is the latest session available for the Bound Edition of CR, and 1981 is chosen because the Reagan Administration marks the last party realignment, so we can expect connotation signals to remain reasonably consistent over this period). In order to assign labels that can be used as proxies of denotation, we weakly label each sentence with both its legislative topic and the specific bill being debated. [5] To do this, we collected a list of congressional bills from the U.S. Government Publishing Office (https://www.govinfo.gov/bulkdata/BILLSTATUS). For our purposes, this data provides the congressional session, policy topic, and an informal short title for each bill. We perform a regular expression search for each bill's short title among the speeches in its corresponding congressional session. For bills that are mentioned at least 3 times, we assume that the speech in which the bill was mentioned, as well as the 3 subsequent speeches, refer to that bill, and we label each such speech with the title and the policy topic of that bill. Speeches that are not labeled by this process are discarded. Additional details and examples are given in Appendix D.

[5] We also experimented with collecting more precise reference labels using the entity linkers of both Google Cloud and Facebook Research on a variety of corpora. However, the results of entity linking were too poor to justify pursuing this direction further. We would love to see future work that devises creative ways to include better denotation grounding.
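As an illustration, here is a minimal sketch of this weak-labeling heuristic; the data structures (`speeches` as a chronologically ordered list of speech strings from one session, `bills` as a mapping from short title to policy topic) are hypothetical stand-ins for our actual preprocessing:

```python
import re

def weak_label(speeches, bills, window=3, min_mentions=3):
    # Label each speech with the (title, topic) of the bill it discusses,
    # following the heuristic described above.
    labels = [None] * len(speeches)
    for title, topic in bills.items():
        pattern = re.compile(re.escape(title), flags=re.IGNORECASE)
        hits = [i for i, s in enumerate(speeches) if pattern.search(s)]
        if len(hits) < min_mentions:
            continue
        for i in hits:  # the matching speech plus the 3 subsequent speeches
            for j in range(i, min(i + window + 1, len(speeches))):
                labels[j] = (title, topic)
    return labels  # speeches left as None are discarded downstream
```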

Partisan News Corpus
Hyperpartisan News is a set of web articles collected for a 2019 SemEval task (Kiesel et al., 2019). It consists of articles scraped from the political sections of 383 English-language news outlets. Each article is associated with a publisher which, in turn, has been manually labeled with a partisan leaning on a five-point scale: "left, center-left, center, center-right, right". Upon manual inspection, we find that the distinctions between right vs. center-right and left vs. center-left are prone to annotation artifacts. Therefore, we collapse these labels into a three-point scale, and we refer to this 3-class corpus as the Partisan News (PN) corpus throughout. No denotation label is available for this corpus.

Model
Section 4.1 describes our model architecture. Sections 4.2 and 4.3 then describe specific instantiations that we use in our experiments. These variants are summarized in Table 1.

Overall Architecture
Let V_deno, V_conno, and V_pretrained be the denotation, connotation, and pretrained vector spaces respectively. Our model consists of two adversarial decomposers: a denotation decomposer D : V_pretrained → V_deno and a connotation decomposer C : V_pretrained → V_conno. The goal is to train D to preserve as much denotation information as possible while removing as much connotation information as possible from the pretrained representation. Symmetrically, C will preserve as much connotation as possible while removing as much denotation as possible from the pretrained representation. For clarity, let us focus on D for now. To measure how much denotation or connotation structure is encoded in V_deno, we use two classifier probes trained to predict the denotation label d or the connotation label c, which yield two cross-entropy losses, L_deno.probe and L_conno.probe, respectively. In order to encourage the decomposer D to preserve denotation and remove connotation, we define its loss function as

L_D = L_deno.probe + σ(L_conno.adversary)

where σ is the sigmoid function and

L_conno.adversary = KL(conno. probe predicted dist. ∥ uniform dist.)

The adversarial loss L_conno.adversary rewards D for removing connotation structure such that the probe's prediction becomes random. Meanwhile, the probes themselves are still gradient-updated only with the usual cross-entropy losses, extracting and measuring as much denotation or connotation information as possible, independent of the decomposer D. As shown in Figure 3, C is set up symmetrically, so it is trained with the usual classification loss from its connotation probe and a KL-divergence adversarial loss from its denotation probe.
Finally, we impose a reconstruction probe R with the loss function

L_recon = dist(R(V_deno ⊕ V_conno), V_pretrained)

where ⊕ denotes concatenation and dist is a distance between the reconstructed vector and the original pretrained vector. This enforces that the combination of the denotation and connotation subspaces preserves all the semantic meaning of the original pretrained space, as opposed to merely encoding predictive features that maximize probe accuracies. (We verified in ablation experiments that this collapse is in fact what happens without R.) Assembling everything together, the decomposers D and C are jointly trained with L_Joint = L_D + L_C + L_recon.
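To make the setup concrete, here is a minimal PyTorch sketch of the joint objective. The layer sizes, the label counts, the MSE form of the reconstruction distance, and the use of separate optimizers are our assumptions for illustration; only the overall loss structure follows the equations above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(dim_in, n_classes, hidden=256, depth=4):
    # A deep classifier probe (see §4.2: shallow probes under-measure).
    layers, d = [], dim_in
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, n_classes))
    return nn.Sequential(*layers)

dim, n_deno, n_conno = 300, 41, 2            # hypothetical sizes
D = nn.Linear(dim, dim)                      # denotation decomposer
C = nn.Linear(dim, dim)                      # connotation decomposer
deno_probe_D, conno_probe_D = mlp(dim, n_deno), mlp(dim, n_conno)
conno_probe_C, deno_probe_C = mlp(dim, n_conno), mlp(dim, n_deno)
R = nn.Linear(2 * dim, dim)                  # reconstruction probe

def kl_to_uniform(logits):
    # KL(predicted dist. || uniform) = log K - entropy of the prediction;
    # driving this toward zero makes the adversarial probe guess at chance.
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(-1)
    return (torch.log(torch.tensor(float(logits.size(-1)))) - entropy).mean()

def decomposer_loss(x, d, c):
    # x: (batch, dim) pretrained sentence vectors; d, c: label tensors.
    # In the full setup, separate optimizers update the probes with plain
    # cross-entropy only, while the decomposers and R minimize this loss.
    v_deno, v_conno = D(x), C(x)
    loss_D = F.cross_entropy(deno_probe_D(v_deno), d) \
        + torch.sigmoid(kl_to_uniform(conno_probe_D(v_deno)))
    loss_C = F.cross_entropy(conno_probe_C(v_conno), c) \
        + torch.sigmoid(kl_to_uniform(deno_probe_C(v_conno)))
    loss_recon = F.mse_loss(R(torch.cat([v_deno, v_conno], -1)), x)
    return loss_D + loss_C + loss_recon
```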
In principle, D and C can be a variety of sentence encoders. In this work, we implement them as simple mean bags of static embeddings, for two reasons: First, it is difficult to interpret contextualized embeddings for an individual word (especially for the type of analysis we present in §5). Second, many of the interesting, heavily connotative expressions consist of multiple words (e.g., "socialized medicine", "universal healthcare"), and compositionality is still far from being solved. Therefore, we conjoin multiword expressions with underscores so that we can model them in the same way as atomic words, as sketched below.
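A minimal sketch of this preprocessing step, assuming `mwe_list` is a hypothetical list of multiword expressions mined from the corpus:

```python
def conjoin_mwes(text, mwe_list):
    # Replace longer expressions first so that, e.g., a longer phrase is not
    # partially matched by a shorter one it contains.
    for mwe in sorted(mwe_list, key=len, reverse=True):
        text = text.replace(mwe, mwe.replace(" ", "_"))
    return text

print(conjoin_mwes("they call it socialized medicine",
                   ["socialized medicine", "universal healthcare"]))
# -> "they call it socialized_medicine"
```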

Connotation Probes
We exploit the fact that much of the debate in American politics today is (sadly) reducible to partisan division (Lee, 2009; Klein, 2020); it is therefore safe to define the connotation label of every document to be simply the partisanship of the speaker. Of course, connotation in the general domain can encompass much more than liberals vs. conservatives, and in future work we hope to extend this to multifaceted connotations that are more faithful to the semantic theories discussed in §2. For now, in CR, connotation is the speaker's party, and in PN, connotation is the partisan leaning of the publisher.
Again, in principle, the probes can be a variety of neural modules. In this work, we implement the connotation probes as 4-layer MLPs. We experimented with the more popular 1-layer MLP and single-linear-layer probes. However, when the probes are shallow, the model converges before most of the information that should be removed is in fact removed. For example, when we apply a 4-layer MLP probe to a decomposed representation trained against a 1-layer probe, the 4-layer probe's accuracies are just as good as if the representation had not been decomposed at all. That is, our experiments suggest that the probes have to be sufficiently complex in order to truly measure what denotation/connotation structure is removed or preserved in a decomposed representation.

Denotation Probes
For the CR corpus, we experimented with two types of denotation labels: the specific piece of legislation under discussion and the general policy topic under discussion. In CR BILL, the label is one of the 1,029 short titles of bills. In CR TOPIC, the label is one of 41 policy topics. Both types of labels are annotated as described in §3.1. For the same reasons discussed in the previous paragraph, we implement the denotation probes as 4-layer MLPs.
Additionally, as mentioned in Footnote 5, precise denotation labels are difficult to collect, so we also experimented with more realistic settings (CR PROXY and PN PROXY) which do not use any denotation labels. In this case, we return to the theories discussed in §2 and note that, because semantic meaning can be partitioned into two components, we may assume that pretrained representations encode the overall meaning and that any aspects of meaning not explained by our connotation labels must belong to denotation. [8] Thus, we may continue to use the pretraining objective (in this implementation, skip-gram-style context word prediction) as a proxy probe for denotation information and rely on the adversarial connotation probe to remove connotation structure in the denotation space.

[8] We acknowledge that this feels a bit backward: Ideally, in a Fregean sense, everything not explained by reference is left over to sense, rather than the converse. However, we are constrained by the available grounding. In a different setting, if we had explicit referential labels but no speaker information, we could use skip-gram as the proxy for connotation instead.
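A minimal sketch of this proxy probe, assuming skip-gram with negative sampling as the pretraining objective; the vocabulary size, embedding dimension, and negative-sample count are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 50_000, 300                 # hypothetical sizes
context_emb = nn.Embedding(vocab_size, dim)   # output ("context") embeddings

def skipgram_proxy_loss(v_deno, pos_ctx, neg_ctx):
    # v_deno:  (batch, dim) denotation vectors of center words
    # pos_ctx: (batch,) observed context word ids
    # neg_ctx: (batch, k) sampled negative context word ids
    pos_score = (v_deno * context_emb(pos_ctx)).sum(-1)                  # (batch,)
    neg_score = torch.bmm(context_emb(neg_ctx),
                          v_deno.unsqueeze(-1)).squeeze(-1)              # (batch, k)
    loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_score).sum(-1)
    return loss.mean()
```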

Intrinsic Evaluation
We confirm that our decomposed denotation and connotation spaces reflect their intended purposes by measuring their structures with homogeneity metrics (§5.1) on three sets of evaluation words (§5.2), as well as by inspecting their t-SNE clusters.

Homogeneity Metrics
To quantify how much denotation or connotation structure is encoded in a vector space, we define the homogeneity (h_deno, h_conno) of a given space to be the average proportion of a query word's top-k nearest neighbors which share the same denotation/connotation label as the query's own. In particular, we are interested in comparing the deltas of V_deno and V_conno against V_pretrained. For V_deno, we hope to see h_deno increase relative to the pretrained space and h_conno decrease relative to the pretrained space. For V_conno, we hope to observe movement in the opposite direction.
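A minimal NumPy sketch of this metric; the value of k used in the paper is elided in the source, so k = 10 here is an assumption:

```python
import numpy as np

def homogeneity(vectors, labels, k=10):
    # vectors: (V, dim) word embeddings for an evaluation word set;
    # labels:  their word-level denotation or connotation labels.
    # Returns the mean fraction of each word's top-k cosine-nearest
    # neighbors that share the word's own label.
    labels = np.asarray(labels)
    X = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)   # a word is not its own neighbor
    fractions = []
    for i in range(len(X)):
        nbrs = np.argpartition(-sims[i], k)[:k]
        fractions.append(np.mean(labels[nbrs] == labels[i]))
    return float(np.mean(fractions))
```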
As motivated in §3, our model is trained with labels at the sentence level, while homogeneities are evaluated at the word level. We assign a word's connotation label to be simply the party that uses the word most often. For CR BILL and CR TOPIC, we assign the word-level denotation label to be the bill or the topic under which the word occurs most often. For the PN corpus, no ground-truth denotation label is available, so we cannot directly measure h_deno, but we present an alternative evaluation in §5.3. Table 3 shows the baseline h_deno and h_conno scores for embeddings pretrained on each corpus, evaluated over two test sets of words (described in the next section).
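A minimal sketch of this majority-vote label assignment, assuming `labeled_sentences` is a hypothetical iterable of (tokens, party) pairs:

```python
from collections import Counter, defaultdict

def word_connotation_labels(labeled_sentences):
    # A word's connotation label is the party that uses it most often;
    # word-level denotation labels (bill or topic) are assigned analogously.
    counts = defaultdict(Counter)
    for tokens, party in labeled_sentences:
        for tok in set(tokens):
            counts[tok][party] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
```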

Test Sets
We evaluate on words sampled in three different ways: Random is a random sample of 500 words drawn from each corpus's vocabulary that occur at least 100 times, in order to filter out web-scraping artifacts, e.g., URLs and author bylines. High Partisan is a sample of around 300 words from each corpus's vocabulary that occur at least 100 times and have high partisan skew; namely, words that are uttered by a single party more than 70% of the time. This threshold is chosen based on manual inspection, but we have evaluated on other thresholds as well with no significant difference in results. The High Partisan set is then bisected into two disjoint sets used as dev and test data for model selection. All word sets sampled at different ratios are included in our released data. Finally, Luntz-esque is a small set of manually-vetted pairs of words that are known to have the same denotation but different connotations, drawn from The New American Lexicon (Luntz, 2006), a famous report from focus-group research which explicitly prescribes word choices that are empirically favorable to the Republican party line. [11]

Figure 4: Neighborhood of "deficit" in V_pretrained, V_deno, and V_conno of PN PROXY. Arrows point to the top-10 nearest neighbors. Colors reflect partisan leaning, where more opaque dots are more heavily partisan words. Note that in V_pretrained and in V_conno, the nearest neighbors are all Republican-leaning words, whereas they are balanced in V_deno.

Table 2: Homogeneity scores of the decomposed spaces, V_deno (and ∆ with V_pretrained) and V_conno (and ∆ with V_pretrained).

[11] This is a leaked report circulated via a Google Drive link which has since been taken offline. A copy is included in our released data.

Results
Overall, we see that our V_deno and V_conno spaces demonstrate the desired shifts in homogeneity and structure, as intuitively illustrated by Figure 4. Quantitatively, Table 2 enumerates the homogeneity scores of both decomposed spaces as well as their directions of change relative to the pretrained space. For V_deno, we see that denotation homogeneity h_deno consistently increases and connotation homogeneity h_conno consistently decreases, as desired. Conversely, for V_conno, we see h_conno increases and h_deno decreases, as desired. Further, we see that the magnitude of change is greater across the board for the highly partisan words than for random words, which is expected, as the highly partisan words are usually loaded with more denotation or connotation information that can be manipulated. The only exception is CR PROXY's V_deno, which sees no significant movement in either direction. This is understandable because CR PROXY is not trained with ground-truth denotation labels. (We evaluate it with the labels from CR BILL.)
As a means of closer inspection, we compute the cosine similarities of word pairs in our Luntz-esque analysis set. Because these pairs are known political euphemisms (e.g., "estate tax" and "death tax", which refer to the same tax policy but imply opposite partisanship), we expect them to become more cosine-similar in V_deno and less cosine-similar in V_conno. As shown in Table 4, even without ground-truth denotation labels, the V_deno of CR PROXY and PN PROXY still preserve the pretrained denotation structure reasonably well. For pairs that do see a decrease in V_deno similarity, the errors are far smaller than the corresponding correct reduction in V_conno similarity. For example, "political speech" and "campaign spending" experience a small (−0.02) decrease in denotation similarity; in exchange, the model correctly recognizes that the two terms imply opposite ideologies (−0.81 in connotation similarity) on the issue of whether unlimited campaign donation is shielded by the First Amendment as "political speech".
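A minimal sketch of this pairwise comparison; `spaces` is a hypothetical dict mapping each space's name to a {word: vector} lookup:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euphemism_shift(spaces, w1, w2):
    # Cosine similarity of a euphemism pair in each of the three spaces.
    return {name: cosine(emb[w1], emb[w2]) for name, emb in spaces.items()}

# For a pair like ("estate_tax", "death_tax") we expect the similarity to
# rise (or hold steady) in the denotation space and to drop sharply in the
# connotation space, relative to the pretrained space.
```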

Extrinsic Evaluation
Ultimately, our work aims to be more than just a theoretical exercise; it also aims to enable greater control over how sensitive NLP systems are to denotation vs. connotation in downstream tasks. To this end, we construct an ad hoc information retrieval task. We compare a system built on top of V_pretrained to systems built on top of V_deno and V_conno, in terms of both the quality of the rankings and the ideological diversity represented among the top results.

Setup
We focus only on PN PROXY for this evaluation since it best matches the setting where we would expect to apply these techniques in practice: (1) We cannot always assume access to discrete denotation labels.
(2) Language in the PN corpus is strongly influenced by ideology (as shown in Figure 2). To generate a realistic set of queries, we start with 12 seed words from our vocabulary, chosen based on a list of the most important election issues for Democratic and Republican voters according to a recent Gallup Poll. This results in the following list: "economy, healthcare, immigration, women's rights, taxes, wealth, guns, climate change, foreign policy, supreme court, tariffs, special counsel". Then, for each seed word, we take the 5 left-leaning seeds to be the 5 nearest neighbors according to V_pretrained, filtered to words which occur at least 100 times and for which at least 70% of occurrences appear in left-leaning articles; we similarly choose 5 right-leaning seeds. We then submit each partisan seed to the Bing Autosuggest API and retrieve 10 suggestions each. We manually filter the list of queries to remove those that do not reflect the intended word sense (e.g., "VA" leading to queries about Virginia rather than the Veterans Administration) and those which are not well matched to our document collection (e.g., queries seeking dictionary definitions, job openings, or specific websites such as Facebook). Our final list contains 410 queries: 216 left-leaning and 194 right-leaning. Table 5 shows several examples; the full list is included in the supplementary material.
Wealth: globalist agenda • globalist leaders • extreme poverty rates • romneys ties to burisma
Women's Rights: title ix impact • safe spaces and snowflakes • anti-choice zealots • marriage equality court case
Immigration: illegal immigrants at southern border • illegals caught voting 2016 • drug policy fbi • opioid crisis afghanistan

Table 5: Example right- and left-leaning queries generated using the procedure described.
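A minimal sketch of the partisan seed expansion step (the Bing Autosuggest call and the manual filtering are omitted); `neighbors`, `freqs`, and `left_share` are hypothetical lookups standing in for nearest neighbors in V_pretrained, corpus frequencies, and the fraction of a word's occurrences in left-leaning articles:

```python
def partisan_seeds(neighbors, freqs, left_share, side, n=5,
                   min_freq=100, skew=0.7):
    # neighbors: a seed word's nearest neighbors in V_pretrained, closest
    # first; returns up to n neighbors that are frequent and partisan-skewed.
    picked = []
    for w in neighbors:
        if w not in left_share or freqs.get(w, 0) < min_freq:
            continue
        share = left_share[w] if side == "left" else 1.0 - left_share[w]
        if share >= skew:
            picked.append(w)
        if len(picked) == n:
            break
    return picked
```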

Models
We generate a ranked list of documents for each query in a two-step manner: (1) We preselect the 5,000 most relevant documents according to a traditional BM25 model (Robertson et al., 1995) with default parameters. (2) This initial set of documents is then ranked using DRMM (Guo et al., 2016), a neural relevance-matching model for ad hoc retrieval. We train our retrieval model on the MS MARCO collection (Bajaj et al., 2016) of 550,000 queries and 8.8 million documents from Bing. To highlight the effect of pretrained vs. decomposed word embeddings, we freeze our word embeddings during retrieval model training. While step (1) is purely based on TF-IDF-style statistics and remains static across all compared conditions, step (2) is repeated for every proposed word embedding. This results in a ranked list of the top 100 most relevant documents for each query and word embedding.
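A minimal sketch of the two-step pipeline. The rank_bm25 package stands in for our BM25 implementation, and the function below shows only DRMM's core matching-histogram feature; the trained per-term scoring MLP and query-term gating network are omitted for brevity.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def preselect(query_tokens, docs_tokens, k=5000):
    # Step (1): BM25 with default parameters over tokenized documents.
    bm25 = BM25Okapi(docs_tokens)
    scores = bm25.get_scores(query_tokens)
    return np.argsort(-scores)[:k]

def matching_histogram(q_vec, doc_vecs, n_bins=30):
    # Step (2), DRMM-style featurization: a log-count histogram of cosine
    # similarities between one query term's embedding and every document
    # term's embedding.
    sims = (doc_vecs @ q_vec) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    hist, _ = np.histogram(sims, bins=n_bins, range=(-1.0, 1.0))
    return np.log1p(hist)
```

Swapping V_pretrained for V_deno or V_conno only changes the frozen vectors fed to `matching_histogram`; the rest of the pipeline is held fixed.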

Results
We compare the results of the DRMM retrieval model using different word embeddings in terms of the quality and diversity of viewpoints reflected in the ranked results. To measure diversity, we report the overall distribution of political leanings among the top 100 documents and the rank-weighted α-nDCG (Clarke et al., 2008) diversity score. For α-nDCG, higher values indicate a more diverse list of results whose political leanings are evenly distributed across result-list ranks. To measure ranking quality, we take a sample of 10 queries and collect the top 10 results returned by each model variant, for a total of 300 query/document pairs. We shuffle the list of pairs to avoid biasing ourselves, and we manually label each pair for whether or not the document is relevant to the query. We report Precision@10 estimated from these 10 queries. Figure 5 shows the overall party distributions. Table 6 reports the α-nDCG and P@10 metrics. We can see that models which use V_deno produce more diverse rankings than models which use V_pretrained, with V_deno producing an α-nDCG@100 of 0.94 vs. 0.92 for pretrained. This trend is especially apparent in the rankings returned for right-leaning queries: Under the pretrained model, 57% of the documents returned came from right-leaning news sources, whereas under the V_deno-based model, the results are nearly perfectly balanced between news sources. However, we do see a drop in precision when using V_deno. This is not surprising given the limitations observed in §5. If we had access to ground-truth denotation labels when training V_deno, we might expect these numbers to improve. This is a promising direction for future work.
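In our setting each retrieved document carries a single subtopic, its political leaning, so α-nDCG reduces to the following sketch. The greedy construction of the ideal ranking is the standard approximation (the exact ideal is intractable in general), and treating each document as relevant to exactly its leaning is our simplification:

```python
import numpy as np
from collections import Counter

def alpha_dcg(leanings, alpha=0.5, k=100):
    # leanings: political-leaning labels of the ranked documents, best first.
    # A document's gain decays by (1 - alpha) for each higher-ranked document
    # carrying the same leaning (Clarke et al., 2008).
    seen, score = Counter(), 0.0
    for rank, t in enumerate(leanings[:k]):
        score += (1 - alpha) ** seen[t] / np.log2(rank + 2)
        seen[t] += 1
    return score

def alpha_ndcg(leanings, alpha=0.5, k=100):
    # Normalize by a greedy approximation of the ideal (most diverse) order.
    pool, seen, ideal = Counter(leanings[:k]), Counter(), []
    while pool:
        t = max(pool, key=lambda s: (1 - alpha) ** seen[s])
        ideal.append(t)
        seen[t] += 1
        pool[t] -= 1
        if pool[t] == 0:
            del pool[t]
    return alpha_dcg(leanings, alpha, k) / alpha_dcg(ideal, alpha, k)
```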

Related Work
Embedding Augmentation. At the lexical level, there is a substantial literature that supplements pretrained representations with desired information (Faruqui et al., 2015; Bamman et al., 2014) or improves their interpretability (Murphy et al., 2012; Arora et al., 2018; Lauretig, 2019). However, existing works tend to focus on evaluating the dictionary definitions of words, less so on grounding words to specific real-world referents, and, to our knowledge, no major attempt has yet been made to interpret and manipulate the denotation and connotation dimensions of meaning as suggested by the semantic theories discussed in §2. While we do not claim to do full justice to conceptual role semantics either, this paper furnishes a first attempt at implementing a school of semantics introduced by philosophers of language and increasingly popular among cognitive scientists.
Style Transfer. At the sentence level, adversarial setups similar to ours have been previously explored in style transfer; for example, prior work (2019) converted informal English to formal English and Yelp reviews from positive to negative sentiment. The motivation for such models is primarily natural language generation and the personalization thereof. Additionally, our framing in terms of Frege's sense and reference adds clarity to the sometimes ill-defined problems explored in style transfer (e.g., treating sentiment as "style"). For example, "she is an undocumented immigrant" and "she is an illegal alien" have the same truth conditions but different connotations, whereas "the cafe is great" and "the cafe is terrible" have different truth conditions.

Modeling Political Language.
There is a wealth of work on computational approaches to modeling political language (Glavaš et al., 2019). Within NLP, such efforts tend to focus more on describing how language differs between political subgroups than on recognizing similarities in denotation across ideological stances, which is the primary goal of our work. Also highly related is work analyzing linguistic framing in news (Greene and Resnik, 2009; Choi et al., 2012; Baumer et al., 2015).
Echo Chambers and Search. The dangers of ideological "echo chambers" have received significant attention across the NLP, information retrieval, and social science research communities. Dori-Hacohen et al. (2015) discuss the challenges of deploying information retrieval systems in controversial domains, and Puschmann (2019) looks specifically at the effects of search personalization on election-related information. Many approaches have been proposed to improve the diversity of search results, typically by identifying search facets a priori and then training a model to optimize for diversity (Tintarev et al., 2018; Tabrizi and Shakery, 2019; Lunardi, 2019). In terms of linguistic analyses, Rashkin et al. (2017) and Potthast et al. (2018) analyze stylistic patterns that distinguish fake news from real news. Duseja and Jhamtani (2019) study linguistic patterns that distinguish whether individuals are within social media echo chambers.

Summary
In this paper, we describe the problem of pretrained word embeddings conflating denotation and connotation. We address this issue by introducing an adversarial network that explicitly represents the two properties as two different vector spaces. We confirm that our decomposed spaces encode the desired structure of denotation or connotation by both quantitatively measuring their homogeneity and qualitatively evaluating their clusters and their representation of well-known political euphemisms. Lastly, we show that our decomposed spaces are capable of improving the diversity of document rankings in an information retrieval task.