Don’t Blame Distributional Semantics if it can’t do Entailment

Distributional semantics has had enormous empirical success in Computational Linguistics and Cognitive Science in modeling various semantic phenomena, such as semantic similarity, and distributional models are widely used in state-of-the-art Natural Language Processing systems. However, the theoretical status of distributional semantics within a broader theory of language and cognition is still unclear: What does distributional semantics model? Can it be, on its own, a fully adequate model of the meanings of linguistic expressions? The standard answer is that distributional semantics is not fully adequate in this regard, because it falls short on some of the central aspects of formal semantic approaches: truth conditions, entailment, reference, and certain aspects of compositionality. We argue that this standard answer rests on a misconception: These aspects do not belong in a theory of expression meaning, they are instead aspects of speaker meaning, i.e., communicative intentions in a particular context. In a slogan: words do not refer, speakers do. Clearing this up enables us to argue that distributional semantics on its own is an adequate model of expression meaning. Our proposal sheds light on the role of distributional semantics in a broader theory of language and cognition, its relationship to formal semantics, and its place in computational models.


Introduction
Distributional semantics has emerged as a promising model of certain 'conceptual' aspects of linguistic meaning (e.g., Landauer and Dumais 1997; Turney and Pantel 2010; Baroni and Lenci 2010; Lenci 2018) and as an indispensable component of applications in Natural Language Processing (e.g., reference resolution, machine translation, image captioning; especially since ). Yet its theoretical status within a general theory of meaning, and of language and cognition more generally, is not clear (e.g., Lenci 2008; Erk 2010; Boleda and Herbelot 2016; Lenci 2018). In particular, it is not clear whether distributional semantics can be understood as an actual model of expression meaning (what Lenci (2008) calls the 'strong' view of distributional semantics) or merely as a model of something that correlates with expression meaning in certain partial ways (the 'weak' view). In this paper we aim to resolve, in favor of the 'strong' view, the question of what exactly distributional semantics models, what its role should be in an overall theory of language and cognition, and how its contribution to state-of-the-art applications can be understood. We do so in part by clarifying its frequently discussed but still obscure relation to formal semantics.
Our proposal relies crucially on the distinction between what linguistic expressions mean outside of any particular context, and what speakers mean by them in a particular context of utterance. Here, we term the former expression meaning and the latter speaker meaning. 1 At least since Grice 1968 this distinction has been generally acknowledged to be crucial to account for how humans communicate via language. Nevertheless, the two notions are sometimes confused, and we will point out a particularly widespread confusion in this paper. Consider an example, one which will recur throughout this paper:
(1) The red cat is chasing a mouse.
The expression "the red cat" in this sentence can be used to refer to a cat with red hair (which is actually orangish in color) or to a cat painted red; "a mouse" to the animal or to the computer device; and in the right sort of context the whole sentence can be used to describe, for instance, a red car driving behind a motorbike. It is uncontroversial that the same expression can be used to communicate very different speaker meanings in different contexts. At the same time, it is likewise uncontroversial that not anything goes: what a speaker can reasonably mean by an expression in a given context (with the aim of being understood by an addressee) is constrained by its (relatively) context-invariant expression meaning. An important, long-standing question in linguistics and philosophy is what type of object could play the role of expression meaning, i.e., serve as a context-invariant common denominator of widely varying usages. There exist two predominant candidates for a model of expression meaning: distributional semantics and formal semantics. Distributional semantics assigns to each expression, or at least each word, a high-dimensional, numerical vector, one which represents an abstraction over occurrences of the expression in some suitable dataset, i.e., its distribution in the dataset. Formal semantics assigns to each expression, typically via an intermediate, logical language, an interpretation in terms of reference to entities in the world, their properties and relations, and ultimately truth values of whole sentences. 2 To illustrate the two approaches, simplistically (and without intending to commit to any particular formal semantic analysis or (compositional) distributional semantics; see Section 5):
(2) The red cat is chasing a mouse.
Formal semantics: ιx(RED(x) ∧ CAT(x) ∧ ∃y(MOUSE(y) ∧ CHASE(x, y)))
Distributional semantics: → ↓ ← (i.e., a vector for each word)
Distributional and formal semantics are often regarded as two models of expression meaning that have complementary strengths and weaknesses and that, accordingly, must somehow be combined for a more complete model of expression meaning (e.g., Beltagy et al. 2013; Erk 2013; Asher et al. 2016; Boleda and Herbelot 2016). For instance, in these works the vectors of distributional semantics are regarded as capturing lexical or conceptual aspects of meaning but not, or only insufficiently, truth conditions, reference, entailment and compositionality, and vice versa for formal semantics. 3 Contrary to this common perspective, we argue that distributional semantics on its own can in fact be a fully satisfactory model of expression meaning, i.e., the 'strong' view of distributional semantics in Lenci 2008. Crucially, we will do so not by trying to show that distributional semantics can do all the things formal semantics does (we think it clearly cannot, at least not on its own) but by explaining that a semantics should not do all those things. In fact, formal semantics is mistaken about its job description, a mistake that we trace back, following a long strand in both the philosophical and the psycholinguistic literature, to a failure to properly distinguish speaker meaning and expression meaning. By clearing this up we aim to contribute to a firmer theoretical understanding of distributional semantics, of its role in an overall theory of communication, and of its employment in current models in NLP.

What we mean by distributional semantics
By distributional semantics we mean, in this paper, a broad family of models that assign (context-invariant) numerical vector representations to words, which are computed as abstractions over occurrences of words in contexts. Implementations of distributional semantics vary, primarily, in the notion of context and in the abstraction mechanism used. A context for a word is typically a text in which it occurs, such as a document, sentence or a set of neighboring words, but it can also contain images (e.g., Feng and Lapata 2010; Silberer et al. 2017) or audio (e.g., Lopopolo and Miltenburg 2015); in principle any place where one may encounter a word could be used. Because of how distributional models work, words that appear in similar contexts end up being assigned similar representations. At present, all models need large amounts of data to compute high-quality representations. The more closely these data resemble our experience as language learners, the more distributional semantics is expected to be able, in principle, to generate accurate representations of (as we will argue) expression meaning.
As for the abstraction mechanism used,  distinguish between classic "count-based" methods, which work with co-occurrence statistics between words and contexts, and "prediction-based" methods, which instead apply machine learning techniques (artificial neural networks) to induce representations based on a prediction task, typically predicting the context given a word. For instance, the Skip-Gram model of  would, applied to example (1), try to predict the words "the", "red", "is", "chasing", etc. from the presence of the word "cat" (more precisely, it would try to make these context words more likely than randomly sampled words, like "democracy" or "smear"). By training a neural network on such a task, over a large number of words in context, the first layer of the network comes to represent words as vectors, usually called word embeddings in the neural network literature. These word embeddings contain information about the words that the network has found useful for the prediction task.
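The prediction task just described can be made concrete with a minimal sketch of how training examples might be generated for example (1). This is an illustration under stated assumptions, not the implementation of any particular model: the function name `skipgram_examples`, the window size, the number of negatives, and the extra vocabulary items are all hypothetical, and negatives are drawn uniformly here, whereas actual implementations typically sample them according to corpus frequency.

```python
import random

def skipgram_examples(tokens, vocab, window=2, num_negatives=2, seed=0):
    """Return (target, context, label) triples for a skip-gram-style task.

    label 1: `context` really occurred within `window` positions of `target`.
    label 0: `context` was sampled at random from `vocab` (a "negative").
    """
    rng = random.Random(seed)
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        observed = [tokens[j] for j in range(lo, hi) if j != i]
        for context in observed:
            examples.append((target, context, 1))
            # Simplification: only sample words not observed near the target.
            candidates = [w for w in vocab if w not in observed and w != target]
            for negative in rng.sample(candidates, num_negatives):
                examples.append((target, negative, 0))
    return examples

sentence = "the red cat is chasing a mouse".split()
# Hypothetical extra vocabulary standing in for the rest of a corpus.
vocab = sorted(set(sentence) | {"democracy", "smear", "table", "runs"})

cat_examples = [ex for ex in skipgram_examples(sentence, vocab) if ex[0] == "cat"]
positives = [context for _, context, label in cat_examples if label == 1]
negatives = [context for _, context, label in cat_examples if label == 0]
print(positives)  # ['the', 'red', 'is', 'chasing']
```

A network trained to assign label 1 to the observed pairs and label 0 to the sampled ones would, in its first layer, end up with the word vectors discussed in the text.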
In both count-based and prediction-based methods, the resulting vector representations encode abstractions over the distributions of words in the dataset, with the crucial property that words that appear in similar contexts are assigned similar vector representations. 4 Our arguments in this paper apply to both kinds of methods for distributional semantics.
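The property that words appearing in similar contexts receive similar vectors can be illustrated with a small count-based sketch. The three-sentence corpus, the window size, and the helper names `cooccurrence_vectors` and `cosine` are assumptions made for illustration only; real count-based models use far larger datasets and usually reweight and reduce the raw counts.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse vector of counts of its neighboring words."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus: "cat" and "dog" share contexts; "bill" does not.
corpus = [
    "the cat is chasing a mouse",
    "the dog is chasing a ball",
    "the senate is debating a bill",
]
vecs = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_bill = cosine(vecs["cat"], vecs["bill"])
print(sim_cat_dog > sim_cat_bill)  # True
```

In this toy setting "cat" and "dog" occur with the same neighbors and so receive (near-)identical vectors, while "cat" and "bill" share no neighbors within the window.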
Word embeddings emerge not just from models that are expressly designed to yield word representations (such as . Rather, any neural network model that takes words as input, trained on whatever task, must 'embed' these words in order to process them -hence any such model will result in word embeddings (e.g., Collobert and Weston 2008). Neural network models for language are trained for instance on language modeling (e.g., word prediction; Mikolov et al. 2010;Peters et al. 2018) or Machine Translation (Bahdanau et al., 2015). As long as the data on which these models are trained consist of word-context pairs, the resulting word embeddings qualify, for present purposes, as implementations of distributional semantics, and our proposal in the current paper applies also to them. Of course some implementations within this broad family may be better than others, and the type of task used is one parameter to be explored: It is expected that the more the task requires a human-like understanding of language, the better the resulting word embeddings will represent -as we will argue -the meanings of words. But our arguments concern the theoretical underpinnings of the distributional semantics framework more broadly rather than specific instantiations of it.
Lastly, some implementations of distributional semantics impose biases, during training, for obtaining word vectors that are more useful for a given task. For instance, to obtain word vectors useful for predicting lexical entailment (e.g., that being a cat entails being an animal), Vulić and Mrkšić (2017) impose a bias for keeping the vectors of supposed hypernyms, like "cat" and "animal", close together (more precisely: in the same direction from the origin but with different magnitudes). This kind of approach presupposes, incorrectly as we will argue, that distributional semantics should account for entailment. It results in word vectors that are more useful for a particular task, but the model will be worse as a model of expression meaning. We will return to this type of approach in section 3.2.
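To make the geometric idea of "same direction, different magnitude" concrete, here is a minimal sketch. The two-dimensional vectors and the function `asymmetric_score` are hypothetical and chosen purely for illustration; the score is one simple way to combine direction and magnitude and is not presented as the exact procedure of the cited work.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def asymmetric_score(hyponym, hypernym):
    """Illustrative score: lower values suggest 'hyponym entails hypernym'.

    The first term is small when the vectors point in the same direction;
    the second is negative when the candidate hypernym has the larger norm.
    """
    direction_gap = 1.0 - cosine(hyponym, hypernym)
    magnitude_gap = (norm(hyponym) - norm(hypernym)) / (norm(hyponym) + norm(hypernym))
    return direction_gap + magnitude_gap

# Hypothetical vectors: same direction, "animal" twice as long as "cat".
cat = [1.0, 2.0]
animal = [2.0, 4.0]

print(asymmetric_score(cat, animal) < asymmetric_score(animal, cat))  # True
```

Because the score is asymmetric, it can separate "cat entails animal" from "animal entails cat" even though the two vectors are maximally similar in direction, which is exactly the kind of task-specific bias discussed above.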
We present two theoretical reasons why distributional semantics is attractive as a model of expression meaning, before arguing in section 4 that it can also be sufficient.

Reason 1: Meaning from use; abstraction and parsimony
We take it to be uncontroversial that what expressions mean is to be explained at least in part in terms of how they are used by speakers of the relevant linguistic community (e.g., Wittgenstein 1953; Grice 1968). 5 A similar view has motivated work on distributional semantics (e.g., Lenci 2008; also at its conception, e.g., Harris 1954). For instance, what the word "cat" means is to be explained at least in part in terms of the fact that speakers have used it to refer to cats, to describe things that resemble cats, to insult people in certain ways, and so on. Note that the usages of words generally resist systematic categorization into definable senses, and attempts to characterize word meaning by sense enumeration generally fail (e.g., Kilgarriff 1997; Hanks 2000; Erk 2010; cf. Pustejovsky 1995).
A minimal, parsimonious way of explaining the meaning of an expression in terms of its uses is to say simply that the meaning of an expression is an abstraction over its uses. Such abstractions are, of course, exactly what distributional semantics delivers, and the view that it corresponds to expression meaning is what Lenci (2008) calls the 'strong' view of distributional semantics. Distributional semantics is especially parsimonious because it relies on (mostly) domain-independent mechanisms for abstraction (e.g., principal components analysis; neural networks). Of course not all implementations are equally adequate, or equally parsimonious; there are considerable differences both in the abstraction mechanism relied upon and in the dataset used (see section 2). But the family as a whole, defined by the core tenet of associating with each word an abstraction over its use, is highly suitable in principle for modeling expression meaning. This makes the 'strong' view of distributional semantics attractive.
An alternative to the 'strong' view is what Lenci (2008) calls the 'weak' view: that an abstraction over use may be part of what determines expression meaning, but that more is needed. This view underlies for instance the common assumption that a more complete model of expression meaning would require integrating distributional and formal semantics (e.g., Beltagy et al. 2013; Erk 2013; Asher et al. 2016; Boleda and Herbelot 2016). But in section 4 we argue that the notions of formal semantics, such as reference, truth conditions and entailment, do not belong at the level of expression meaning in the first place, and, accordingly, that distributional semantics can be sufficient as a model of expression meaning. Theoretical parsimony dictates that we opt for the least presumptive approach compatible with the empirical facts, i.e., with what a theory of expression meaning should account for.
Some authors equate the meaning of an expression not with an abstraction over all uses, but only with stereotypical uses: what an expression means would be what a stereotypical speaker in a stereotypical context means by it (e.g., Schiffer 1972; Bennett 1976; Soames et al. 2002). This approach is appealing because it does justice to native speakers' intuitions about expression meaning, which are known to reflect stereotypical speaker meaning (see Section 4). However, several authors have pointed out that stereotypical speaker meaning is ultimately not an adequate notion of expression meaning (e.g., Bach 2002; Recanati 2004). To see just one reason why, consider the following arbitrary example:
(3) Jack and Jill got married.
A stereotypical use of this expression would convey the speaker meaning that Jack and Jill got married to each other. But this cannot be the (context-invariant) meaning of the expression "Jack and Jill got married", or else the following additions would be redundant and contradictory, respectively: 6
(4) Jack and Jill got married to each other.
(5) Jack and Jill got married to their respective childhood friends.
Hence the stereotypical speaker meaning of (3) cannot be its expression meaning. For many more examples and discussion see Bach 2002. Another challenge for defining expression meaning as stereotypical speaker meaning is that of having to define "stereotypical". It cannot be defined simply as the most frequent type, because that presupposes that uses can be categorized into clearly delineated, countable types. Moreover, an 'empty' context is a context too, and not the most stereotypical one.
Summing up: what an expression means depends on how speakers use it, but the uses of an expression more generally resist systematic categorization into enumerable senses, and selecting a stereotypical use isn't adequate either. Equating expression meaning with an abstraction over all uses, as the 'strong' view of distributional semantics has it, is more adequate, and particularly attractive for reasons of parsimony.

Reason 2: Distributional semantics as a model of concepts
Another reason why distributional semantics is attractive as a model of expression meaning is the following. As mentioned in section 1, distributional semantics is often regarded as a model of 'conceptual' aspects of meaning (e.g., Landauer and Dumais 1997; Baroni and Lenci 2010;Boleda and Herbelot 2016). This view seems to be motivated in part empirically: distributional semantics is successful at what are intuitively conceptual tasks, like modeling word similarity, priming and analogy. Moreover, it aligns with the widespread view in philosophy and developmental psychology that abstraction over instances is a main mechanism of concept formation (e.g., the influential work of Jean Piaget). Let us explain why concepts, and in particular those modeled by distributional semantics (because there is some confusion about their nature), would be suitable representatives of expression meaning.
It is sometimes assumed that the word vector for "cat" should model the concept CAT (we discuss some work that makes this assumption below). This may be a 'true enough' approximation for practical applications, but theoretically it is, strictly speaking, on the wrong track. This is because the word vector for "cat" does not model the concept CAT -that would be an abstraction over occurrences of actual cats, after all. Instead, the word vector for "cat" is an abstraction over occurrences of the word, not the animal, hence it would model the concept of the word "cat", say, THEWORDCAT. The extralinguistic concept CAT and the linguistic concept THEWORDCAT are very different. The concept CAT encodes knowledge about cats having fur, four legs, the tendency to meow, etc.; the concept THEWORDCAT instead encodes knowledge that the word "cat" is a common noun, that it rhymes with "bat" and "hat", how speakers have used it or tend to use it, that the word doesn't belong to a particular register, and so on. 7 Our distinction between THEWORDCAT and CAT, or between linguistic and extralinguistic concepts, is not new, and word vectors are known to capture the more linguistic kind of information, and to be (at best) only a proxy for the extralinguistic concepts they are typically used to denote by a speaker (e.g., Miller and Charles 1991). But it appears to be sometimes overlooked. For instance, the assumption that the word vector for "cat" would (or should) model the extralinguistic concept CAT is made in work using distributional semantics to model entailment, e.g., that being a cat entails being an animal (e.g., Geffet and Dagan 2005;Roller et al. 2014;Vulić and Mrkšić 2017). 
But clearly the entailment relation holds between the extralinguistic concepts CAT and ANIMAL -being a cat entails being an animal -not between the linguistic concepts THEWORDCAT and THEWORDANIMAL actually modeled by distributional semantics: being the word "cat" does not entail (in fact, it excludes) being the word "animal". Hence these approaches are, strictly speaking, theoretically misguided -although their conflation of linguistic and extralinguistic concepts may be a defensible simplification for practical purposes.
There have been many proposals to integrate formal and distributional semantics (e.g., Beltagy et al. 2013; Erk 2013; Asher et al. 2016), and a similar confusion exists in at least some of them (Asher et al., 2016; McNally and Boleda, 2017). We are unable within the scope of the current paper to do justice to the technical sophistication of these approaches, but for present purposes, impressionistically, the type of integration they pursue can be pictured as follows:
(6) The red cat is chasing a mouse.
Formal semantics: ιx(RED(x) ∧ CAT(x) ∧ ∃y(MOUSE(y) ∧ CHASE(x, y)))
Distributional semantics: → ↓ ← (i.e., a vector for each word)
Possible integration: ιx( (x) ∧ (x) ∧ ∃y( ←(y) ∧ ↓ (x, y))) (very simplistically)
Again, this may be a 'true enough' approximation, but it is theoretically on the wrong track. The atomic constants in formal semantics are normally understood (e.g., Frege 1892 and basically anywhere since) to denote the extralinguistic kind of concept, i.e., CAT and not THEWORDCAT. Put differently, entity x in example (6) should be entailed to be a cat, not to be the word "cat". This means that the distributional semantic word vectors are, strictly speaking, out of place in a formal semantic skeleton like in (6). 8 In short, distributional semantics models linguistic concepts like THEWORDCAT, not extralinguistic concepts like CAT. But this is not a shortcoming; it makes distributional semantics more adequate, rather than less adequate, as a model of expression meaning, for the following reason. A prominent strand in the literature on concepts conceives of concepts as abilities (e.g., Dummett 1993; Bennett and Hacker 2008; for discussion see Margolis and Laurence 2014). For instance, possessing the concept CAT amounts to having the ability to recognize cats, discriminate them from non-cats, and draw certain inferences about cats. The concept CAT is, then, the starting point for interpreting an object as a cat and drawing inferences from it. 
It follows that the concept THEWORDCAT is the starting point for interpreting a word as the word "cat" and drawing inferences from it, notably, inferences about what a speaker in a particular context may use it for: for instance, to refer to a particular cat. 9 Thus, the view of distributional semantics as a model of concepts, but crucially concepts of words, establishes word vectors as a necessary starting point for interpreting a word. This is exactly the explanatory job assigned to expression meaning: a context-invariant starting point for interpretation. Not coincidentally, for neural networks that take words as input, distributional semantics resides in the first layer of weights (see Section 2).
Summing up, this section presented two reasons why distributional semantics is attractive as a model of expression meaning. The next section considers whether it could also be sufficient.

Limits of distributional semantics: words don't refer, speakers do.
In many ways the standard for what a theory of expression meaning ought to do has been set by formal semantics. Consider again our simplistic comparison of distributional semantics and formal semantics:
The red cat is chasing a mouse.
Formal semantics: ιx(RED(x) ∧ CAT(x) ∧ ∃y(MOUSE(y) ∧ CHASE(x, y)))
Distributional semantics: → ↓ ← (i.e., a vector for each word)
The logical formulae into which formal semantics translates this example are assigned precise interpretations in (a model of) the outside world. For instance, RED would denote the set of all red things, CAT the set of all cat-like things, CHASE a set of pairs where one chases the other, the variable x would be bound to a particular entity in the world, etc., and the logical connectives can have their usual truth-conditional interpretation. 10 In this way formal semantics accounts for reference to things in the world and it accounts for truth values (which is what sentences refer to; Frege 1892). Moreover, referents and truth values across possible worlds/situations in turn determine truth conditions, and thereby entailments, because one sentence entails another if whenever the former is true the latter is true as well. 11 By contrast, distributional semantics on its own (cf. footnote 3) struggles with these aspects (Boleda and Herbelot 2016; see also the work discussed in section 3.2 on entailment), which has motivated the aforementioned attempts to integrate formal and distributional semantics (e.g., Beltagy et al. 2013; Erk 2013; Asher et al. 2016; Boleda and Herbelot 2016). Put simply, distributional semantics struggles because there are no entities or truth values in distributional space to refer to. Nevertheless, we think that this isn't a shortcoming of distributional semantics; we argue that a theory of expression meaning shouldn't model these aspects. 
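The model-theoretic notions used in this passage (predicates denoting sets, and entailment as truth preservation) can be sketched in a few lines. This is only an illustration under simplifying assumptions: the domain has two hypothetical entities, only the predicates CAT, MOUSE and CHASE are modeled, the two sentences are simplified existential statements rather than the formula given in the text, and checking all models over such a small domain is not a general decision procedure for entailment.

```python
from itertools import chain, combinations, product

def subsets(items):
    """All subsets of `items`, each as a Python set."""
    items = list(items)
    return [set(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

DOMAIN = [0, 1]                         # two hypothetical entities
PAIRS = list(product(DOMAIN, repeat=2))  # candidate members of CHASE

def all_models():
    """Every way of assigning denotations to CAT, MOUSE and CHASE."""
    for cat, mouse, chase in product(subsets(DOMAIN), subsets(DOMAIN), subsets(PAIRS)):
        yield {"CAT": cat, "MOUSE": mouse, "CHASE": chase}

def a_cat_chases_a_mouse(m):
    # Exists x, y: CAT(x) and MOUSE(y) and CHASE(x, y)
    return any(x in m["CAT"] and y in m["MOUSE"] and (x, y) in m["CHASE"]
               for x, y in PAIRS)

def there_is_a_cat(m):
    # Exists x: CAT(x)
    return len(m["CAT"]) > 0

def entails(premise, conclusion):
    """True if the conclusion holds in every model where the premise holds."""
    return all(conclusion(m) for m in all_models() if premise(m))

print(entails(a_cat_chases_a_mouse, there_is_a_cat))  # True
print(entails(there_is_a_cat, a_cat_chases_a_mouse))  # False
```

The sketch makes visible what the text points out: entailment here is defined over entities, sets and truth values, none of which exist in a distributional vector space.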
12 We think that these referential notions on which formal semantics has focused are best understood to reside at the level of speaker meaning, not expression meaning. In a nutshell, our position is that words don't refer, speakers do (e.g., Strawson 1950), and analogously for truth conditions and entailment. The fact that speakers often refer by means of linguistic expressions doesn't entail that these expressions must in themselves, out of context, have a determinate reference, or even be capable of referring (or capable of entailing, of providing information, of being true or false). Parsimony (again) suggests that we do not assume the latter: To explain why a speaker can use, e.g., the expression "cat" to refer to a cat, it is sufficient that, in the relevant community, that is how the expression is often used. It is theoretically superfluous to assume in addition that the expression "cat" itself refers to cats. Now, most work in formal semantics would acknowledge that "cat" out of context doesn't refer to cats, and that its use in a particular context to refer to cats must be explained on the basis of a less determinate, more underspecified notion of expression meaning. More generally, expressions are well-known to underdetermine speaker meaning (e.g., Bach 1994; Recanati 2004), as basically any example can illustrate (e.g., (1) "red cat" and (3) "got married"). However, this alone does not imply that the notions of formal semantics are inadequate for characterizing expression meaning; in principle one could try to define, in formal semantics, the referential potential of "cat" in a way that is compatible with its use to refer to cats, to cat-like things, etcetera. And one could define the expression meaning of "Jack and Jill got married" in a way that is compatible with them marrying each other and with each marrying someone else. 
13 What is problematic for a formal semantic approach is that the ways in which expressions underdetermine speaker meaning are not clearly delineated and enumerable, and that there is no symbolically definable common core among all uses. 14 This argument was made for instance by Wittgenstein (1953), who notes that the uses of an expression (his example was "game") are tied together not by definition but by family resemblance. More recent iterations of this argument can be found in criticisms of the "classical", definitional view of concepts (e.g., Rosch and Mervis 1975;Fodor et al. 1980;Margolis and Laurence 2014), and in criticisms of sense enumeration approaches to word meaning (e.g., Kilgarriff 1997;Hanks 2000;Erk 2010;cf. Pustejovsky 1995), which we already mentioned briefly before: it is unclear what constitutes a word sense, and no enumeration of senses covers all uses.
The only truly common core among all uses of any given expression is that they are all, indeed, uses of the same expression. Hence, if expression meaning is to serve its purpose as a common core among all uses, i.e., as a context-invariant starting point of semantic/pragmatic explanations, then it must reflect all uses. As we argued in section 3, distributional semantics, conceived of as a model of expression meaning (i.e., the 'strong' view of Lenci 2008), embraces exactly this fact. This makes the representations of distributional semantics, but not those of formal semantics, suitable for characterizing expression meaning. By contrast, (largely) discrete notions like reference, truth and entailment are useful, at best, at the level of speaker meaning -recall that our position is that words don't refer, speakers do (Strawson, 1950). 15 That is, one can fruitfully conceive of a particular speaker, in some individuated context, as intending to refer to discrete things, communicating a certain determinate piece of information that can be true or false, entailing certain things and not others. This still involves considerable abstraction, as any symbolic model of a cognitive system would (Marr, 1982); e.g., speaker intentions may not always be as determinate as a symbolic model presupposes. But the amount of abstraction required, in particular the kind of determinacy of content that a symbolic model presupposes, is not as problematic in the case of speaker meaning as for expression meaning. The reason is that a model of speaker meaning needs to cover only a single usage, by a particular speaker situated in a particular context; a model of expression meaning, by contrast, needs to cover countless interactions, across many different contexts, of a whole community of speakers. The symbolic representations of formal semantics are ill-suited for the latter.
Despite the foregoing considerations being prominent in the literature, formal semantics has continued to assume that referents, truth conditions, etc., are core aspects of expression meaning. The main reason for this is the traditional centrality of supposedly 'semantic' intuitions in formal semantics (Bach, 2002), either as the main source of data or as the object of investigation ('semantic competence'; for criticism see Stokhof 2011). In particular, formal semantics has attached great importance to intuitions about truth conditions (e.g., "semantics with no treatment of truth conditions is not semantics", Lewis 1972:169), a tenet going back to its roots in formal logic (e.g., Montague 1970 and the earlier work of Frege and Tarski, among others). Clearly, if expressions on their own do not even have truth conditions, as we have argued, these supposedly semantic intuitions cannot genuinely be about expression meaning. And that is indeed what many authors have pointed out. Strawson (1950), Grice (1975) and Bach (2002), among others, have argued that what seem to be intuitions about the meaning of an expression are really about what a stereotypical speaker would mean by it, or at least are heavily influenced by it. Again example (3) serves as an illustration here: intuitively "got married" means "got married to each other", but to assume that this is therefore its expression meaning would be inadequate (as we discussed in section 3.1). But we want to stress that this is not just an occasional trap set by particular kinds of examples; just being a bit more careful doesn't cut it. It is the foundational intuition that expressions can even have truth conditions that is already inaccurate. 
Our intuitions are fundamentally not attuned to expression meaning, because expression meaning is not normally what matters to us; it is only an instrument for conveying speaker meaning, and, much like the way we string phonemes together to form words, it plays this role largely or entirely without our conscious awareness. The same point has been made in the more psycholinguistic literature (Schwarz, 1996), occasionally in the formal semantics/pragmatics literature (Kadmon and Roberts, 1986), and there is increasing acknowledgment of this also in experimental pragmatics, in particular of the fact that participants in experiments imagine stereotypical contexts (e.g., Westera and Brasoveanu 2014;Degen and Tanenhaus 2015;Poortman 2017).
Summing up, the standard that formal semantics has set for what a theory of expression meaning ought to account for, and which makes distributional semantics appear to fall short, turns out to be misguided. Reference, truth conditions and entailment belong at the level of speaker meaning, not expression meaning. This entails that distributional semantics on its own need not account for these aspects, either theoretically or computationally; it should only provide an adequate starting point. Interestingly, this corresponds exactly to its role in current neural network models, on tasks that involve identifying aspects of speaker meaning. Consider the task of visual reference resolution (e.g., Plummer et al. 2015), where the inputs are a linguistic description plus an image and the task is to identify the intended referent in the image. A typical neural network model would achieve this by first activating word embeddings (a form of distributional semantics; Section 2) and then combining and transforming these together with a representation of the image into a representation of the intended referent, i.e., of speaker meaning.

Compositionality
Language is compositional in the sense that what a larger, composite expression means is determined (in large part) by what its components mean and the way they are put together. Compositionality is sometimes mentioned as a strength of formal semantics and as an area where distributional semantics falls short (among others, Beltagy et al. 2013). But in fact both approaches have shown strengths and weaknesses regarding compositionality (see Boleda and Herbelot 2016 for an overview). To illustrate, consider again example (8): "The red cat is chasing a mouse."
In this context the adjective "red" is used by the speaker to mean something closer to ORANGE (because the "red hair" of cats is typically orange), unlike its occurrence in, say, "red paint". Distributional semantics works quite well for this type of effect in the composition of content words (e.g., McNally and Boleda 2017), an area where formal semantics, which tends to leave the basic concepts unanalyzed, has struggled (despite efforts such as Pustejovsky 1995). Classic compositional distributional semantics, in which distributional representations are combined with some externally specified algorithm (which can be as simple as addition), also works reasonably well for short sentences, as measured for instance on sentence similarity (e.g., Mitchell and Lapata 2010; Grefenstette et al. 2013; Marelli et al. 2014). But for longer expressions distributional semantics on its own falls short (cf. our clarification of "on its own" in footnote 3), and this is part of what has inspired the aforementioned work on integrating formal and distributional semantics (e.g., Coecke et al. 2011; Grefenstette and Sadrzadeh 2011; Beltagy et al. 2013; Erk 2013; Asher et al. 2016). However, the fact that distributional semantics falls short of accounting for full-fledged compositionality does not mean that it cannot be a sufficient model of expression meaning. For that, it should first be established that compositionality wholly resides at the level of expression meaning, and it is not clear that it does. Let us take a closer look at the main theoretical argument for compositionality, the argument from productivity. 16 According to this argument, compositionality is necessary to explain how a competent speaker can understand the meaning of a composite expression that they have never before encountered.
However, in appealing to a person's supposed understanding of the meaning of an expression, this argument is subject to the revision proposed in Section 4: it reflects speaker meaning, not expression meaning. More correctly phrased, then, the type of data motivating the productivity argument is that a person who has never encountered a speaker uttering a certain composite expression is nevertheless able to understand what some (actual or hypothetical) speaker would mean by it. And this leaves undetermined where compositionality should reside: at the level of expression meaning, speaker meaning, or both.
To illustrate, consider again example (8), "The red cat is chasing a mouse". A speaker of English who has never encountered this sentence will nevertheless understand what a stereotypical speaker would mean by it (or will come up with a set of interpretations) -this is an instance of productivity. One explanation for this would be that the person can compositionally compute an expression meaning for the whole sentence, and from there infer what a speaker would mean by it. This places the burden of compositionality entirely on the notion of expression meaning. An alternative would be to say that the person first infers speaker meanings for each word (say, the concept CAT for "cat"), 17 and then composes these to obtain a speaker meaning of the full sentence. This would place the burden of compositionality entirely on the notion of speaker meaning (cf. the notion of resultant procedure in Grice 1968; see Borge 2009 for a philosophical argument for compositionality residing at the speaker meaning level). The two alternatives are opposite extremes of a spectrum; and note that the first is what formal semantics proclaims, yet the second is what formal semantics does, given that the notions it composes in fact reside at the level of speaker meaning (e.g., concepts like CAT as opposed to THEWORDCAT; and the end product of composition in formal semantics is typically a truth value). There is also a middle way: The person could in principle compositionally compute expression meanings for certain intermediate constituents (say, "the red cat", "a mouse" and "chases"), then infer speaker meanings for these constituents (say, a particular cat, an unknown mouse, and a chasing event), and only then continue to compose these to obtain a speaker meaning for the whole sentence. 
This kind of middle way requires that a model of expression meaning (distributional semantics) accounts for some degree of compositionality (say, the direct combination of content words), with a model of speaker meaning (say, formal semantics) carrying the rest of the burden. The proposal in McNally and Boleda (2017) is a version of this position.
The foregoing shows that the productivity argument for compositionality falls short as an argument for compositionality of expression meanings; that is, compositionality may well reside in part, or even entirely, at the level of speaker meaning. We will not at present try to settle the issue of where compositionality resides -though we favor a view according to which compositionality is multi-faceted and doesn't necessarily reside exclusively at one level. 18 What matters for the purposes of this paper is that the requirement imposed by formal semantics, that a theory of expression meaning should account for full-fledged compositionality, turns out to be unjustified.
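For readers unfamiliar with the 'classic' additive composition mentioned earlier in this section, the following minimal sketch may help. The vectors are hypothetical and hand-picked (real distributional vectors are estimated from corpora and have many more dimensions), and the names `VECTORS`, `compose_additive` and `cosine` are ours; the sketch only illustrates the mechanism of combining word vectors by addition and comparing the results by cosine similarity.

```python
import math

# Hypothetical toy word vectors (hand-picked for illustration only).
VECTORS = {
    "red": [0.9, 0.1, 0.0],
    "orange": [0.8, 0.3, 0.0],
    "cat": [0.1, 0.9, 0.2],
    "paint": [0.2, 0.0, 0.9],
}


def compose_additive(words):
    """Compose a phrase vector by summing the word vectors componentwise."""
    dim = len(VECTORS[words[0]])
    return [sum(VECTORS[w][i] for w in words) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


red_cat = compose_additive(["red", "cat"])
orange_cat = compose_additive(["orange", "cat"])
red_paint = compose_additive(["red", "paint"])

# With these toy values, "red cat" ends up closer to "orange cat"
# than to "red paint".
print(cosine(red_cat, orange_cat) > cosine(red_cat, red_paint))
```

Nothing in such a procedure decides whether what is being composed is expression meaning or a proxy for speaker meaning; that question is left open by the mechanism itself.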

Outlook
We presented two strong reasons why distributional semantics is attractive as a model of expression meaning, i.e., in favor of the 'strong' view of Lenci (2008): the parsimony of regarding expression meaning as an abstraction over use; and the understanding of these abstractions as concepts and, thereby, as a necessary starting point for interpretation. Moreover, although distributional semantics struggles with matters like reference, truth conditions and entailment, we argued that a theory of expression meaning should not account for these aspects: words don't refer, speakers do (and likewise for truth conditions and entailments). The referential approach to expression meaning of formal semantics is based on misinterpreting intuitions about stereotypical speaker meaning as being about expression meaning. The same misinterpretation has led to the common view that a theory of expression meaning should be compositional, whereas in fact compositionality may reside wholly or in part (and does reside, in formal semantics) at the level of speaker meaning. Clearing this up reveals that distributional semantics is the more adequate approach to expression meaning. In between our mostly theoretical arguments for this position, we have shown how a consistent interpretation of distributional semantics as a model of expression meaning sheds new light on certain applications: e.g., distributional semantic approaches to entailment and attempts at integrating distributional and formal semantics.

17 We discuss this here as a hypothetical possibility; to assume that individual words of an utterance can be assigned speaker meanings may not be a feasible approach in general.

18 The empirical picture is inconclusive in this regard: just because distributional semantics appears to be able to handle certain aspects of compositionality, that doesn't mean it should. After all, word vectors like "cat" have been quite successfully used as a proxy for extra-linguistic concepts like CAT, even though, as we explained, this is strictly speaking a misuse (conflating CAT and THEWORDCAT; see section 3.2). Perhaps the moderate success of distributional semantics on, for instance, adjective-noun composition like "red cat" reflects the fact that the extra-linguistic concepts RED and CAT compose (speaker meaning), even if the linguistic concepts THEWORDRED and THEWORDCAT don't (expression meaning).