Language in a (Search) Box: Grounding Language Learning in Real-World Human-Machine Interaction

We investigate grounded language learning through real-world data, by modelling a teacher-learner dynamic through the natural interactions occurring between users and search engines; in particular, we explore the emergence of semantic generalization from unsupervised dense representations outside of synthetic environments. A grounding domain, a denotation function and a composition function are learned from user data only. We show how the resulting semantics for noun phrases exhibits compositional properties while being fully learnable without any explicit labelling. We benchmark our grounded semantics on compositionality and zero-shot inference tasks, and show that it provides better results and better generalizations than SOTA non-grounded models, such as word2vec and BERT.


Introduction
Most SOTA models in NLP are only intra-textual. Models based on distributional semantics, such as standard and contextual word embeddings (Mikolov et al., 2013; Peters et al., 2018; Devlin et al., 2019), learn representations of word meaning from patterns of co-occurrence in large corpora, with no reference to extra-linguistic entities.
While successful in a range of cases, this approach does not take into consideration two fundamental facts about language. The first is that language is a referential device used to refer to extra-linguistic objects. Scholarly work in psycholinguistics (Xu and Tenenbaum, 2000), formal semantics (Chierchia and McConnell-Ginet, 2000) and philosophy of language (Quine, 1960) shows that (at least some aspects of) linguistic meaning can be represented as a sort of mapping between linguistic and extra-linguistic entities. The second is that language may be learned based on its usage, and that learners draw part of their generalizations from the observation of teachers' behaviour (Tomasello, 2003). These ideas have been recently explored by work in grounded language learning, showing that allowing artificial agents to access human actions providing information on language meaning has several practical and scientific advantages (Yu et al., 2018; Chevalier-Boisvert et al., 2019).

* Corresponding author. Authors contributed equally and are listed alphabetically.
While most of the work in this area uses toy worlds and synthetic linguistic data, we explore grounded language learning by offering an example in which unsupervised learning is combined with a language-independent grounding domain in a real-world scenario. In particular, we propose to use the interaction of users with a search engine as a setting for grounded language learning. In our setting, users produce search queries to find products on the web: queries and clicks on search results are used as a model for the teacher-learner dynamics.
We summarize the contributions of our work as follows: 1. We provide a grounding domain composed of dense representations of extra-linguistic entities, constructed in an unsupervised fashion from user data collected in the real world.
In particular, we learn neural representations for our domain of objects leveraging prod2vec (Grbovic et al., 2015): crucially, building the grounding domain does not require any linguistic input and it is independently justified in the target domain (Tagliabue et al., 2020a). In this setting, lexical denotation can also be learned without explicit labelling, as we use the natural interactions between the users and the search engine to learn a noisy denotation for the lexicon (Bianchi et al., 2021). More specifically, we use DeepSets (Cotter et al., 2018) constructed from user behavioural signals as the extra-linguistic reference of words. For instance, the denotation of the word "shoes" is constructed from the clicks produced by real users on products that are in fact shoes after having performed the query "shoes" in the search bar. Albeit domain specific, the resulting language is significantly richer than languages from agent-based models of language acquisition (Słowik et al., 2020; Fitzgerald and Tagliabue, 2020), as it is based on 26k entities from the inventory of a real website.
2. We show that a dense domain built through unsupervised representations can support compositionality. By replacing a discrete formal semantics of noun phrases (Heim and Kratzer, 1998) with functions learned over DeepSets, we test the generalization capability of the model on zero-shot inference: once we have learned the meaning of "Nike shoes", we can reliably predict the meaning of "Adidas shorts". In this respect, this work represents a major departure from previous work on the topic, where compositional behavior is achieved through either discrete structures built manually (Lu et al., 2018;Krishna et al., 2016), or embeddings of such structures (Hamilton et al., 2018).
3. To the best of our knowledge, no dataset of this kind (product embeddings from shopping sessions and query-level data) is publicly available. As part of this project, we release our code and a curated dataset, to broaden the scope of what researchers can do on the topic.
Methodologically, our work draws inspiration from research at the intersection between Artificial Intelligence and Cognitive Sciences: as pointed out in recent papers (Bisk et al., 2020; Bender and Koller, 2020), extra-textual elements are crucial in advancing our comprehension of language acquisition and the notion of "meaning". While synthetic environments are popular ways to replicate child-like abilities (Kosoy et al., 2020; Hill et al., 2020), our work calls attention to real-world Information Retrieval systems as experimental settings: cooperative systems such as search engines offer new ways to study language grounding, between the oversimplification of toy models and the daunting task of providing a general account of the semantics of a natural language. The chosen IR domain is rich enough to provide a wealth of data and possibly to see practical applications, while at the same time being sufficiently self-contained to be realistically mastered without human supervision.

Methods
Following our informal exposition in Section 1, we distinguish three components, which are learned separately in sequence: learning a language-independent grounding domain, learning a noisy denotation from search logs, and finally learning functional composition. While only the first model (prod2vec) is completely unsupervised, it is important to remember that the other learning procedures are only weakly supervised, as the labelling is obtained by exploiting an existing user-machine dynamic to provide noisy labels (i.e. no human labelling was necessary at any stage of the training process).
Learning a representation space. We train product representations to provide a "dense ontology" for the (small) world we want our language to describe. These representations are known in product search as product embeddings (Grbovic et al., 2015): prod2vec models are word2vec models in which words in a sentence are replaced by products in a shopping session. For this study, we pick CBOW (Mu et al., 2018) as our training algorithm and select d = 24 as the vector size, optimizing hyperparameters as recommended in previous work; as with word2vec, related products (e.g. two pairs of sneakers) end up closer in the embedding space. In the overall picture, the product space simply constitutes a grounding domain, and re-using tried and tested (Tagliabue et al., 2020b) neural representations is an advantage of the proposed semantics.
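To illustrate the word2vec analogy, shopping sessions can be fed directly to a CBOW-style trainer, with products playing the role of words. Below is a minimal, self-contained toy trainer in plain NumPy; it is a hypothetical sketch (session data, hyperparameters and the full-softmax objective are illustrative), not the tooling used in the paper:

```python
import numpy as np

def train_prod2vec(sessions, d=24, lr=0.1, epochs=5, window=2, seed=0):
    """Toy CBOW-style prod2vec: predict the center product of a session
    from the average embedding of its context products."""
    rng = np.random.default_rng(seed)
    vocab = sorted({p for s in sessions for p in s})
    idx = {p: i for i, p in enumerate(vocab)}
    V = len(vocab)
    W_in = rng.normal(0, 0.1, (V, d))    # product embeddings (what we keep)
    W_out = rng.normal(0, 0.1, (V, d))   # output (softmax) weights
    for _ in range(epochs):
        for s in sessions:
            for t, center in enumerate(s):
                ctx = [idx[p] for p in s[max(0, t - window):t] + s[t + 1:t + 1 + window]]
                if not ctx:
                    continue
                h = W_in[ctx].mean(axis=0)            # CBOW: average context
                scores = W_out @ h
                probs = np.exp(scores - scores.max())
                probs /= probs.sum()
                probs[idx[center]] -= 1.0             # softmax gradient
                grad_h = W_out.T @ probs
                W_out -= lr * np.outer(probs, h)
                W_in[ctx] -= lr * grad_h / len(ctx)
    return vocab, W_in

# Hypothetical toy sessions: two pairs of sneakers share contexts.
sessions = [["sneaker_a", "socks"], ["sneaker_b", "socks"], ["racket", "balls"]]
vocab, emb = train_prod2vec(sessions, d=8)
```

In the paper the embeddings are trained at scale with standard prod2vec tooling; the sketch only shows why related products, which share shopping contexts, drift toward each other in the space.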
Learning lexical denotation. We interpret clicks on products in the search result page, after a query is issued, as a noisy "pointing" signal (Tagliabue and Cohn-Gordon, 2019), i.e., a map between text ("shoes") and the target domain (a portion of the product space). In other words, our approach can be seen as a neural generalization of model-theoretic semantics, where the extension of "shoes" is not a discrete set of objects, but a region in the grounding space. Given a list of products clicked by shoppers after queries, we represent meaning through an order-invariant operation over product embeddings (average pooling weighted by empirical frequencies, similar to prior approaches); following Cotter et al. (2018), we refer to this representation as a DeepSet. Since words are now grounded in a dense domain, set-theoretic functions for NPs (Chierchia and McConnell-Ginet, 2000) need to be replaced with matrix composition, as we explain in the ensuing section.
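The frequency-weighted pooling above can be sketched in a few lines (the click-count dictionary and product ids are illustrative; the paper builds these from real search logs):

```python
import numpy as np

def deepset(click_counts, embeddings):
    """DeepSet denotation of a query: an order-invariant, frequency-weighted
    average of the embeddings of products clicked after the query.
    `click_counts` maps product id -> number of observed clicks."""
    ids, counts = zip(*sorted(click_counts.items()))
    w = np.array(counts, dtype=float)
    w /= w.sum()  # empirical click frequencies
    return w @ np.stack([embeddings[i] for i in ids])

# e.g. a (noisy) denotation of "shoes" from user clicks:
embeddings = {"shoe_1": np.array([1.0, 0.0]), "shoe_2": np.array([0.0, 1.0])}
shoes = deepset({"shoe_1": 3, "shoe_2": 1}, embeddings)  # -> [0.75, 0.25]
```

Because the pooling is a weighted mean, the result is invariant to the order of clicks, which is the property motivating the DeepSet view.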
Learning functional composition. Functional composition operates over DeepSet representations: we want to learn a function f : DeepSet × DeepSet → DeepSet. We address this by means of two models from the relevant literature (Hartung et al., 2017). The first, the Additive Compositional Model (ADM), sums vectors together to build the final DeepSet representation. The second is the Matrix Compositional Model (MDM): given two input DeepSets (for example, one for "Nike" and one for "shoes"), the learned function has the form Mv + Nu, where the interaction between the two vectors is mediated by two learned matrices, M and N. Since the output of these processes is always a DeepSet, both models can be recursively composed, given the form of the function f.
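A minimal sketch of the two composition models follows; M and N are shown as random matrices for illustration, whereas in the paper they are trained so that composed DeepSets match the DeepSets observed for complex queries:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 24  # DeepSet dimensionality, matching the product embeddings

def adm(u, v):
    """Additive Compositional Model: sum the two input DeepSets."""
    return u + v

# Hypothetical, untrained parameters for illustration only.
M = rng.normal(size=(d, d))
N = rng.normal(size=(d, d))

def mdm(u, v):
    """Matrix Compositional Model: f(u, v) = M v + N u."""
    return M @ v + N @ u

# The output is itself a DeepSet, so both models compose recursively:
nike, running, shoes = rng.normal(size=(3, d))
bas = mdm(nike, mdm(running, shoes))  # "Nike running shoes"
```

The recursive call at the bottom is exactly what the zero-shot experiments exploit: a model trained on two-term compositions can be applied to longer queries.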

Experiments
Data. We obtained catalog data, search logs and detailed behavioral data (anonymized product interactions) from a partnering online shop, Shop X, a mid-size Italian website in the sport apparel vertical. Browsing and search data are sampled from one season (to keep the underlying catalog consistent), resulting in a total of 26,057 distinct product embeddings, trained on more than 700,000 anonymous shopping sessions. To prepare the final dataset, we start from comparable literature (Baroni and Zamparelli, 2010) and the analysis of linguistic and browsing behavior in Shop X, and finally distill a set of NP queries for our compositional setting.
In particular, we build a rich but tractable set by excluding queries that are too rare (<5 counts), queries with fewer than three distinct products clicked, and queries for which no product embedding exists. Afterwards, we zoom in on NP-like constructions, by inspecting which features are frequently used in the query log (e.g. shoppers search by sport, not by color), and matching logs and NPs to produce the final set. Based on our experience with dozens of successful deployments in the space, NPs constitute the vast majority of queries in product search: thus, even if our intent is mainly theoretical, we highlight that the chosen types overlap significantly with real-world frequencies in the relevant domain. Due to the power-law distribution of queries, one-word queries are the majority of the dataset (60%); to compensate for sparsity, we perform data augmentation for rare compositional queries (e.g. "Nike running shoes"): after we send a query to the existing search engine to get a result set, we simulate n = 500 clicks by drawing products from the set with probability proportional to their overall popularity (Bianchi et al., 2021). The final dataset consists of 104 "activity + sortal" queries ("running shoes"), 818 "brand + sortal" queries ("Nike shoes"), and 47 "gender + sortal" queries ("women shoes"); our testing data consists of 521 "brand + activity + sortal" (BAS) triples, 157 "gender + activity + sortal" (GAS) triples, and 406 "brand + gender + activity + sortal" (BGAS) quadruples. "Sortal" here refers to a type of object: shoes and polo are sortals, while black and Nike are not; "activity" is the sport activity for a product, e.g. tennis for a racket. The dataset size for our compositional tests is in line with intra-textual studies on compositionality (Baroni and Zamparelli, 2010; Rubinstein et al., 2015); moreover, the lexical atoms in our study reflect a real-world distribution that is independently generated, rather than frequency on general English corpora.

Tasks and Metrics. Our evaluation metrics compare the real semantic representation of composed queries ("Nike shoes") with the one predicted by the tested models: in the case of the proposed semantics, that means evaluating how well it predicts the DeepSet representation of "Nike shoes", given the representations of "shoes" and "Nike". Comparing target vs. predicted representations is achieved by looking at the nearest neighbors of the predicted DeepSet, as intuitively complex queries behave as expected only if the two representations share many neighbors. For this reason, quantitative evaluation is performed using two well-known ranking metrics: nDCG and Jaccard (Vasile et al., 2016; Jaccard, 1912). Since the only objects users can click on are those returned by the search box, query representations may in theory be biased by the idiosyncrasies of the engine; in practice, we confirmed that embedding quality is stable even when a sophisticated engine is replaced by simple Boolean queries over TF-IDF vectors, suggesting that any such bias is likely very small and unimportant for the quality of the compositional semantics. We focus on two tasks: leave-one-brand-out (LOBO) and zero-shot (ZT). In LOBO, we train models over the "brand + sortal" queries but exclude a specific brand (e.g., "Nike") from training; at test time, we ask the models to predict the DeepSet for a seen sortal and an unseen brand. For ZT, we train models over queries with two terms ("brand + sortal", "activity + sortal" and "gender + sortal") and see how well our semantics generalizes to compositions like "brand + activity + sortal"; the complex queries used at test time are new and unseen.
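The neighbor-overlap evaluation can be sketched as follows; the cutoff k and the use of cosine similarity for neighbor retrieval are illustrative assumptions, since the exact retrieval settings are not spelled out here:

```python
import numpy as np

def topk_neighbors(q, product_space, k=5):
    """Indices of the k most cosine-similar product embeddings to q."""
    P = product_space / np.linalg.norm(product_space, axis=1, keepdims=True)
    sims = P @ (q / np.linalg.norm(q))
    return set(np.argsort(-sims)[:k].tolist())

def jaccard_at_k(predicted, target, product_space, k=5):
    """Jaccard overlap between the neighbor sets of the predicted and target
    DeepSets: 1.0 means the two representations are interchangeable."""
    a = topk_neighbors(predicted, product_space, k)
    b = topk_neighbors(target, product_space, k)
    return len(a & b) / len(a | b)
```

nDCG is computed analogously over the same ranked neighbor lists, rewarding models that place the target's neighbors near the top.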

Models. We benchmark our semantics (tagged as p in the result tables) against several baselines, including a word2vec-based model (W2V) and a model (UM) that maps umberto-commoncrawl-cased-v1 query embeddings to the product space (essentially, training to predict the DeepSet representation from text). The generalization to different and longer queries for UM comes from the embeddings of the queries themselves. For W2V, instead, we learn a compositional function that concatenates the two input DeepSets, projects them to 24 dimensions, passes them through a Rectified Linear Unit, and finally projects them to the product space. We run every model 15 times and report average results; RMSProp is the chosen optimizer, with a batch size of 200, 20% of the training set used as a validation set, and early stopping with patience = 10.
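The W2V compositional head described above amounts to a single forward pass; the sketch below shows it with random, untrained weights (in the paper the weights are trained with RMSProp against observed DeepSets):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 24  # DeepSet / product-space dimensionality

# Hypothetical, untrained weights for illustration only.
W1 = rng.normal(scale=0.1, size=(d, 2 * d))  # concat(2d) -> 24 dims
W2 = rng.normal(scale=0.1, size=(d, d))      # 24 dims -> product space

def compose(u, v):
    """Concatenate two input DeepSets, project to 24 dimensions,
    apply a ReLU, then project back into the product space."""
    h = np.maximum(0.0, W1 @ np.concatenate([u, v]))
    return W2 @ h

predicted = compose(rng.normal(size=d), rng.normal(size=d))
```

Bias terms are omitted for brevity; any standard deep-learning framework would add them and fit W1 and W2 by gradient descent.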
Results. Table 1 shows the results on LOBO, with grounded models outperforming intra-textual ones, and prod2vec semantics (tagged as p) beating all baselines. Table 2 reports performance for different complex query types in the zero-shot inference task: grounded models are superior, with the proposed model outperforming baselines across all types of queries.
MDM typically outperforms ADM as a composition method, except for GAS, where all models suffer from gender sparsity; in that case, the best model is ADM, i.e. the one without an implicit bias from training. In general, grounded models outperform intra-textual models, often by a wide margin, and prod2vec-based semantics outperforms image-based semantics, showing that the chosen latent grounding domain supports rich representational capabilities. The quantitative evaluations were confirmed by manually inspecting nearest neighbors of predicted DeepSets in the LOBO setting: as an example, MDM predicts for "Nike shoes" a DeepSet that (correctly) has all shoes as neighbors in the space, while, for the same query, UM suggests shorts as the answer. Figure 1 shows some examples of compositions obtained by the MDM model on the LOBO task; the last example shows that the model, given the input query "Nike shirt", does not reply with a shirt, but with a Nike jacket: even if the correct meaning of "shirt" was not exactly captured in this context, the model's ability to identify a similar item is remarkable.

Conclusions and Future Work
In the spirit of Bisk et al. (2020), we argued for grounding linguistic meaning in artificial systems through experience. In our implementation, all the important pieces (domain, denotation, composition) are learned from behavioral data. By grounding meaning in (a representation of) objects and their properties, the proposed noun phrase semantics can be learned "bottom-up" like distributional models, but can generalize to unseen examples, like traditional symbolic models: the implicit, dense structure of the domain (e.g. the relative position in the space of Nike products and shoes) underpins the explicit, discrete structure of queries picking objects in that domain (e.g. "Nike shoes"); in other words, compositionality is an emergent phenomenon. While encouraging, our results are still preliminary: first, we plan to extend our semantics, starting with Boolean operators (e.g. "shoes NOT Nike"); second, we plan to improve our representational capabilities, either through symbolic knowledge or more discerning embedding strategies; third, we wish to explore transformer-based architectures as an alternative way to produce set-like representations. We conceived our work as a testable application of a broader methodological stance, loosely following the agenda of the child-as-hacker (Rule et al., 2020) and child-as-scientist (Gopnik, 2012) programs. Our "search-engine-as-a-child" metaphor may encourage the use of abundant real-world search logs to test computational hypotheses about language learning inspired by cognitive sciences (Carey and Bartlett, 1978).