RecoBERT: A Catalog Language Model for Text-Based Recommendations

Language models that utilize extensive self-supervised pre-training on unlabeled text have recently been shown to significantly advance the state-of-the-art performance in a variety of language understanding tasks. However, it is still unclear if and how these recent models can be harnessed for text-based recommendations. In this work, we introduce RecoBERT, a BERT-based approach for learning catalog-specialized language models for text-based item recommendations. We suggest novel training and inference procedures for scoring similarities between pairs of items that do not require item similarity labels. Both the training and the inference techniques were designed to utilize the unlabeled structure of textual catalogs and to minimize the discrepancy between them. By incorporating four scores during inference, RecoBERT can infer text-based item-to-item similarities more accurately than other techniques. In addition, we introduce a new language understanding task for wine recommendations using similarities based on professional wine reviews. As an additional contribution, we publish an annotated recommendations dataset crafted by human wine experts. Finally, we evaluate RecoBERT and compare it to various state-of-the-art NLP models on wine and fashion recommendation tasks.


Introduction
Recommendation systems are a major component of content discovery in online stores. Different recommendation systems are employed across a broad spectrum of domains, such as movies, music, groceries, and more. In each case, the recommendation system is associated with a different catalog of items comprising different descriptors, item properties, and metadata. This work deals with the case of generating item-to-item similarities based on item descriptions.
Personalized recommender systems make use of Collaborative Filtering (CF) information, Content-Based (CB) information, or both (Aggarwal et al., 2016). CF approaches build models based on users' past behavior (Breese et al., 2013; Schafer et al., 2007). On the other hand, CB recommenders use item meta-data such as properties, tags, and descriptions in order to build and match user and item profiles (Brusilovski et al., 2007; Wang et al., 2018c; Lops et al., 2011). A model that utilizes both CF and CB is called a hybrid recommender system.
Item-to-item recommendations are commonly used in large-scale recommender systems such as those of Netflix (Gomez-Uribe and Hunt, 2015), Amazon (Linden et al., 2003), Xbox (Koenigstein and Paquet, 2013), and many others. Commonly found on the product details page (PDP), these non-personalized recommendation lists are known to drive up purchases as well as user engagement. Similar to personalized recommendations, item similarities can be computed based on user activity, item meta-data, or both, using a variety of different models. In a new store, where user data does not exist, item-to-item recommendations are computed using one or more content-based approaches that leverage item meta-data in order to compute item-to-item similarities. The extracted data may include images, videos, textual descriptions, and more.
Textual content-based recommendation systems leverage textual information about items, such as item descriptions and titles. These models usually rely on Natural Language Processing (NLP) models to compute item-to-item similarities. A naive approach to produce recommendations from textual information is to infer similarities by embedding the textual description (and title) of every item in a latent space (Lops et al., 2011; Wang et al., 2018c; De Gemmis et al., 2015). Item embeddings that utilize textual descriptions can be obtained via different types of language models.
Recently, self-supervised pre-training of language models has revolutionized the field of NLP. These techniques first apply self-supervised pre-training of a neural model using a large corpus of unlabeled text, and then fine-tune the model for specific NLP tasks. Among the recent self-supervised pre-trained language models, BERT (Devlin et al., 2018) has emerged as a very powerful method, achieving state-of-the-art results in a variety of NLP tasks such as sentiment analysis (Sun et al., 2019), language inference (Wu and Dredze, 2019; Cui et al., 2019), sentence similarities (Reimers and Gurevych, 2019), and more. The BERT pre-training technique incorporates (1) reconstruction of randomly masked words (known as the masked language model), and (2) predicting whether two sentences are consecutive (next sentence prediction).
In this work, we build upon BERT and introduce a novel technique for self-supervised pre-training of catalog-based language models. In addition, we introduce an inference technique that utilizes the above model for inferring item similarities that can be used for item-to-item recommendations in cold catalogs. Hence, we name our technique RecoBERT, a BERT model adapted for text-based recommendations.
RecoBERT pre-training leverages self-supervision to its fullest by utilizing a combination of a masked language model along with a title-description model. The latter comprises a learning task that reveals relationships between item titles and descriptions. In some cases, these relations can form a summarization task, for which titles are short sentences that summarize the longer descriptions. In other cases, catalogs may comprise items with implicit titles that incorporate a few words that were crafted for each item at hand. In both cases, the title-description task encourages the model to reveal the underlying connections between titles and descriptions, improves language understanding, and therefore yields more accurate embeddings. This results in improved text-based item similarity performance in cold catalogs. Importantly, RecoBERT requires neither item similarity labels nor usage data.
We also introduce a new NLP wine recommendation task, demonstrating RecoBERT's ability to find similar items in very complex domains. The task utilizes a publicly available dataset comprising 120K elaborate wine descriptions written by wine experts. The goal is to produce wine recommendations for each item in the dataset, in the form of other similar wines. We employed a professional wine sommelier to manually craft 1095 recommendations for ∼100 wines that form a "ground-truth" test-set for evaluations. For reproducibility, and as an additional contribution, we made these annotations publicly available 1 .
Importantly, the novel wine recommendations task introduced in this work is different from, and more complex than, most NLP tasks usually considered. The wine reviews incorporate domain-specific semantics, taxonomy, and phrases, as well as picturesque descriptions of tastes, aromas, and colors. Arguably, determining similarities between wine reviews is a challenging task, which requires a level of intelligence and knowledge that is high even for the average human. Specifically, compared to the tasks presented in the GLUE benchmark (Wang et al., 2018b), for which the average adult can easily solve a query in a few seconds, determining the similarity of wines based on their reviews may pose a challenge to most people and takes up to a few minutes even for wine enthusiasts and professionals.
The main contribution of this paper is threefold: (1) We introduce RecoBERT, a self-supervised training procedure for catalog-based language models. (2) We introduce a novel inference technique that yields item-to-item similarities by leveraging RecoBERT, and compare its performance against relevant baselines. (3) We introduce a novel, complex NLP task of wine recommendations and publish a matched labeled test set crafted by a professional sommelier.

Related Work
Recent methods in text-based recommendations suggest a hybrid approach that combines usage data with either traditional or neural-based NLP methods. In (de Souza Pereira Moreira et al., 2018; Zheng et al., 2017), the authors suggest a hybrid approach for recommendations that utilizes both session data (CF) and textual features from articles extracted by a convolutional neural network (CNN). Additionally, in Wang et al. (2015); Djuric et al. (2015) the authors proposed hierarchical Bayesian models for learning a joint representation for textual content and personal ratings, using latent Dirichlet allocation (LDA), deep autoencoders, and word2vec (Mikolov et al., 2013). In contrast to the above methods, the model in this paper doesn't depend on usage data and hence can be applied to completely cold catalogs. Recently, (Gong and Zhang, 2016) proposed attentive CNN for performing hashtag recommendations for tweets. This method solely depends on text, but requires supervision for similarity. Unlike this method, our model focuses on textual catalogs and doesn't require item-to-item similarity labels.

Figure 1: RecoBERT receives title-description pairs corresponding to positive ("real") and negative ("fake") samples, extracted from a given catalog. (a) During training, the title-description pairs are propagated through the BERT backbone and transformed into two feature vectors. These vectors are then fed into the TDM, minimizing a cosine loss between them. (b) In inference, four scores are computed. Two scores propagate the seed and candidate items separately ("real" pairs). The other two scores utilize the TDM head and propagate title-description pairs extracted from both seed and candidate items ("fake" pairs).
A recently proposed family of Transformer-based language models (Devlin et al., 2019) uses multiple attention layers and a two-phase training procedure composed of unlabeled pre-training and supervised fine-tuning. These models show great promise in linguistic tasks, and were shown to exceed human-level baselines in specific tasks such as machine translation (Vaswani et al., 2017a), question answering, and other related tasks (Wang et al., 2018a). These models utilize sentence embedding techniques (Palangi et al., 2016), where a text is encoded into low-dimensional vectors that summarize the information in the input text. For example, in universal sentence encoder (USE) (Cer et al., 2018), the authors suggest utilizing vectors extracted from a machine translation model for transfer learning to other NLP tasks.
Lately, Storks et al. (2019) and Aßenmacher and Heumann (2020) claimed that human baselines are being surpassed by Transformer-based models and others that exploit statistical cues in the well-known GLUE set (Wang et al., 2018b). Such models may suffer severe performance degradation when put to use on real-world problems. Hence, some argue that the tasks in the GLUE dataset no longer suffice for evaluating language understanding models.
In this work, we propose a new language task that is much more complicated than the semantic similarity tasks in GLUE. Motivated by extracting item similarities for recommender systems, our task is neither composed of single sentences nor sentence pairs. Instead, the goal is to induce semantic similarity between wine items represented by sentence-paragraph pairs. Due to the complexity of the wines domain, as well as the professional language and length of the wine reviews, our novel language understanding task requires a high level of intelligence and knowledge that exceeds the average human level.

Methodology
Let $W$ be the vocabulary of tokens in a given language. Let $T$ be the set of all possible sentences generated by $W$, including the empty sentence. Additionally, let $D$ be the set of all possible paragraphs generated by $T$. Let $C \subseteq T \times D$ be a catalog of items, where each item is associated with a title-description pair (titles are sentences, and descriptions are paragraphs). Given a catalog $C$, the task is to infer a similarity function $F : C \times C \to \mathbb{R}$ that scores the similarity between any pair of items $s, m \in C$. In particular, $F$ can be used to rank all the items in the catalog according to their semantic textual similarity with a given seed item $s \in C$.

Model Architecture and Loss Functions
RecoBERT is a function $B : T \times D \to \mathbb{R}^h \times \mathbb{R}^h$, which utilizes a BERT-Large architecture (Devlin et al., 2019) with a hidden layer size of $h$, and incorporates (1) a title-description model (TDM) for scoring the relation between titles and descriptions, and (2) a masked language model (MLM) for specializing in a given domain. A dataset of $n$ training samples is represented as pairs $(t_i, d_i) \in T \times D$, indexed by $i = 1, \dots, n$. Each pair is associated with a label $y_i \in \{0, 1\}$, indicating whether $t_i$ and $d_i$ correspond to the same item.
Following the MLM procedure in (Devlin et al., 2019), a tokenizer $I$ splits the title-description pair $(t_i, d_i)$ into tokens, masks 15% of them, and adds the special CLS and SEP tokens. This input sequence is then mapped to a sequence of latent embedding tokens by propagating the input through BERT, i.e., by computing $\mathrm{BERT}(I(t_i, d_i))$, where each latent token corresponds to its matched input token. Two feature vectors, $F^{t_i}$ and $F^{d_i}$, are then computed from the latent tokens (Eq. 1). Importantly, $F^{t_i}$ and $F^{d_i}$ correspond to the title and description of the input, respectively.
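The step that turns the latent tokens into the two feature vectors can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes mean pooling over the title tokens and over the description tokens, and an input layout of CLS, title tokens, SEP, description tokens, SEP. The function name `pool_title_description` and its arguments are hypothetical.

```python
import numpy as np


def pool_title_description(latent_tokens, n_title_tokens):
    """Split latent tokens into title and description parts and mean-pool each.

    latent_tokens: array of shape (seq_len, h), assumed to be laid out as
        [CLS, title tokens..., SEP, description tokens..., SEP].
    n_title_tokens: number of title tokens in the sequence.
    Mean pooling is an assumption made for this sketch.
    """
    latent_tokens = np.asarray(latent_tokens, dtype=float)
    title_start = 1                               # skip the leading CLS token
    title_end = title_start + n_title_tokens
    desc_start = title_end + 1                    # skip the first SEP token
    desc_end = latent_tokens.shape[0] - 1         # drop the trailing SEP token
    f_title = latent_tokens[title_start:title_end].mean(axis=0)
    f_desc = latent_tokens[desc_start:desc_end].mean(axis=0)
    return f_title, f_desc
```

Both returned vectors have dimension h, so the pair can be passed directly to a cosine-based head.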
It is important to clarify the distinction between BERT and RecoBERT. BERT yields contextualized embeddings (as defined in Eq. 3.1), and can be replaced by any other language model. RecoBERT, on the other hand, is defined as $B(t_i, d_i) := (F^{t_i}, F^{d_i})$.

The RecoBERT loss function is composed of two components, a TDM loss and an MLM loss. The purpose of TDM is to learn the relationship between item titles and descriptions. To this end, we feed the model with both positive ("real") title-description pairs, for which both title and description belong to the same item, and negative ("fake") pairs, where the title and description are taken from two different items.
The TDM loss term utilizes a cosine head $C_{TDM} : \mathbb{R}^h \times \mathbb{R}^h \to \mathbb{R}$ that scores the relation between the title and the description of a pair, and the TDM loss is computed from the scores $C_{TDM}(B(t_i, d_i))$ and the labels $y_i$.

The purpose of the MLM is to specialize RecoBERT's language model on the specifics of the domain and catalog at hand. As we shall see later, this has major significance in complex NLP tasks such as wine recommendations, where the semantic meaning of certain words differs from their usual semantic meaning.
The MLM loss follows the paradigm presented in (Devlin et al., 2018) and utilizes a classifier $C_{MLM} : \mathbb{R}^h \to \mathbb{R}^{|W|}$ that projects the embedded tokens to the vocabulary space and applies a softmax function to infer pseudo-probabilities. The MLM loss is the cross-entropy of these pseudo-probabilities at the masked positions, expressed through $z_i$, a sequence of index pairs $(l, k)$ that correspond to the $i$-th training sample, where $l$ and $k$ are the indices of the masked token in $\mathrm{BERT}(I(t_i, d_i))$ and in the vocabulary $W$, respectively. In summary, the total loss for RecoBERT is defined as $L_{total} = L_{MLM} + L_{TDM}$.
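The MLM term for a single training sample can be sketched as a cross-entropy evaluated only at the masked positions. This is an illustrative sketch under stated assumptions: the loss is averaged over the masked positions of the sample (the normalization is an assumption), and `mlm_loss`, `logits`, and `masked_pairs` are hypothetical names. The logits stand for the classifier output before the softmax.

```python
import numpy as np


def mlm_loss(logits, masked_pairs):
    """Cross-entropy at the masked positions of one training sample.

    logits: array of shape (seq_len, vocab_size), classifier output per token
        before the softmax.
    masked_pairs: list of (l, k) index pairs, where l is the position of a
        masked token in the sequence and k is the vocabulary index of the
        original token at that position.
    Averaging over the masked positions is an assumption of this sketch.
    """
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    positions = [l for l, _ in masked_pairs]
    targets = [k for _, k in masked_pairs]
    # Negative log-probability of the true token at each masked position.
    return -log_probs[positions, targets].mean()
```

The total loss would then add this term to the TDM term, following the sum of the two losses stated in the text.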

Training
We split the dataset into train and validation sets. The validation set is used for early stopping, which we have found essential, especially for smaller-sized datasets. The RecoBERT backbone is initialized with the weights of the publicly available pre-trained BERT model, while the TDM head is initialized from scratch.
During training, we iterate over the items in the train set, generating positive and negative samples by switching the description to that of another item with probability $p_s = 0.5$. Then, the positive and negative labels are assigned accordingly. The RecoBERT model and its training are illustrated in Fig. 1(a).
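The sampling scheme described above can be sketched in a few lines. This is a minimal illustration, assuming the catalog is a list of (title, description) tuples, that the replacement description is drawn uniformly from the other items, and that label 1 marks a positive ("real") pair; the function name `make_training_pairs` and its parameters are hypothetical.

```python
import random


def make_training_pairs(catalog, p_switch=0.5, rng=None):
    """Generate (title, description, label) samples for TDM training.

    catalog: list of (title, description) tuples.
    p_switch: probability of replacing the description with that of
        another item, which yields a negative ("fake") pair.
    Labels (assumed convention): 1 for a real pair, 0 for a fake pair.
    """
    rng = rng or random.Random()
    samples = []
    for i, (title, description) in enumerate(catalog):
        if len(catalog) > 1 and rng.random() < p_switch:
            # Draw an index uniformly from all items except item i.
            j = rng.randrange(len(catalog) - 1)
            if j >= i:
                j += 1
            samples.append((title, catalog[j][1], 0))
        else:
            samples.append((title, description, 1))
    return samples
```

Calling the function once per epoch would give each item a fresh chance of appearing as a positive or a negative sample.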

Inference
RecoBERT's inference proceeds by generating four scores. First, we propagate every item $(t_i, d_i) \in C$ through RecoBERT, extracting $F^{t_i}$ and $F^{d_i}$, as defined in Eq. 1. Then, given a seed item $s = (t_s, d_s) \in C$, and for any item $m \neq s$, $m = (t_m, d_m) \in C$, we define two cosine scores, $Cos_D(s, m) := \mathrm{cosine}(F^{d_s}, F^{d_m})$ and $Cos_T(s, m) := \mathrm{cosine}(F^{t_s}, F^{t_m})$. These two cosine scores represent the similarity between (1) the seed and candidate descriptions, and (2) the seed and candidate titles.
Next, we utilize the learned TDM head to compute two additional cosine scores. Specifically, we propagate the pairs $(t_m, d_s)$ and $(t_s, d_m)$ through RecoBERT, extracting $C_{TDM}(B(t_m, d_s))$ and $C_{TDM}(B(t_s, d_m))$, respectively. These two scores approximate the similarity between the candidate title and the seed description, and between the seed title and the candidate description.
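Finally, we normalize each score separately, across all candidate items, to have zero mean and unit variance, and define the total score as follows:

$I_{total}(s, m) := \lambda_1 Cos_D(s, m) + \lambda_2 Cos_T(s, m) + \lambda_3 C_{TDM}(B(t_m, d_s)) + \lambda_4 C_{TDM}(B(t_s, d_m))$,

where each term denotes the normalized score, $\lambda_1, \dots, \lambda_4$ are set to 1, and the item-to-item recommendations are obtained by sorting the candidate items according to $I_{total}$ in descending order. RecoBERT's inference scheme is depicted in Fig. 1(b).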
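The scoring and ranking step for a single seed can be sketched as follows. This is a minimal illustration under stated assumptions: the feature vectors and the two TDM scores are taken as precomputed NumPy arrays (one entry per candidate), all vectors are non-zero, and the names `zscore`, `cosine_to_all`, and `rank_candidates` are hypothetical helpers rather than part of any released code.

```python
import numpy as np


def zscore(x):
    """Normalize a score vector to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)


def cosine_to_all(seed_vec, candidate_vecs):
    """Cosine similarity between one seed vector and every candidate row."""
    seed_vec = np.asarray(seed_vec, dtype=float)
    candidate_vecs = np.asarray(candidate_vecs, dtype=float)
    norms = np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(seed_vec)
    return candidate_vecs @ seed_vec / norms


def rank_candidates(cos_d, cos_t, tdm_tm_ds, tdm_ts_dm,
                    lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Combine four per-candidate scores and sort candidates, best first.

    cos_d, cos_t: description and title cosine scores against the seed.
    tdm_tm_ds, tdm_ts_dm: TDM head scores for the two mixed pairs
        (candidate title with seed description, and seed title with
        candidate description), assumed to be computed elsewhere.
    """
    scores = [cos_d, cos_t, tdm_tm_ds, tdm_ts_dm]
    # Each score is normalized across candidates before weighting.
    total = sum(lam * zscore(s) for lam, s in zip(lambdas, scores))
    order = np.argsort(-total)  # descending total score
    return order, total
```

Setting the last two weights to zero reduces the ranking to the two plain cosine scores, which is a convenient way to compare a full and a reduced scoring scheme with the same code.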

Wine Recommendations from Reviews
We introduce a novel NLP recommendation task of finding wine recommendations from reviews. The task is based on a publicly available dataset from Kaggle 2 , and a new test set, annotated by a professional wine sommelier. A common obstacle in evaluating similarity models is the lack of a relevant test set or ground truth. Therefore, as part of this paper's contributions, we made this test set publicly available. The Kaggle dataset, together with our annotated ground truth, forms a new text-based recommendation task that can be further used by others in the future.

The Wine Dataset
The Kaggle wine dataset comprises 120K wine titles and reviews. Each title is composed of: (1) winery name, (2) wine year, (3) wine name, and (4) grape variety. The reviews are single-paragraph descriptions written by wine experts, delineating taste, aromas, and other wine characteristics.
The descriptions frequently use a nonliteral, symbolic jargon common among wine enthusiasts and oenologists. For example, wine sweetness can be identified by five intensity levels, including bone-dry, dry, off-dry, sweet, and very sweet. These intensity levels substantially affect the similarity between wines. Hence, the task of wine recommendations might be considered more complex and more difficult than many other classical NLP tasks such as sentiment analysis or question answering. While these classical tasks are relatively simple for most humans, the wine recommendations task is arguably more difficult and convoluted even for intelligent humans.

2 https://www.kaggle.com/zynicide/wine-reviews
Generally, inferring wine similarity requires solving the following language understanding challenges:

1. Characteristic Intensities. Wines comprise different characteristics with different intensity levels.

2. Characteristic Categories. Taste and aroma are classified into associative categories, and some classes are more distinct than others. For example, apple and citrus are two distinct categories of taste. Given a wine with a hint of apple, a recommendation for a wine with citrus characteristics is inadvisable by most professionals. In this example, the additional difficulty stems from the fact that a general (non-specialized) language model may consider "apple" and "citrus" to be relatively close as both are fruits.

3. Domain-specific Semantics and Taxonomy. Compared to general language, the wine domain incorporates professional jargon with unique phrases, different semantics, and unique taxonomy. For example, the semantic opposite of the word dry in the English language is usually the word wet; however, in the context of wines, it is the word sweet. Similarly, the opposite of white is generally black, whereas in the wine domain it is the word red.

4. Non-literal Figurative Descriptions. Professional wine reviews incorporate symbolic descriptions that depart from their literal meaning. For example, one reviewer unfavorably described a wine named "Riscal 1860" using the words "Bulky and clumsy", which implies that the combination of acidity, tannins, alcohol, and sugars is out of balance.

Fig. 2 presents two representative samples from the dataset. The top example is a red wine, named "Maucho Reserva". Its description incorporates domain-specific phrases, such as "tannic", and categorical flavors, such as "raspberry", "plum", "coffee", and more. The description also incorporates figurative terms, such as "chunky and muscular" and "texturally sound finish". The second example is "Vulká Bianco", a white wine with a relatively straightforward description expressing the different flavor categories and the intensity level of the "acidity".

The Expert Annotations Set
Unlike collaborative filtering models, content-based item-to-item similarity and recommendation models are very hard to evaluate. Hence, we collected a test set, annotated by professional wine sommeliers, comprising 1095 wine recommendations for 100 wines. The sommeliers were asked to choose representative "seed" items and annotate each with ∼10 other wines that share similar characteristics with the seed item. For the sake of reproducibility and as an additional contribution, we made these annotations publicly available 3 .
Fig. 3 exhibits one sample from the annotated expert recommendations. As can be seen in the figure, the seed and the recommended item share similar phrases, such as "ripe of black-skinned fruit", "smooth" and "velvety tannins" from the recommended item, that can be associated with phrases in the seed item, including "black-skinned berry", "smooth accessible palate" and "supple tannins".

Baseline Models
We compare RecoBERT with the following models:

Universal Sentence Encoder (USE) leverages feature vectors extracted from a Transformer model (Vaswani et al., 2017b) for transfer learning tasks. The Transformer architecture is composed of encoder and decoder networks. During the forward pass, the Transformer receives text in a source language, forwards it through the encoder, outputs a feature vector, and feeds it into the decoder, which then generates text in the target language. USE (Cer et al., 2018) utilizes the above intermediate feature vector for transfer learning to other NLP tasks, including semantic textual similarity (such as STS Benchmark (Agirre et al., 2012)), sentiment analysis (Sun et al., 2019), etc. In our work, we employ USE to generate a separate embedding for every item title and description.
Pre-trained-BERT is the pre-trained BERT-Large model from (Devlin et al., 2018). This model was trained using a large corpus of unlabeled text to optimize both the masked language model and the next sentence prediction (NSP) task. Since, in most datasets, item similarity labels do not exist, we cannot fine-tune this model for the item similarity task. Instead, we utilize the pre-trained BERT model as a feature extractor, and extract the feature vectors $F^{t_m}$ and $F^{d_m}$ (see Eq. 1) for every item in the catalog.
Specialist-BERT is a BERT-Large model whose pre-training was continued on a domain-specific corpus. Specifically, we create a specialized corpus by extracting the description paragraphs of all the items in the given catalog. Then, we iterate over sentence pairs extracted from the above corpus and continue training the pre-trained BERT with the identical BERT pre-training technique, as presented in (Devlin et al., 2018). We train this model with settings similar to those used for RecoBERT (i.e., train-validation split, 1.5M training steps, etc.). Feature vectors are extracted in the same way as for the above Pre-trained-BERT model.
MoverScore employs a contextualized embedding model and a variant of the Earth Mover Distance (Rubner et al., 2000) to measure the similarity between sentence-pairs (Zhao et al., 2019). Given two sentences, MoverScore aligns similar words from each sentence and computes the flow traveling between these words. MoverScore has recently emerged as a promising text similarity metric for text generation tasks, including summarization, machine translation, image captioning, and data-to-text generation. In our experiments, we utilize the MoverScore technique on top of the Specialist-BERT model.
For inference, all baseline models except MoverScore utilize the inference technique presented in Section 3.3 with $\lambda_3$ and $\lambda_4$ set to 0 (i.e., applying the sum $\lambda_1 Cos_D(s, m) + \lambda_2 Cos_T(s, m)$). For USE inference, we replace the underlying feature vectors $(F^{d_s}, F^{d_m})$ and $(F^{t_s}, F^{t_m})$ with those extracted from USE. The MoverScore baseline is applied with its own scoring technique (Zhao et al., 2019), utilizing the EMD between the latent representations of the words.

Quantitative Metrics
Hit Ratio at k (HR@k) HR@k is the percentage of ground-truth seed-candidate pairs for which the true item was found in the top k items suggested by the model. Specifically, a seed-candidate pair is scored with 1 if the candidate item is ranked within the top k recommendations produced by the model w.r.t. the seed, and with 0 otherwise. The average over all seed-candidate examples in the test set is then reported.
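The HR@k computation can be sketched as below. This is an illustrative sketch: the data layout (a dictionary from each seed to its ranked candidate list, and a list of ground-truth seed-candidate pairs) and the name `hit_ratio_at_k` are assumptions made for the example.

```python
def hit_ratio_at_k(ranked_lists, annotations, k):
    """Fraction of ground-truth pairs whose candidate is in the seed's top k.

    ranked_lists: dict mapping a seed id to its list of candidate ids,
        ordered from most to least similar.
    annotations: iterable of (seed, candidate) ground-truth pairs.
    """
    hits = 0
    total = 0
    for seed, candidate in annotations:
        total += 1
        # Score 1 if the annotated candidate appears in the top k, else 0.
        if candidate in ranked_lists[seed][:k]:
            hits += 1
    return hits / total if total else 0.0
```

Multiplying the returned fraction by 100 gives the percentage form used when reporting results.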

Mean Reciprocal Rank (MRR)
This measure is defined as the average of the reciprocal ranks considering the entire set of ranked items (and not just the top-k). In contrast to HR, the MRR metric takes into consideration the exact order within the recommendation list.
Mean Percentile Rank (MPR) Given a seed item, the percentile rank is the rank that was assigned by the recommendation model to the correct item (to be retrieved), divided by the number of ranked items. This quantity is computed for all the items in the test set and then averaged.
For more details, we refer the reader to (Resnick and Varian, 1997).
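The two rank-based metrics can be sketched together, since both derive from the position of each annotated candidate in the seed's ranked list. This is a minimal sketch: averaging over seed-candidate pairs, the 1-based rank convention, the data layout, and the name `mrr_and_mpr` are assumptions made for illustration.

```python
def mrr_and_mpr(ranked_lists, annotations):
    """Mean reciprocal rank and mean percentile rank over ground-truth pairs.

    ranked_lists: dict mapping a seed id to its full list of candidate ids,
        ordered from most to least similar.
    annotations: list of (seed, candidate) ground-truth pairs; every
        candidate is assumed to appear in its seed's ranked list.
    """
    reciprocal_ranks = []
    percentile_ranks = []
    for seed, candidate in annotations:
        ranking = ranked_lists[seed]
        rank = ranking.index(candidate) + 1           # 1-based rank
        reciprocal_ranks.append(1.0 / rank)
        percentile_ranks.append(rank / len(ranking))  # rank / number of ranked items
    n = len(annotations)
    return sum(reciprocal_ranks) / n, sum(percentile_ranks) / n
```

Higher MRR and lower MPR both indicate that the annotated items sit closer to the top of the ranked list.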

Wine Recommendations Results
For the wine dataset, we compare RecoBERT with all four baselines by three different evaluations. The first two evaluations conduct item similarities by solely relying on item descriptions or item titles (but not both), and ranking 120K wine items. The third evaluation utilizes both item titles and descriptions, ranking the subset of the expert annotated wines.
In Tab. 1, we report MPR, MRR, and five HR@k scores for each evaluation, using the 1095 expert annotations. In the upper and middle parts of the table, all models solely utilize item descriptions and item titles, respectively. In both evaluations, each model ranked the entire 120K wines in the catalog for each seed. To make a clean comparison between RecoBERT and the other BERT-based models, in these experiments we evaluated all BERT-based models (including RecoBERT) with the same inference score. Specifically, for the descriptions evaluations (upper part), we set all BERT-based models to solely use the $Cos_D(s, m)$ score (by setting $\lambda_1$ to 1 and the other $\lambda$s to 0, i.e., we set $\lambda_2$, $\lambda_3$, and $\lambda_4$ in RecoBERT to 0, and $\lambda_2$ to 0 in the other BERT-based baselines). Similarly, for the titles evaluations (middle part), we solely utilize the $Cos_T(s, m)$ score (setting $\lambda_2$ to 1 and eliminating the rest of the scores). The MoverScore baseline in each part utilizes the corresponding textual information (descriptions or titles).
In the bottom part of the table, we report the performance of all models utilizing both item titles and descriptions, comparing against the full RecoBERT inference, as presented in Section 3.3. In these evaluations, the MoverScore baseline applies MoverScore separately to item titles and to item descriptions, and ranks the items in the catalog by the sum of both scores.
In the description-based evaluations (upper part of the table), RecoBERT scored 21.04%, while the baseline models ranged between 4.31% (for USE) and 11.6% (for Specialist-BERT). In addition, RecoBERT presents superior performance on all HR metrics, sometimes improving by a factor of two, even compared to Specialist-BERT and MoverScore, which yield the best performance among the baseline models. This can be attributed to the importance of the title-description learning task, and to the benefit gained by the TDM head, which produces more accurate embeddings under a cosine metric.
Notably, in the same description-based evaluations (upper part of the table), RecoBERT yields 10.4% in the HR@10 metric. This means that, on average, for each seed, RecoBERT was able to retrieve roughly one out of ∼10 expert annotations in the top ten recommendations list while ranking 120K candidate items. Remarkably, ∼10 annotated items represent ∼0.0083% of the entire catalog.
Additionally, as can be seen in the bottom part of the table, RecoBERT with the full inference yields better performance, by a sizeable margin, compared to all other models, including the same RecoBERT applied with the baseline inference (which solely utilizes the $Cos_D(s, m)$ and $Cos_T(s, m)$ scores). The latter is evidence for the benefit of applying the full inference method, which also utilizes the TDM head by propagating title-description pairs extracted from seed-candidate items.

Fashion Recommendations Results
We evaluate RecoBERT on a fashion catalog incorporating 4K items and compare its performance with all four baseline models. Similar to the wines evaluations, all BERT-based models were initialized with the Pre-trained BERT weights and continued pre-training using the text extracted from the fashion catalog. During inference, all models used both item titles and descriptions.
To assess the quality of the recommendations, we report human scoring conducted by a fashion expert. The same test set, composed of 100 seed items, was ranked by all models. The scoring was performed blindly, as the source model for each sample was hidden from the expert. For each seed, the expert rated the top five recommended items with a total score of 0 to 5, indicating poor to excellent performance. As can be seen in Tab. 2, RecoBERT outperforms all baselines, including the ones that utilize the BERT model that was specialized in the fashion domain. Specifically, RecoBERT gained a relative improvement of 9.4% and 5.9% compared to Specialist-BERT and MoverScore, respectively. See the supplementary materials for more results of RecoBERT applied to the fashion dataset.

Ablation Study
Tab. 3 presents an ablation study for RecoBERT inference, evaluated on the subset of the wine expert annotations. Six variants are considered, each eliminating different scores from RecoBERT inference by setting their matched $\lambda$s to 0. The results, shown in the table, indicate that it is crucial to employ all four scores, in the way it is done in RecoBERT, and that extracting information from both item titles and descriptions is highly beneficial for item similarity performance.

Computational Costs
We report computation times that were measured for RecoBERT training and inference on a single NVIDIA V100 32GB GPU using the PyTorch framework. For the wines catalog, we trained RecoBERT for 1.5M training steps. This training took ∼5 days. RecoBERT training on the fashion catalog comprised 150K steps and took ∼12 hours. RecoBERT inference with the same GPU allows a throughput of 340 items per second. This enables recommending the entire fashion catalog in ∼7 hours, and the wines test set in 9.5 hours. Notably, all these computations are applied once for a given catalog, and can be executed offline and cached for later use. To further accelerate the computation of the two $C_{TDM}$ scores applied through RecoBERT inference, one can adopt knowledge distillation techniques, such as (Barkan et al., 2019; Jiao et al., 2019; Lioutas et al., 2019), which are beyond the scope of this work.

Discussion and Conclusions
In this work, we introduce a novel natural language recommendation task along with a novel annotated test set that together contribute to the state-of-the-art research of text-based recommenders and language models. We present RecoBERT, a model for text-based item similarity that (a) mitigates the discrepancy between training and inference phases in the classical BERT model, by operating on sentence-paragraph pairs, (b) refines the backbone language model to provide more accurate embeddings, improving item similarities under the cosine metric, and (c) utilizes matched cosine scores as part of the inference process. In addition, we show that the unique mechanism behind RecoBERT leads to significant improvements over the other baselines and across all metrics.
RecoBERT's preeminence stems from two properties of its TDM loss. First, feeding title-description pairs allows RecoBERT to apply cross-attention between the tokens of both elements, entailing an effective dependency between their embeddings. Second, by leveraging the TDM task, RecoBERT learns an additional task for revealing the underlying connections between item titles and descriptions, which pushes the model to better specialize in the domain at hand.
In some cases, where titles do not correlate with item descriptions 4 , RecoBERT can be extended to utilize textual tags 5 , by either concatenating the tags with the title, or feeding them as an extra input to RecoBERT.
Compared to other semantic textual similarity tasks, the proposed wine recommendations task, along with our published annotated test set, can shed light on the limitations as well as the key advances of state-of-the-art NLP models for recommendations. In addition, by publishing our annotated wine recommendations dataset, we intend to encourage the community to further explore the boundaries of other NLP models, assessing the ability of machines to understand complex human language.

A Qualitative Results
Fig. 4 exhibits the top three recommendations for two representative wine seeds from the annotated test set, predicted by RecoBERT, while (1) ranking the entire 120K wines in the catalog (left column), and (2) ranking the subset of the 1095 annotated recommendations (right column). As can be seen in the figure, the third item retrieved by RecoBERT was also annotated by the human wine expert ("hit"). Specifically, for the "Gamay" seed, the origin of the first recommended wine is a neighboring country of the seed's origin, and both items share similar berry aromas and body intensity. The second wine is a Pinot Noir, known as Gamay's "cousin", and shares similar berry aromas. The third item, which is one out of ten expert annotations for this seed, shares the same origin, aromas, and flavors as the seed.
For the seed "Domaines Devillard", the top three recommendations retrieved by RecoBERT were all annotated by the wine expert. For this seed, all the recommendations share the same aging potential. Also, the seed, first, and third items are all Cru wines, sharing the same origin and variety. The origin of the second item is Burgundy, which is considered a high-quality vineyard, similar to Cru.

Fig. 5 exhibits a representative seed from the fashion catalog, along with the top three recommendations ranked by RecoBERT. As can be seen in the figure, the top three recommended items for "V-neck Cashmere Sweater" (1) are consistent with the sweaters sub-category, (2) comprise the same material or are indicated by a "PREMIUM QUALITY" label, and (3) preserve similar style properties, such as V-neck, and ribbing in a few locations on the sweater.