Answering Product-Questions by Utilizing Questions from Other Contextually Similar Products

Predicting the answer to a product-related question is an emerging field of research that has recently attracted a lot of attention. Answering subjective and opinion-based questions is the most challenging case, due to the dependency on customer-generated content. Previous works mostly focused on review-aware answer prediction; however, these approaches fail for new or unpopular products that have no (or only a few) reviews at hand. In this work, we propose a novel and complementary approach for predicting the answer for such questions, based on the answers for similar questions asked on similar products. We measure the contextual similarity between products based on the answers they provide for the same question. A mixture-of-experts framework is used to predict the answer by aggregating the answers from contextually similar products. Empirical results demonstrate that our model outperforms strong baselines on some segments of questions, namely those that have roughly ten or more similar resolved questions in the corpus. We additionally publish two large-scale datasets used in this work: one of similar product question pairs, and one of product question-answer pairs.


Introduction
Product-related Question Answering (PQA) is a popular and essential service provided by many e-commerce websites, letting consumers ask product-related questions to be answered by other consumers based on their experience. The large archive of accumulated resolved questions can be further utilized by customers to support their purchase journey and by automatic product question answering tools (e.g. Jeon et al. (2005); Cui et al. (2017); Carmel et al. (2018)). However, there are many unanswered questions on these websites, either because a newly issued question has not yet attracted the community's attention, or because of many other reasons (Park et al., 2015). This may frustrate e-commerce users, in particular when their purchase decision depends on the question's answer. Automatic PQA may assist the customers and the sellers by answering these unanswered questions, based on various diversified resources.
Previous PQA approaches leverage product specifications and description information (Cui et al., 2017; Lai et al., 2018; Gao et al., 2019), as well as customer reviews (Yu et al., 2012; McAuley and Yang, 2016; Yu and Lam, 2018; Das et al., 2019; Fan et al., 2019; Deng et al., 2020), for answering product-related questions. However, both approaches have notable shortcomings. Product information can typically address questions about product features and functionality, but cannot address complex and subjective questions such as opinion questions (Is it good for a 10 year old?), advice-seeking questions (What is the color that best fit my pink dress?), or unique usage questions (Can I play Fifa 2018 on this laptop?). Customer reviews, on the other hand, can partially address these kinds of questions (Wan and McAuley, 2016), yet many products have few or no reviews available, either because they are new on the site or because they are less popular.
We propose a novel and complementary approach for answering product-related questions based on a large corpus of PQA. Given an unanswered product question, we seek similar resolved questions about similar products and leverage their existing answers to predict the answer for the customer's question. We call our method SimBA (Similarity Based Answer Prediction). For example, the answer for the question "Will these jeans shrink after a wash?", asked about a new pair of jeans on the website, may be predicted based on the answers for similar questions asked about other jeans that share properties such as fabric material, brand, or style. An example is shown in Table 1. The main question we explore in this work is whether the answer to a product question can be predicted based on the answers for similar questions about similar products, and how reliable this prediction is.
As our method relies on the existing PQA corpus, it addresses the two mentioned shortcomings of the previous approaches. First, it can address a variety of product-related questions that are common in PQA, including subjective and usage questions. Second, our method can provide answers to new or less popular products as it leverages an existing set of similar questions from other similar products.
A key element of our proposed method is a novel concept that we refer to as Contextual Product Similarity, which determines whether two products are similar in the context of a specific question. For example, two smart-watches may be similar with regard to their texting capability but different with regard to sleep monitoring. In Section 3 we formally define this concept and propose a prediction model for measuring contextual similarity between products, with respect to a given question. Additionally, we describe an efficient method to train this model by leveraging an existing PQA corpus.
Another appealing property of SimBA is its ability to support the predicted answer by providing the list of highly similar questions upon which the answer was predicted, hence increasing users' confidence and enhancing user engagement.
Our main contributions are: (a) A novel PQA method that overcomes several shortcomings of previous methods. (b) A novel concept of Contextual Product Similarity and an effective way to automatically collect annotations to train this model. (c) Two published large-scale datasets: a question similarity dataset and a large-scale Amazon product questions and answers dataset; details are provided in Section 4.
Empirical evaluation of our method demonstrates that it outperforms a strong baseline in some question segments, and that a hybrid model is effective for the vast majority of the questions.

Related Work
Automatically answering product-related questions has become a permanent service provided by many e-commerce websites and services (Cui et al., 2017; Carmel et al., 2018). Questions are typically answered based on product details from the catalog, existing Q&As on the site, and customer reviews. Each of these resources, used for answer generation, has been studied extensively by the research community recently, probably due to the complexity of this task, the availability of appropriate datasets (McAuley, 2016), and the growing use of online shopping. Lai et al. (2018) built a question answering system based on product facts and specifications. They trained a question answering system by transfer learning from a large-scale Amazon dataset to the Home Depot domain. Gao et al. (2019) generated an answer from product attributes and reviews using an adversarial learning model composed of three components: a question-aware review representation module, a key-value attribute graph, and a seq2seq model for answer generation. Yu et al. (2012) answered opinion questions by exploiting a hierarchical organization of consumer reviews, where reviews were organized according to the product aspects.
The publication of Amazon datasets of reviews and Q&As (McAuley, 2016) triggered a flood of studies on review-aware answer prediction and generation. McAuley and Yang (2016) formulated the review-based question answering task as a mixture-of-experts framework: each review is an "expert" that votes on the answer to a yes/no question. Their model learns to identify 'relevant' reviews based on those that vote correctly. In a follow-up work, Wan and McAuley (2016) observed that questions have multiple, often divergent, answers, and that the full spectrum of answers should be further utilized to train the answering system.
Chen et al. (2019) described a multi-task attention mechanism which exploits large amounts of Q&As, and a few manually labeled reviews, for answer prediction. Fan et al. (2019) proposed a neural architecture, directly fed by the raw text of the question and reviews, to mark a review segment as the final answer, in a reading comprehension fashion. Das et al. (2019) learned an adversarial network for inferring reviews which best answer a question, or augment a given answer. Deng et al. (2020) incorporated opinion mining into review-based answer generation. Yu and Lam (2018) generated aspect-specific representations of questions and reviews for answer prediction on yes/no questions. Other work used transfer learning from a resource-rich source domain to a resource-poor target domain, by simultaneously learning shared representations of questions and reviews in a unified framework of both domains.
All of these works assume the existence of a rich set of product reviews to be used for question answering. This solution fails when no reviews are available. The challenge of review generation for a given product, while utilizing similar products' reviews, was addressed by Park et al. (2015). For a given product they extracted useful sentences from the reviews of other similar products. Similarly, Pourgholamali (2016) mined relevant content for a product from various content resources available for similar products. Both works focused on the extraction of generally useful product-related information rather than on answering a specific product question, as in our case. Moreover, the product-similarity methods they considered rely on product specifications and descriptions, and do not depend on the question to be answered, while our method considers the specific question at hand when estimating contextual product similarity.

Similarity-Based Answer Prediction
In this section, we introduce the Similarity-Based Answer-prediction (SimBA) method for predicting the answer for a product question, based on the answers for other similar product questions. We restrict our study to yes/no questions only, due to their popularity in the PQA domain (54% on our PQA dataset), and following common practices in answer prediction studies (McAuley and Yang, 2016;Yu and Lam, 2018). Figure 1 presents our prediction framework and its main components.
Formally, a question-product-answer tuple is denoted by $r_j = (q_j, p_j, a_j)$, where $a_j \in \{\text{yes}, \text{no}\}$. $C = \{r_j\}_{j=1}^{N}$ is the set of $N$ tuples of a given product category. $r_t = (q_t, p_t, ?)$ is the target record of an unanswered question $q_t$, asked about product $p_t$. We treat $C$ as the knowledge base we use for answering $q_t$.
Given a target record $r_t$, in order to predict its answer $a_t$, we first retrieve a set of records from $C$ with the most similar questions to $q_t$ (Figure 1, stage 1). We denote the retrieved records as siblings of $r_t$. We then filter the siblings by applying a Question-to-Question similarity (Q2Q) model, keeping only records with highly similar questions, which are expected to have the same question intent as $q_t$ (Figure 1, stage 2). We denote these records as twins of $r_t$. We then apply our Contextual Product Similarity (CPS) model to measure the contextual similarity between $r_t$ and its twins (Figure 1, stage 3). The CPS similarity score is used to weight the twins by considering them as voters, applying a mixture-of-experts model over their answers for the final answer prediction (Figure 1, stage 4). More details about the model's components, the training processes, and other specifications are described in the following.

Sibling Retrieval
Given a target record $r_t$ and a corpus of product-question-answer records $C$, our first goal is to retrieve all records with a question having the same intent as $q_t$. As $C$ might be very large, applying a complex neural model to measure the similarity of each question in $C$ to $q_t$ is often infeasible. We therefore apply a two-step retrieval process. In a preliminary offline step, we index the records in $C$ by creating embedding vectors for their questions, using a pre-trained encoder. For retrieval, done both during training and inference, we similarly embed the question $q_t$ into a vector $e_t$. We then use a fast Approximate K Nearest Neighbors (AKNN) search to retrieve $K$ records with the most similar questions, based on the cosine similarity between $e_t$ and the embedding vectors of the questions in $C$. We denote the set of retrieved siblings of $r_t$ by $S(r_t)$.
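The retrieval step above can be illustrated with a minimal sketch. It uses an exact cosine-similarity ranking in place of an approximate index, so it shows only the ranking that an AKNN search would approximate; the function names and the toy two-dimensional vectors are assumptions made for illustration, not part of the described system.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_siblings(query_vec, corpus_vecs, k):
    """Return the indices of the k corpus vectors most similar to query_vec.

    Exact search: an AKNN index would return an approximation of this list.
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy question embeddings (hypothetical values).
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
sibling_ids = retrieve_siblings([1.0, 0.0], corpus, k=2)
```

Exact search costs one similarity computation per corpus record, which is the cost the approximate index is meant to avoid on a large corpus.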

Twin detection
The retrieved sibling records are those with the most similar questions to the target question. In the second step of the retrieval process, we enhance our record selection by applying a highly accurate transformer-based Question-to-Question (Q2Q) classifier (see Section 5.1), which we train over our question-to-question similarity dataset (Section 4.1). The $\mathrm{Q2Q}(q_t, q_k)$ classifier predicts the similarity between a target question $q_t$ and each of the questions $q_k$ in $S(r_t)$. A record $r_k$ is considered a twin of $r_t$ if $\mathrm{Q2Q}(q_t, q_k) > \gamma$, where $0.5 \le \gamma \le 1.0$ is a hyper-parameter of the system. We denote the set of twins of $r_t$ by $T(r_t)$.
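As a small illustration of the thresholding rule above, the sketch below filters siblings by a Q2Q score. The scores are supplied as a plain dictionary with hypothetical values; in the described system they would come from the trained Q2Q classifier, which is not reproduced here.

```python
def select_twins(sibling_scores, gamma=0.9):
    """Keep the siblings whose Q2Q similarity score exceeds gamma.

    sibling_scores: dict mapping a sibling id to its Q2Q score in [0, 1]
    (hypothetical input format).
    """
    if not 0.5 <= gamma <= 1.0:
        raise ValueError("gamma is expected to lie in [0.5, 1.0]")
    return [sid for sid, score in sibling_scores.items() if score > gamma]


# Hypothetical Q2Q scores for three siblings of a target record.
twins = select_twins({"r1": 0.97, "r2": 0.62, "r3": 0.91}, gamma=0.9)
```

Raising gamma trades recall for precision: fewer siblings survive, but those that do are more likely to share the target question's intent.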

Contextual Product Similarity (CPS)
We consider products $p_1$ and $p_2$ to be contextually similar, with respect to a yes/no question $q$, if the answer to $q$ on both products is the same. By design, both products belong to the same product category $C$, which prevents comparing unrelated products; for example, comparing an airhorn and a computer fan in the context of the question Is it loud? is thereby prevented. Given a pair of twin records $(r_1, r_2)$, our CPS model aims to predict the contextual similarity between them, i.e. whether their (highly similar) questions have the same answer.
Since $r_1$ and $r_2$ are twins, their questions are expected to have the same intent; yet, they might be phrased differently. To avoid losing any information, we provide both questions as input to the CPS model, during training and during inference time. Figure 2 depicts the CPS model for predicting the contextual similarity between a target record $r_t$ and one of its twin records $r_j$. For each record, the question-product pair is embedded using a pre-trained transformer encoder, allowing the product textual content and the question text to attend to each other.
Figure 2: The target question-product pair $(q_t, p_t)$ and the twin question-product pair $(q_j, p_j)$ are encoded using a transformer encoder, while the questions attend to the product text. The texts of both products are coupled and also encoded, allowing the two product texts to attend to each other. The three output vectors are then concatenated and classified using an MLP classifier.

CPS Model Architecture
The two models share weights to avoid over-fitting and for more efficient learning. A second encoder embeds the textual content of both products, encapsulating the similarity between them. Then, an MLP with one hidden layer takes the concatenation of the three embedding vectors to predict the probability of $a_t = a_j$.
Another key advantage of the CPS model is its ability to be trained on a large scale, without human annotations, by simply deriving the training labels directly from the polarity between the answers of twin pairs extracted from our training data. For any pair of twins $(r_i, r_j)$, the pair is labeled as similar if $a_i = a_j$ and as different otherwise (Equation 2).
Table 2: Examples from Amazon-PQSim Dataset. Each example consists of a user-generated question pair and a human-annotated label for their similarity.
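One way to write the CPS architecture described in this section compactly is sketched below. The symbols $h_t$, $h_j$, $h_{tj}$, $\mathrm{Enc}_{qp}$, $\mathrm{Enc}_{pp}$ and $\oplus$ (concatenation) are notation assumed here for illustration; only $\psi_{tj} = \mathrm{CPS}(r_t, r_j)$ follows the notation used elsewhere in the text.

```latex
\begin{align*}
h_t &= \mathrm{Enc}_{qp}(q_t, p_t), \qquad h_j = \mathrm{Enc}_{qp}(q_j, p_j)
    && \text{(shared question-product encoder)} \\
h_{tj} &= \mathrm{Enc}_{pp}(p_t, p_j)
    && \text{(product-product encoder)} \\
\psi_{tj} &= \mathrm{MLP}\left(h_t \oplus h_j \oplus h_{tj}\right) \approx P(a_t = a_j)
    && \text{(one-hidden-layer classifier)}
\end{align*}
```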

Mixture of Experts
A mixture of experts is a widely-used method to combine the outputs of several classifiers by associating a weighted confidence score with each classifier (Jacobs et al., 1991). In our setting, experts are individual twins that lend support for or against a particular answer for a question. Each twin is weighted by its contextual similarity to the target record $r_t$, as predicted by the CPS model. Given a target record $r_t$, the weight of each of its twins $r_j \in T(r_t)$ is determined from $\psi_{tj} = \mathrm{CPS}(r_t, r_j)$ together with a lower weight limit $0 \le w_{min} \le 0.5$, a hyper-parameter that we tune on the development set. The predicted class of $a_t$ is therefore derived by

$$\mathrm{Pred}(a_t \mid r_t) = \mathrm{sign}\Big(\sum_{r_j \in T(r_t)} w_j \, \delta(a_j)\Big) \qquad (3)$$

where $w_j$ denotes the weight of twin $r_j$, a positive/negative $\mathrm{Pred}$ indicates 'yes'/'no' respectively, and

$$\delta(a) = \begin{cases} +1, & a = \text{'yes'} \\ -1, & a = \text{'no'} \end{cases}$$

Our methodology can be easily expanded to incorporate more answer predictors (voters) of different types into SimBA. An example of such an expansion is described in Section 5.3.
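The weighted vote described in this section can be sketched as follows. The exact weighting function is not recoverable from the text above, so this sketch assumes that a twin's weight is its CPS score floored at w_min; that choice, the tie-breaking toward 'no', and the function names are assumptions made for illustration only.

```python
def delta(answer):
    """Map a yes/no answer to a +1/-1 vote."""
    return 1 if answer == "yes" else -1


def predict_answer(twins, w_min=0.38):
    """Predict 'yes' or 'no' from a list of (psi, answer) twin pairs.

    psi is the CPS score of the twin. ASSUMPTION: the weight is
    max(psi, w_min); the original weighting formula may differ.
    A zero total is resolved as 'no' (also an assumption).
    """
    total = sum(max(psi, w_min) * delta(answer) for psi, answer in twins)
    return "yes" if total > 0 else "no"


# One contextually similar 'yes' twin outweighs two dissimilar 'no' twins.
prediction = predict_answer([(0.9, "yes"), (0.2, "no"), (0.3, "no")])
```

With equal weights the same three twins would vote 'no' by two to one, which is the difference the contextual weighting is meant to capture.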

Datasets
We introduce two new datasets to experiment with our answer prediction approach: 1) The Amazon Product Question Similarity (Amazon-PQSim) dataset which is used to train our Q2Q model; 2) The Amazon Product Question Answers (Amazon-PQA) dataset of product related Q&As, used for training the SimBA model.

Amazon-PQSim Dataset
We collected a first-of-a-kind question-to-question similarity dataset of product-question pairs from the Amazon website, Amazon-PQSim (see Table 2 for examples). Unlike the Quora dataset of general question pairs (https://www.kaggle.com/c/quora-question-pairs), product questions are asked in the context of a designated product page. This makes them unique and different from questions asked in other domains. For example, the question Is it waterproof?, when it appears on the Fitbit Flyer detail page, should implicitly be interpreted as Is Fitbit Flyer waterproof?.
The following steps were taken for the data collection: (a) We randomly sampled product questions from the Amazon website. (b) We filtered out some of these questions (e.g., non-English questions; for more details, see Appendix A). (c) For each of the remaining questions, we retrieved up to three candidate similar questions from the collection. A question is paired with the original question if the Jaccard similarity between them is in the range [0.3, 0.5]. We ignore highly similar questions (> 0.5), since we do not want nearly verbatim pairs in our dataset, as well as dissimilar pairs (< 0.3). (d) Finally, we used the Appen crowd-sourcing platform (https://appen.com) for manual annotation of question-pair similarity. As the questions are asked in the context of a specific product, they are often written in an anaphoric form (e.g. Is it waterproof?). To keep our dataset general, we instructed the judges to accept such questions as if they included the actual related product name. For example, the pair Is it waterproof? and Is this Fitbit waterproof? was labeled as similar. Each question pair was labeled by at least three judges, and up to seven, until reaching agreement of 70% or more.
The above steps resulted in a nearly balanced dataset (1.08 positive-negative ratio) of more than 180K product question pairs with judges agreement of 70% or more, and among them about 90K question pairs have perfect judges agreement (1.14 positive-negative ratio).
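The Jaccard-based candidate filter of step (c) can be sketched as below. Tokenizing by lower-casing and splitting on whitespace is an assumption; the text does not state how questions were tokenized, and the function names are hypothetical.

```python
def jaccard(q1, q2):
    """Jaccard similarity between the word sets of two questions."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_candidate_pair(q1, q2, low=0.3, high=0.5):
    """Keep pairs that are neither near-verbatim (> high) nor dissimilar (< low)."""
    return low <= jaccard(q1, q2) <= high


# Two shared words out of five distinct words: similarity 0.4.
score = jaccard("is it waterproof", "is this watch waterproof")
keep = is_candidate_pair("is it waterproof", "is this watch waterproof")
```

Restricting candidates to the middle band leaves the pairs whose similarity cannot be decided from word overlap alone, which are the ones worth sending to human judges.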

Amazon-PQA Dataset
We collected a large corpus of product questions and answers from the Amazon website, similar to the popular Amazon Q&A dataset (McAuley, 2016). Since our answer prediction method directly utilizes an existing corpus of resolved questions, we aim to collect all available questions per narrow sub-category, rather than a sample of questions across broad categories as in the popular Amazon Q&A dataset. For example, instead of sampling from the broad Electronics category, we collect all questions under the narrower Monitors and Receivers categories.

Raw Data Extraction
We collected all product questions, with their answers, from 100 subcategories, available on the Amazon website in August 2020. Overall, 10M questions were collected, with 20.7M answers, on 1.5M products. For full statistics of the raw data, see Table 7 in Appendix A.

Yes/No Question Classification
We followed He and Dai (2011) for detecting Yes/No questions using simple heuristics. See Appendix A for details.

Yes/No Answer Labeling
Questions are typically answered by free-text answers, posted independently by multiple users. In order to convert these answers into a single yes/no answer, we first classified each answer into one of three classes: yes, no and maybe, and then used majority vote among the classified answers. We used a pre-trained RoBERTa-based classifier, and trained the model on McAuley's dataset (McAuley, 2016), taking only yes/no questions. See Appendix A for details.

Experiments
We experiment with eleven product categories covered by our Amazon-PQA dataset (Section 4.2), training a SimBA answer prediction model for each of the categories independently. Next, we describe the data preparation steps for each of the SimBA components.

Data Preparation
Sibling Retrieval Using AKNN. For each record $r \in C$ ($C$ is the category dataset), we use AKNN to retrieve the top-$K$ similar siblings from $C$, while making sure that none of them shares the same product with $r$. We collect training example pairs by coupling each record $r$ with each of its siblings. For retrieval we use the Universal Sentence Encoder (USE) (Cer et al., 2018) to embed each question $q_i$ into a 512-length vector $e_i$. We use the Annoy Python library for the implementation of efficient AKNN retrieval. In all experiments, for each record we retrieve the top-$K$ ($K = 500$) similar records, based on the cosine similarity between the embedding vectors.
Twin Detection Using the Q2Q Model. For each sibling pair $(r_i, r_j)$, we use our Q2Q model to score their question similarity and keep only those with $\mathrm{Q2Q}(q_i, q_j) > \gamma$ to yield a collection of twin pairs, $D(C)$. We use $\gamma = 0.9$ to ensure only highly similar question pairs. For our Q2Q model, we apply a standard pre-trained RoBERTa (Liu et al., 2019) classifier. Specifically, we use the Hugging Face base-uncased pre-trained model and fine-tune it for the classification task on our Q2Q dataset, while splitting the data into train, dev and test sets with an 80%-10%-10% partition, respectively. For $\gamma = 0.5$ (its minimal value) the model achieves a test accuracy of 83.2% with a precision of 81.3% and a recall of 87.7%. When setting the twin confidence threshold to $\gamma = 0.9$, the precision of the Q2Q model rises to 89.9% with a recall of 69.5%.
We compare the performance of the Q2Q similarity classifier with several unsupervised baselines, namely: (a) Jaccard similarity, (b) cosine similarity over USE embedding, and (c) cosine similarity over RoBERTa embedding. The results are summarized in

CPS Model
Training. The CPS model predicts the contextual similarity between a pair of twin records. In our experiments, the textual content of a product consists of the product title concatenated with the product bullet points, separated by semicolons. The question text is the original query as it appeared in the Amazon-PQA dataset. For the encoding modules of the CPS model we use a standard pre-trained RoBERTa-based model as well, using the [SEP] token to separate the two inputs to each encoder. For training, twin pairs are labeled according to their contextual similarity using Equation 2. We train, fine-tune, and test an independent CPS model for each category set $C$, using $D(C)$, $D_{dev}(C)$, and $D_{test}(C)$ (details of the data split are described in Appendix A). The training set $D(C)$ is created as described in Section 5.1. $D_{dev}(C)$ and $D_{test}(C)$ are created in the same way with one modification: rather than retrieving the siblings for a record from the dataset it belongs to, the siblings are retrieved from $D(C)$ for both $D_{dev}(C)$ and $D_{test}(C)$. This represents a real-world scenario where existing products with their related questions are used as a corpus for predicting the answer to a question about a new product. Each product, with all its related questions, appears in only one of these sets.

Evaluation
We evaluate the CPS model by measuring the accuracy of its contextual similarity prediction over $D_{test}(C)$. The accuracy per category is presented in Table 4. The model achieves a relatively high accuracy with a macro average of 77.2% over all categories, a significant lift of 9.7% over the majority decision baseline. This is an encouraging result, considering the fact that the answers for many questions cannot be directly inferred from the product textual information. We conjecture that the model is able to learn the affinity between different products, in the context of a given question, for predicting their contextual similarity. For example, the two backpacks Ranvoo Laptop Backpack and Swiss Gear Bungee Backpack were correctly classified by the CPS model as similar ($\psi \ge 0.5$) in the context of the question "Will this fit under a plane seat?", and classified as different ($\psi < 0.5$) in the context of the question "Does it have a separate laptop sleeve?".

Answer Prediction Methods
We experiment with our SimBA model and with a few baselines over the test set of all categories. The first one is Majority which returns the majority answer among all records in the category. Other methods are described next.
SimBA. Given a target record $r_t$, SimBA scores each of its twins by the CPS model and predicts the answer for $q_t$ using Equation 3. $w_{min}$ was fine-tuned on the combined dev set of all categories and was set to 0.38.

Question Similarity Only (QSO)
We modify the SimBA model to ignore the CPS classification score when implementing the Mixture-of-Experts model (Eq. 3), by setting an equal weight of 1.0 to all twin votes: $\mathrm{Pred}(a_t \mid r_t) = \mathrm{sign}\big(\sum_{r_j \in T(r_t)} \delta(a_j)\big)$.

Product Similarity Only (PSO)
We modify the SimBA model by setting $q_t$ and $q_j$ to empty strings at the input of the CPS model, both during training and during inference, forcing it to rely on the products' textual content alone. The twin retrieval process remains untouched.

Answer Prediction Classifier (APC)
We experiment with a direct prediction approach that only considers the product textual content and the question for answer prediction. For each category $C$, we fine-tune a pre-trained RoBERTa-based classifier over all records $r_j \in C$, using $q_j$ and $p_j$ (separated by the [SEP] token) as input and $\delta(a_j)$ as the training label.
SimBA+APC. The experimental results show that different answer-prediction methods (e.g. SimBA vs. APC) may be preferable for different product categories. Therefore, to achieve the best results, we combine both methods by mixing the vote of APC with the twin votes, using the Mixture-of-Experts approach. Here $\alpha_t$ is the APC predicted answer, and its vote is weighted by $\eta(r_t) = \eta_1$, $\eta_2$ and $\eta_3$ for $|T(r_t)| \le 10$, $10 < |T(r_t)| < 50$ and $|T(r_t)| \ge 50$, respectively (footnote 16). All $\eta$ values ($\eta > 0$) are fine-tuned on the development set for each category separately. The values we used are detailed in Table 10 in Appendix A.
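A minimal sketch of such a hybrid vote follows. It assumes that the APC vote is simply added to the weighted twin votes with weight eta, and that a twin's weight is its CPS score floored at w_min; both are assumptions about formulas not recoverable from the text, and the eta values used in the example are hypothetical, not the tuned values of Table 10.

```python
def delta(answer):
    """Map a yes/no answer to a +1/-1 vote."""
    return 1 if answer == "yes" else -1


def eta_for(num_twins, eta1, eta2, eta3):
    """Pick the APC weight by the number of twins, using the three ranges in the text."""
    if num_twins <= 10:
        return eta1
    if num_twins < 50:
        return eta2
    return eta3


def predict_hybrid(twins, apc_answer, etas, w_min=0.38):
    """Combine the APC vote with the twin votes (ASSUMED additive form)."""
    eta = eta_for(len(twins), *etas)
    twin_votes = sum(max(psi, w_min) * delta(a) for psi, a in twins)
    total = eta * delta(apc_answer) + twin_votes
    return "yes" if total > 0 else "no"


hypothetical_etas = (2.0, 1.0, 0.5)
few = predict_hybrid([(0.6, "no"), (0.7, "no")], "yes", hypothetical_etas)
many = predict_hybrid([(0.6, "no")] * 12, "yes", hypothetical_etas)
```

In this toy setting APC decides when only two twins are available, while twelve agreeing twins override it, which mirrors the intent of making the APC weight depend on the number of twins.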

Answer Prediction Evaluation
The answer prediction accuracy results of all tested predictors, macro-averaged over D test (C) of all categories, are presented in Figure 3. We inspect the performance of the methods on different subsets of the test data, where each subset is determined by all records having at least x twins, x ∈ [0..130].
The horizontal axis indicates the minimal number of twins in the subset and the percentage of the data each subset represents. For example, the results at x = 0 represent the entire test set, while the results at x = 10 represent the subset of questions with at least 10 twins, which accounts for 40.2% of the test set. The performance of Majority begins at 66% (the percentage of 'yes' questions in the entire population) and drops for questions with many twins. We hypothesize that "obvious" questions, for which the answer is the same across many products, are rarely asked and hence have fewer twins. In contrast, informative questions, for which the answer varies across products, are frequently asked w.r.t. many products, and hence have many twins. Therefore we see a drop in accuracy of the Majority baseline as the number of twins grows.
Footnote 16: We also tried a few different splits on the development set.
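The subset-based evaluation described above can be sketched as follows: for a threshold x, accuracy is computed over the records with at least x twins, together with the share of the test set that this subset covers. The record format and the toy values are assumptions made for illustration.

```python
def accuracy_at_min_twins(records, x):
    """Return (accuracy, coverage) over records with at least x twins.

    records: list of (num_twins, predicted, gold) tuples (hypothetical format).
    Accuracy is None when no record has at least x twins.
    """
    subset = [(pred, gold) for n, pred, gold in records if n >= x]
    if not subset:
        return None, 0.0
    accuracy = sum(pred == gold for pred, gold in subset) / len(subset)
    return accuracy, len(subset) / len(records)


# Toy test set: (number of twins, predicted answer, gold answer).
toy = [(0, "yes", "no"), (5, "yes", "yes"), (12, "no", "no"), (70, "yes", "yes")]
all_acc, all_cov = accuracy_at_min_twins(toy, 0)
acc10, cov10 = accuracy_at_min_twins(toy, 10)
```

Because the subsets are nested, a curve over x shows how each method behaves as the available support grows, at the price of covering a shrinking share of the questions.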
The accuracy of QSO is significantly higher than that of the majority-vote baseline. This demonstrates an interesting phenomenon in the data: similar questions tend to have the same answer over a variety of products, typically of the same type. A few examples are presented in Table 5. The QSO method successfully detects these groups of questions and predicts the majority answer for each such group. We find that the PSO method generally does not improve over QSO. This is somewhat surprising, as we expected that using product similarity information, such as brand, model, or key features, would increase the prediction accuracy. This demonstrates the importance of question context, as used in SimBA, in addition to the product information alone.
Moving to SimBA, we can see a large performance improvement over the QSO and PSO methods, which we attribute directly to the CPS model. We also see consistent improvement in accuracy with the number of twins, likely due to the larger support the model has for predicting the answer.
The APC method, despite its relative simplicity, performs very well and greatly outperforms the majority-vote, QSO, and PSO baselines. For the segment of questions with fewer than 10 twins, APC outperforms the SimBA method. This segment represents roughly 60% of the questions. However, for the segment of questions with 60 or more twins, which accounts for 13.6% of the questions, the SimBA method consistently outperforms the inductive baseline by 1-2%. When inspecting the results by category, as shown in Table 6, we can see that considering all questions with at least 1 twin, the APC method dominates in 7 out of the 11 categories, while for questions with at least 60 twins, the SimBA method dominates in 6 out of the 11 categories.
Finally, we see that the two approaches complement each other and can be effectively joined, as the SimBA+APC method outperforms both of them over all subsets.

Conclusions
We presented SimBA, a novel answer prediction approach in the PQA domain, which directly leverages similar questions answered with respect to other products. Our empirical evaluation shows that on some segments of questions, namely those with roughly ten or more similar questions in the corpus, our method can outperform a strong inductive method that directly utilizes the question and the textual product content. We further show that the two approaches are complementary and can be integrated to increase the overall answer prediction accuracy.
For future work, we plan to explore how SimBA can be extended and applied beyond yes/no questions, e.g., to questions with numerical answers or open-ended questions. Another interesting research direction is adding further voters to the Mixture-of-Experts model, such as a review-aware answer predictor or a product details-based predictor. Additionally, our current evaluation considered a static view of the answered product-question corpus. We plan to explore temporal aspects of our method, for example, considering question age or ignoring answers about obsolete products that might be irrelevant.

A.1 Amazon-PQSim dataset
The Amazon-PQSim dataset includes question pairs, where all questions are published on the Amazon website. Each pair has a corresponding label: 1 for similar, else 0. The labels were collected via the Appen crowd-sourcing service. We took the following filtering steps (step b in Section 4.1) for each question:
• Removed any question with fewer than five words.
• Removed any question with more than 15 words.
• Removed any non-English questions.
• Removed any question with multiple question marks (may indicate multiple questions).
• Removed questions with rare words (any word which is not in the top 2000 most frequent words).
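The filtering steps listed above can be sketched as a single predicate. The language-identification step is left out, since it needs an external detector; the tokenization pattern and the tiny example vocabulary are assumptions, standing in for the top-2000 frequent-word list.

```python
import re


def keep_question(question, vocabulary):
    """Return True if the question passes the length, punctuation and rare-word filters.

    vocabulary: set of allowed (frequent) words; a hypothetical stand-in for the
    top-2000 list. The non-English filter is intentionally not implemented here.
    """
    words = re.findall(r"[a-z0-9']+", question.lower())
    if len(words) < 5 or len(words) > 15:
        return False
    if question.count("?") > 1:
        return False
    return all(word in vocabulary for word in words)


vocab = {"is", "it", "this", "watch", "waterproof", "for", "swimming"}
ok = keep_question("Is this watch waterproof for swimming?", vocab)
too_short = keep_question("Is it waterproof?", vocab)
two_marks = keep_question("Is this watch waterproof? Is it for swimming?", vocab)
```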

Yes/No Question Classification
We followed He and Dai (2011) for detecting Yes/No questions using simple heuristics.
Next, to determine each question's final yes/no answer, we first omitted answers classified as maybe. When a question is answered by a verified seller, we considered that answer the most reliable and used it as the final label. Otherwise we used the majority vote among the remaining answers. In our experiments, we ignore questions with an equal number of yes and no answers.
Dataset Split. Each item in our dataset is a (product, question, answer) triplet. We split the labeled triplets into train (80%), dev (10%), and test (10%) sets for each category, where the percentages refer to the number of products. Each product, with all its related questions, appears in only one of these sets. The statistics for this dataset are given in Table 8.
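The answer aggregation rule just described can be sketched as below. The input format is hypothetical, and taking the first verified-seller answer when several exist is an assumption, since the text does not say how such cases were handled.

```python
from collections import Counter


def final_label(answers):
    """Aggregate classified answers into 'yes', 'no', or None (question ignored).

    answers: list of (label, is_verified_seller) pairs, with label in
    {'yes', 'no', 'maybe'} (hypothetical format).
    """
    decided = [(label, seller) for label, seller in answers if label != "maybe"]
    for label, seller in decided:
        if seller:
            # ASSUMPTION: the first verified-seller answer is used.
            return label
    counts = Counter(label for label, _ in decided)
    if counts["yes"] == counts["no"]:
        return None  # tie (or no usable answers): the question is ignored
    return "yes" if counts["yes"] > counts["no"] else "no"


majority = final_label([("yes", False), ("no", False), ("yes", False), ("maybe", False)])
seller_wins = final_label([("yes", False), ("yes", False), ("no", True)])
tie = final_label([("yes", False), ("no", False)])
```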
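A product-level split of this kind can be sketched as follows. The shuffling seed, the rounding of the split sizes, and the function name are assumptions made for illustration.

```python
import random


def split_by_product(triplets, seed=0):
    """Split (product_id, question, answer) triplets 80/10/10 by product.

    All triplets of a product land in the same split, so no product is
    shared between train, dev and test.
    """
    products = sorted({product for product, _, _ in triplets})
    random.Random(seed).shuffle(products)
    n_train = int(0.8 * len(products))
    n_dev = int(0.1 * len(products))
    bucket = {}
    for i, product in enumerate(products):
        if i < n_train:
            bucket[product] = "train"
        elif i < n_train + n_dev:
            bucket[product] = "dev"
        else:
            bucket[product] = "test"
    splits = {"train": [], "dev": [], "test": []}
    for triplet in triplets:
        splits[bucket[triplet[0]]].append(triplet)
    return splits


# Ten toy products with two questions each.
toy = [(f"p{i}", f"q{j}", "yes") for i in range(10) for j in range(2)]
splits = split_by_product(toy)
```

Splitting by product rather than by triplet is what lets the dev and test sets stand in for new products whose questions were never seen during training.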

A.3 CPS Model Details
The CPS model has a total of 254.6M parameters. For all incorporated RoBERTa models we use a maximum sequence length of 256, a dropout of 0.1, and a batch size of 32 for training. We applied different learning rates and numbers of epochs for each product category. The specific values we used after tuning are shown in Table 9. Rest