Semi-supervised Category-specific Review Tagging on Indonesian E-Commerce Product Reviews

Product reviews are a huge source of natural language data in e-commerce applications. Several millions of customers write reviews regarding a variety of topics. We categorize these topics into two groups as either “category-specific” topics or as “generic” topics that span multiple product categories. While we can use a supervised learning approach to tag review text for generic topics, it is impossible to use supervised approaches to tag category-specific topics due to the sheer number of possible topics for each category. In this paper, we present an approach to tag each review with several product category-specific tags on Indonesian language product reviews using a semi-supervised approach. We show that our proposed method can work at scale on real product reviews at Tokopedia, a major e-commerce platform in Indonesia. Manual evaluation shows that the proposed method can efficiently generate category-specific product tags.

1 Introduction
E-commerce product reviews are a rich source of direct feedback from customers. Written in free-text natural language, product reviews contain a significant amount of information on a variety of topics that are important to prospective buyers.
Tokopedia (www.tokopedia.com) conducted customer survey research to understand the sources of information that potential buyers assess while making a purchase decision. This internal research shows that around 15% of customers consider product reviews the most important source of information, which ranks it third highest among all 20 possible information sources. Internal analysis of the "click rate" of various components on the platform's product listing page also shows that components related to product reviews have the second-highest click rate, which further emphasises the importance of product reviews for prospective buyers.
Although reviews are important information sources, manually filtering relevant information is a cumbersome process for a buyer when making a purchase decision. Tokopedia has several hundreds of millions of customer reviews, generated by millions of users over the years. Therefore, extracting relevant tags for each product so that prospective buyers can quickly filter the most relevant reviews based on their topic of interest becomes important to make a quick purchase decision and improve buyer engagement on the platform.
We categorize topics in reviews into two types. The first type consists of generic topics, which appear in reviews of products from any category and cover the generic information that customers care about. On an e-commerce platform, for example, the generic topics are "customer service", "delivery", "packaging quality", "price", and so on. The second type consists of category-specific topics, which are detailed descriptions of product-specific attributes. Since different products have different attributes, the category-specific topics differ widely between product categories. For example, for products in the Phone Case category, a category-specific topic could be "cable hole", while for products in the Herbal Medicine category, a category-specific topic would be "ingredients". The focus of this paper is to generate tags of category-specific topics for products across different categories.
There are several challenges for this work. Firstly, the category-specific topics differ widely among products of different categories. Therefore, it is impossible to obtain labeled data for the supervised methods that are normally used to generate tags. Secondly, we work on informal Indonesian. Although Indonesian shares its alphabet with English, it differs from English in significant ways, such as sentence structure, prefix and suffix modifiers, and slang spellings. Moreover, since we work on reviews, the texts are informal and contain a mixture of Indonesian, English, abbreviations and slang, which further increases the difficulty.
The focus of this work is to address the above-mentioned challenges. We proposed a semi-supervised method, and successfully applied it to product reviews from different categories in the e-commerce platform. We also evaluated our results with manually labeled data.
The rest of this paper is organized as follows. We describe related work in the literature in Section 2. We then describe our approach to extract category-specific tags from Indonesian language review text in Section 3. Experiments and results are discussed in Section 4.

Related Work
While we can use a supervised learning approach to get generic topics from product reviews, it is impossible to use supervised approaches with "category-specific" topics due to the sheer number of possible topics for each product category. Therefore, we use an unsupervised method to extract topics from product reviews in this paper.
Among the earliest unsupervised methods for extracting keywords from text are statistics-based methods. A frequency or Term Frequency-Inverse Document Frequency (TF-IDF) score is calculated on the n-grams of all the reviews, and the n-grams with the highest scores are extracted as tags. Graph-based methods (Mihalcea and Tarau, 2004; Altuncu et al., 2019) can also be used to extract keywords, where each token is a vertex and an edge is defined when two tokens are in the same context window. Both kinds of methods, however, fail to group n-grams of similar meaning together.
Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and its variants (Yan et al., 2013; Xiong and Guo, 2019) are popular methods to group words into topics. However, LDA processes a document as a bag of words, with the assumption that the words are independent of each other. Therefore this method loses valuable occurrence information. Clustering methods such as k-means and DBSCAN can group similar words based on word embeddings. However, word embeddings are high-dimensional data, and clustering fails to work well on them due to the curse of dimensionality.
A neural network model was proposed by He et al. (2017) to group phrases into topics. It overcomes the drawbacks of LDA and clustering methods by utilizing the embedding information with attention mechanism to attend to important tokens in the sentence. We use this model in this paper.

Phrase Extraction
We extract phrases from each text review using Stanford NLP's dependency parser (Manning et al., 2014). Among all the extracted dependencies (Nivre et al., 2016), we choose three kinds, as shown in Table 1. These dependencies involve nouns, so the phrases extracted from them are more likely to be about the products. Examples of dependencies that are not selected, such as those involving verbs and adverbs, are shown in Table 2. We further drop phrases that contain stop words from the NLTK Indonesian stop word list (https://www.nltk.org/) or from a list that is manually labeled by an internal product team. We remove stop words only after phrase extraction, since the parser needs the complete sentence as input to extract phrases accurately.
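The filtering step can be sketched as follows, assuming the parser output has already been converted into (head, relation, dependent) triples. The relation names in `SELECTED_RELATIONS` are hypothetical placeholders and are not the three dependencies of Table 1; the stop word set is a toy stand-in for the NLTK and internal lists.

```python
from typing import Iterable, List, Set, Tuple

# (head token, dependency relation, dependent token), as a parser might emit.
Dependency = Tuple[str, str, str]

# Hypothetical placeholders; the relations actually used are those in Table 1.
SELECTED_RELATIONS: Set[str] = {"amod", "compound", "nmod"}


def extract_phrases(
    dependencies: Iterable[Dependency],
    selected_relations: Set[str],
    stop_words: Set[str],
) -> List[Tuple[str, str]]:
    """Keep two-token phrases from selected relations that contain no stop word."""
    phrases = []
    for head, relation, dependent in dependencies:
        if relation not in selected_relations:
            continue  # dependencies outside the selected set are skipped
        # Stop words are checked only here, after the full sentence was parsed.
        if head.lower() in stop_words or dependent.lower() in stop_words:
            continue
        phrases.append((head.lower(), dependent.lower()))
    return phrases


# Toy triples; tokens and relations are illustrative only.
toy_dependencies = [
    ("lubang", "compound", "kabel"),
    ("pas", "advmod", "sangat"),
    ("produk", "amod", "yang"),
]
toy_stop_words = {"yang", "dan", "sangat"}
phrases = extract_phrases(toy_dependencies, SELECTED_RELATIONS, toy_stop_words)
```

In this toy input only the first triple survives: the second uses a relation outside the selected set, and the third contains a stop word.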

Topic Generation
A topic is a group of phrases sharing a similar concept. Different topics, on the other hand, are separate groups of phrases of different concepts. On the phrases from each product category, we apply the Unsupervised Aspect Extraction (UAE) model (He et al., 2017) to extract topics. The UAE model generates topics by first learning K topic embeddings, where the number of topics K is predefined. Phrases within a product category are then assigned to the topic whose embedding is closest. As shown in Figure 2, the model has three layers: the embedding layer, the attention layer and the auto-encoder layer. We concatenate the review phrases from one product as the input to the embedding layer. The embedding layer is initialized with a word2vec embedding of dimension d that is trained on all the reviews of this category. Since the Stanford NLP dependency parser generates phrases with two tokens, concatenating the embeddings of the two tokens in a phrase gives us a phrase embedding of dimension 2d.
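A minimal sketch of the phrase embedding and the assignment of a phrase to its closest topic is given below. The use of cosine similarity as the closeness measure is an assumption, as are the toy word vectors and topic embeddings; in the pipeline these would come from word2vec and from the trained model.

```python
import math
from typing import Dict, List, Sequence


def phrase_embedding(
    tokens: Sequence[str], word_vectors: Dict[str, List[float]]
) -> List[float]:
    """Concatenate the d-dimensional vectors of a two-token phrase (2d values)."""
    first, second = tokens
    return word_vectors[first] + word_vectors[second]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_topic(
    embedding: Sequence[float], topic_embeddings: Sequence[Sequence[float]]
) -> int:
    """Index of the topic embedding most similar to the phrase embedding."""
    return max(
        range(len(topic_embeddings)),
        key=lambda k: cosine(embedding, topic_embeddings[k]),
    )


# Toy vectors with d = 2, so phrase and topic embeddings have 2d = 4 values.
toy_vectors = {"lubang": [1.0, 0.0], "kabel": [0.0, 1.0], "bahan": [0.0, 1.0], "herbal": [1.0, 0.0]}
toy_topics = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]

case_phrase = phrase_embedding(("lubang", "kabel"), toy_vectors)
herbal_phrase = phrase_embedding(("bahan", "herbal"), toy_vectors)
case_topic = closest_topic(case_phrase, toy_topics)
herbal_topic = closest_topic(herbal_phrase, toy_topics)
```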
The attention layer takes these phrase embeddings and calculates a weighted sum of the phrases, $z_s = \sum_{i=1}^{n} a_i e_{w_i}$, where $e_{w_i} \in \mathbb{R}^{1 \times 2d}$ is the embedding of the $i$-th input phrase and $a_i$ is the weight computed by the attention layer, based on both the relevance of the filtered phrase to the K aspects and its relevance to the whole sentence. The attention weights are learned during training, following He et al. (2017).
In the auto-encoder layer, the encoder compresses $z_s$ to a vector of probabilities $p_t = \mathrm{softmax}(W \cdot z_s + b)$, and the decoder reconstructs a sentence embedding $r_s = T^\top \cdot p_t$. Here $T \in \mathbb{R}^{K \times 2d}$ is the learned aspect embedding matrix, which is in the same embedding space as the phrase embeddings.
The loss function of the model is defined as $L(\theta) = J(\theta) + \lambda U(\theta)$, where $\theta$ represents the model parameters, $J(\theta)$ is proportional to the hinge loss between $r_s$ and $z_s$, and $U(\theta)$ is the regularization term, which encourages orthogonality among the rows of the aspect embedding matrix $T$.
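The forward pass and the two loss terms can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the trained model: the attention score (each phrase scored against the mean phrase embedding through a matrix `M`), the negative samples in the hinge loss, and the row-normalized form of the regularizer are taken from the formulation in He et al. (2017) rather than from the text above, and the choice of a Frobenius norm for the regularizer is an assumption.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def softmax(scores: Sequence[float]) -> Vector:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(
    phrases: Matrix, M: Matrix, W: Matrix, b: Vector, T: Matrix
) -> Tuple[Vector, Vector, Vector, Vector]:
    """phrases: n x 2d, M: 2d x 2d, W: K x 2d, b: K, T: K x 2d."""
    n, dim = len(phrases), len(phrases[0])
    # Sentence context: the mean of the phrase embeddings.
    y_s = [sum(e[j] for e in phrases) / n for j in range(dim)]
    # Attention weights: softmax over the scores e_i^T M y_s.
    m_y = matvec(M, y_s)
    a = softmax([dot(e, m_y) for e in phrases])
    # Weighted sum of the phrase embeddings.
    z_s = [sum(a[i] * phrases[i][j] for i in range(n)) for j in range(dim)]
    # Encoder: probabilities over the K topics.
    p_t = softmax([s + bias for s, bias in zip(matvec(W, z_s), b)])
    # Decoder: r_s = T^T p_t, a mixture of the topic embeddings.
    r_s = [sum(p_t[k] * T[k][j] for k in range(len(T))) for j in range(dim)]
    return a, z_s, p_t, r_s


def hinge_loss(r_s: Vector, z_s: Vector, negatives: Matrix) -> float:
    """Sum over negative samples n_i of max(0, 1 - r_s.z_s + r_s.n_i)."""
    return sum(max(0.0, 1.0 - dot(r_s, z_s) + dot(r_s, n_i)) for n_i in negatives)


def orthogonality_penalty(T: Matrix) -> float:
    """Frobenius norm of T_n T_n^T - I, where T_n has unit-length rows."""
    rows = []
    for row in T:
        norm = math.sqrt(dot(row, row)) or 1.0  # leave all-zero rows unscaled
        rows.append([x / norm for x in row])
    total = 0.0
    for i, u in enumerate(rows):
        for j, v in enumerate(rows):
            total += (dot(u, v) - (1.0 if i == j else 0.0)) ** 2
    return math.sqrt(total)


# Toy example with d = 2 (so 2d = 4), n = 2 phrases and K = 2 topics.
toy_phrases = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
toy_W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
toy_T = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
a, z_s, p_t, r_s = forward(toy_phrases, identity, toy_W, [0.0, 0.0], toy_T)
loss = hinge_loss(r_s, z_s, [[0.0, 0.0, 1.0, 1.0]]) + 1.0 * orthogonality_penalty(toy_T)
```

The last line combines the two terms as $J + \lambda U$ with $\lambda = 1$, a value chosen only for the toy example.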

Category-specific Topic Filtering
Category-specific topics are unique to each product category and not generic. To sift out the general topics from all the generated topics, we use a supervised method.
As the generic topics are similar across all product categories, we built a general word list containing the words that are frequent in general phrases. Examples from the general word list are berfungsi, semoga, bonus, sis, kwalitas, oke, super, boss (in English: function, hopefully, bonus, sis, quality, okay, super, boss). A phrase is considered a general phrase if both of its words are in the general word list. If more than a certain percentage η of the phrases in a topic are general phrases, the topic is considered a general topic; otherwise it is a generated category-specific topic, which is used in the next step.
After supervised filtering, manual labeling is applied to each phrase in the generated category-specific topics. Since we have already applied topic extraction and supervised filtering, the number of phrases to be manually labeled is reduced dramatically. We label each phrase as generic, incoherent or category-specific. Generic phrases are those about general aspects, including delivery, fit to description, packing quality, customer service and price. General descriptions of product quality, such as produk bagus (good product), are also generic phrases, since they can be used to describe products from most other categories as well. Incoherent phrases are those that are not about the same concept as the majority of the other phrases in the same topic. Category-specific phrases are those about the category-specific aspects of the category that are coherent with the majority of the phrases in the same topic.
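The word-list filter can be sketched as follows. The function names and toy data are illustrative, and treating an empty topic as not general is an assumption.

```python
from typing import Iterable, Set, Tuple

Phrase = Tuple[str, str]


def is_general_phrase(phrase: Phrase, general_words: Set[str]) -> bool:
    """A phrase is general when both of its words are in the general word list."""
    return all(token in general_words for token in phrase)


def is_general_topic(
    phrases: Iterable[Phrase], general_words: Set[str], eta: float
) -> bool:
    """A topic is general when more than a fraction eta of its phrases are general."""
    phrase_list = list(phrases)
    if not phrase_list:
        return False  # assumption: an empty topic is not flagged as general
    general = sum(is_general_phrase(p, general_words) for p in phrase_list)
    return general / len(phrase_list) > eta


toy_general_words = {"oke", "super", "bonus", "boss"}
topic_a = [("oke", "super"), ("bonus", "boss"), ("lubang", "kabel")]
topic_b = [("lubang", "kabel"), ("oke", "kabel"), ("super", "oke")]

topic_a_is_general = is_general_topic(topic_a, toy_general_words, eta=0.4)
topic_b_is_general = is_general_topic(topic_b, toy_general_words, eta=0.4)
```

With η = 0.4, `topic_a` (two of three phrases general) is filtered out as a general topic, while `topic_b` (one of three) is kept as a generated category-specific topic.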
The category-specific phrases in each topic will be used for tag generation as will be described in Section 3.4. And the frequent words in the generic phrases will be added to the general word list for use in supervised filtering of future topics.

Category-specific Tag Generation
With the filtered category-specific topics, we generate the category-specific tags.
For each product, we group the review phrases into their corresponding topics as discussed in Section 3.1 and Section 3.2. We use the supervised method described in Section 3.3 to filter category-specific topics from all the generated topics. Then, we rank the phrases in each topic by their frequency in the reviews of this product and choose the highest-ranked phrase as the tag of this topic for this product. The results are uploaded to a data warehouse.
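A minimal sketch of the tag selection for a single product is given below. The phrase-to-topic mapping and the set of category-specific topic ids are assumed to come from the earlier steps; the names and toy data are illustrative, and ties are broken in favour of the phrase seen first, which is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Set


def generate_tags(
    product_phrases: Iterable[str],
    phrase_to_topic: Mapping[str, int],
    specific_topics: Set[int],
) -> Dict[int, str]:
    """Pick, per category-specific topic, the product's most frequent phrase."""
    counts = Counter(product_phrases)
    best: Dict[int, tuple] = {}
    for phrase, count in counts.items():
        topic = phrase_to_topic.get(phrase)
        if topic not in specific_topics:
            continue  # unknown phrases and general topics produce no tag
        # Strict comparison keeps the first-seen phrase when counts are tied.
        if topic not in best or count > best[topic][1]:
            best[topic] = (phrase, count)
    return {topic: phrase for topic, (phrase, _) in best.items()}


# Toy data: phrases extracted from the reviews of one product.
toy_product_phrases = [
    "lubang kabel", "lubang kabel", "lubang charger", "produk bagus", "bahan tebal",
]
toy_phrase_to_topic = {
    "lubang kabel": 0, "lubang charger": 0, "produk bagus": 1, "bahan tebal": 2,
}
tags = generate_tags(toy_product_phrases, toy_phrase_to_topic, specific_topics={0, 2})
```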

Experimental Setup
In this section, we apply our proposed method to product reviews from Tokopedia. We demonstrate the experimental results, and show the evaluation results of the generated category-specific topics.

Data
We use reviews from 89.5 million products across 18 product categories as the dataset. The average number of reviews in each category and the average string length of the reviews are shown in Table 3 (columns: "#reviews" and "average length").

Model Result
After phrase extraction, we applied the UAE model for topic extraction. We performed the same preprocessing as He et al. (2017) and used word2vec to train the word embeddings with dimension d = 200. We modified the model structure to accept phrase input as described in Section 3.2, and we used the same parameter settings as He et al. (2017). We apply our method to each category separately, and we set the number of topics to K = 14 for topic generation. Then, we apply the category-specific filter to the extracted topics for all categories with η = 40%. The general word list we used contains 127 words. The average time to obtain the generated category-specific topics from the extracted phrases is 2 hours per category, with around 0.5M reviews. On average, we generate 5 category-specific topics for each category. We show the number of generated category-specific topics for each category in Table 3 (column: "#topics"). We show some of these generated category-specific topics in Table 4.

Evaluation
The most essential part of this work is the automatic generation of category-specific topics. In this section, we show the evaluation results for the quality of the category-specific topic generation.

Evaluation Metric
An internal product team labeled the results from supervised filtering. They labeled each phrase as category-specific, general or incoherent, as described in Section 3.3. On average, it took one person 3 minutes to label all the phrases of one topic. We apply the evaluation metrics used in He et al. (2017) and Chen et al. (2014). Following their setting, we compute the score precision@n (p@n) for each generated category-specific topic as the number of category-specific phrases among the top n phrases. We show the average p@100 for sample categories in Table 3 (column: "average p@100"). From the result, we can see that the majority of the phrases in the generated topics are category-specific in meaning. We define any topic with p@n > 60 as a category-specific topic, and we define the topic rate as
$$\text{topic rate} = \frac{\#\,\text{category-specific topics}}{\#\,\text{generated category-specific topics}}$$
We show the topic rate for selected categories in Table 3 (column: "topic rate"). We can see that more than half of the generated category-specific topics are selected after manual filtering; thus, human labeling is very efficient on the automatically generated category-specific topics.
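The two metrics can be sketched as follows, assuming each topic is represented by the manual labels of its phrases, already ordered by rank. The label strings and toy topics are illustrative; the threshold of 60 and n = 100 follow the text above.

```python
from typing import List, Sequence


def precision_at_n(labels: Sequence[str], n: int = 100) -> int:
    """Number of category-specific phrases among the top n labeled phrases."""
    return sum(1 for label in labels[:n] if label == "category-specific")


def topic_rate(
    topic_labels: Sequence[Sequence[str]], n: int = 100, threshold: int = 60
) -> float:
    """Share of generated topics whose p@n exceeds the threshold."""
    if not topic_labels:
        return 0.0
    kept = sum(1 for labels in topic_labels if precision_at_n(labels, n) > threshold)
    return kept / len(topic_labels)


# Toy labels for two generated topics of 100 phrases each.
topic_one: List[str] = ["category-specific"] * 70 + ["general"] * 30
topic_two: List[str] = ["category-specific"] * 60 + ["incoherent"] * 40

p_one = precision_at_n(topic_one)
p_two = precision_at_n(topic_two)
rate = topic_rate([topic_one, topic_two])
```

Because the threshold is strict, the second toy topic, with exactly 60 category-specific phrases, is not counted as a category-specific topic.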

Conclusion
In this paper, we described a pipeline for category-specific review tagging using phrase extraction, topic generation, category-specific topic filtering and tag generation. Given the product reviews, the pipeline generates category-specific tags for each product, and customers can filter product reviews with these tags. The pipeline is being implemented on product reviews at Tokopedia, and has proved successful when scaled to a large number of reviews. We also evaluated the quality of the generated category-specific topics with manual labeling, and the results show that the pipeline can generate coherent category-specific topics.