Label-Guided Learning for Item Categorization in e-Commerce

Item categorization is an important application of text classification in e-commerce due to its impact on the online shopping experience of users. One class of text classification techniques that has gained attention recently uses the semantic information of the labels to guide the classification task. We conducted a systematic investigation of the potential benefits of these methods on a real data set from a major e-commerce company in Japan and found that label-guided learning can improve classification performance. Furthermore, using a hyperbolic space to embed product labels that are organized in a hierarchical structure led to better performance than a conventional Euclidean space embedding. These findings demonstrate how label-guided learning can improve item categorization systems in the e-commerce domain.


Introduction
Natural language processing (NLP) techniques have been applied extensively to solve modern e-commerce challenges (Malmasi et al., 2020; Zhao et al., 2020). One major NLP challenge in e-commerce is item categorization (IC), which refers to classifying a product, based on textual information such as the product title, into one of numerous categories in the product category taxonomy tree of online stores. Although significant progress has been made in the area of text classification, many standard open-source data sets have limited numbers of classes, which is not representative of industry data where there can be hundreds or even thousands of classes (Li and Roth, 2002; Pang and Lee, 2004; Socher et al., 2013). To cope with the large number of products and the complexity of the category taxonomy, an automated IC system is needed, and its prediction quality needs to be high enough to provide positive shopping experiences for customers and consequently drive sales.

Figure 1 shows an example diagram of the product category taxonomy tree for the IC task. In this example, a tin of Japanese tea needs to be classified into the leaf-level category label "Japanese tea."

As reviewed in Section 2, significant progress has been made on IC as a deep learning text classification task. However, much of this progress does not make use of the semantic information contained in the labels. Recently there has been increasing interest in taking advantage of the semantic information in labels to improve text classification performance (Wang et al., 2018; Liu et al., 2020; Du et al., 2019; Xiao et al., 2019; Chai et al., 2020). For the IC task, the labels in a product taxonomy tree are actively maintained by human experts and carry rich semantic information. For example, descriptive genre names like "clothes" and "electronics" are used rather than just a numeric index for each class label. It is reasonable to surmise that leveraging the semantics of these category labels will improve IC models.
Although label-guided learning has been shown to improve classification performance on several standard text classification data sets, its application to IC on real industry data has been missing thus far. Compared to standard data sets, e-commerce data typically contain more complicated label taxonomy tree structures, and product titles tend to be short and do not use standard grammar. Therefore, whether label-guided learning can help IC in industry is an open question worth investigating.
In this paper, we describe our investigation of applying label-guided learning to the IC task. Using real data from Rakuten, we tested two models: the Label Embedding Attentive Model (LEAM) (Wang et al., 2018) and the Label-Specific Attention Network (LSAN) (Xiao et al., 2019). In addition, to cope with the challenge that labels in an IC task tend to be similar to each other within one product genre, we utilized label embedding methods that better distinguish similar labels, which led to performance gains. This included testing hyperbolic embeddings, which can take into account the hierarchical nature of the taxonomy tree (Nickel and Kiela, 2017).
The paper is organized as follows: Section 2 reviews related research on IC using deep learning-based NLP and the emerging techniques of label-guided learning. Section 3 introduces the two label-guided learning models we examined, namely LEAM and LSAN, as well as hyperbolic embedding. Section 4 describes experimental results on a large-scale data set from a major e-commerce company in Japan. Section 5 summarizes our findings and discusses future research directions.

Related Work
Deep learning-based methods have been widely used for the IC task. Examples include deep neural network models arranged in a hierarchical classifier structure, which showed improved performance over conventional machine learning models (Cevahir and Murakami, 2016), and an attention mechanism that identifies words semantically highly correlated with the predicted categories, thereby providing improved feature representations and higher classification performance (Xia et al., 2017).
Recently, using the semantic information carried by label names has received increasing attention in text classification research, and LEAM (Wang et al., 2018) is one of the earliest efforts in this direction that we are aware of. It uses a joint embedding of both words and class labels to obtain label-specific attention weights that modify the input features. On a set of benchmark text classification data sets, LEAM showed superior performance over models that did not use label semantics. An extension of LEAM called LguidedLearn (Liu et al., 2020) made further modifications by (a) encoding word inputs first and then using the encoded outputs to compute label attention weights, and (b) using a multi-head attention mechanism (Vaswani et al., 2017) to give the attention-weighting mechanism more representational power. In a related model, LSAN (Xiao et al., 2019) added a label-specific attention branch in addition to a self-attention branch and showed superior performance over models that did not use label semantics on a set of multi-label text classification tasks.
However, label names by themselves may not provide sufficient semantic information for accurate text classification. To address this potential shortcoming, longer text can be generated from class labels to augment the original text input. Text generation methods such as templates and reinforcement learning have been compared, and their effectiveness was evaluated using the BERT model (Devlin et al., 2019) with both the text sentence and the label description as input (Chai et al., 2020).
Finally, word embeddings such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) are generated in Euclidean space. However, hyperbolic embeddings, which live in a non-Euclidean space, have been developed (Nickel and Kiela, 2017; Chen et al., 2020a,b) and have been shown to better represent hierarchical relationships among labels.

Model
For a product title $X$ consisting of $L$ words, $X = [w_1, \ldots, w_L]$, our goal is to predict one out of a set of $K$ labels, $y \in \mathcal{C} = \{c_1, \ldots, c_K\}$. In a neural network-based model, the mapping $X \to y$ generally consists of the following steps: (a) an encoding step ($f_0$) that converts $X$ into a numeric tensor representing the input, (b) a representation step ($f_1$) that processes the input tensor into a fixed-dimension feature vector $z$, and (c) a classification step ($f_2$) that maps $z$ to $y$ using a feed-forward layer.
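As an illustration of this decomposition, here is a minimal NumPy sketch with a trivial mean-pooling representation step; all parameters are random and the names are invented for illustration, and the real models replace $f_1$ with the attention mechanisms described below.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D, K, L = 1000, 100, 32, 60           # vocabulary, embedding dim, labels, title length
E = rng.normal(size=(VOCAB, D))              # word embedding table used by f_0
W, b = rng.normal(size=(D, K)), np.zeros(K)  # feed-forward classifier used by f_2

def f0(word_ids):          # encoding step: token ids -> (L, D) tensor
    return E[word_ids]

def f1(x):                 # representation step: here, simple mean pooling
    return x.mean(axis=0)  # fixed-dimension feature vector z in R^D

def f2(z):                 # classification step: feed-forward layer + softmax
    logits = z @ W + b
    p = np.exp(logits - logits.max())
    return p / p.sum()     # distribution over the K labels

title = rng.integers(0, VOCAB, size=L)       # a toy "product title"
y_hat = int(np.argmax(f2(f1(f0(title)))))    # predicted label index
```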
Among label-guided learning models, we chose both LEAM (Wang et al., 2018) and LSAN (Xiao et al., 2019). Table 1 shows a comparison between these models.

Table 1: Comparison between LEAM and LSAN.

Step   LEAM                            LSAN
f_0    Word embedding                  Word embedding + Bi-LSTM encoding
f_1    Label-specific attention only   Both self- and label-specific attention + adaptive interpolation
f_2    Softmax with CE loss            Softmax with CE loss

LEAM
The LEAM architecture is shown in Figure 2 (Wang et al., 2018). First, a product title of length $L$ is encoded via word embedding as $V = [v_1, \ldots, v_L]$, where $v_l \in \mathbb{R}^D$ and $V \in \mathbb{R}^{D \times L}$. The class labels are also encoded via label embedding as $C = [c_1, \ldots, c_K]$, where $K$ is the total number of labels, $c_k \in \mathbb{R}^D$, and $C \in \mathbb{R}^{D \times K}$. The label embeddings are title-independent and are the same across all titles for a given set of labels. We can then compute the compatibility of each word-label pair based on their cosine similarity to obtain a compatibility tensor $G \in \mathbb{R}^{L \times K}$. The compatibility tensor is transformed into an attention vector through the following steps: (a) apply a 1D convolution to refine the compatibility scores by considering their context, (b) apply max pooling over labels to keep the maximum score for each word, and (c) apply a softmax operation over words to obtain the label-guided attention weights $\beta \in \mathbb{R}^L$. These attention weights, which carry the label semantic information, are used in the $f_1$ step to compute a new representation

$$z = \sum_{l=1}^{L} \beta_l v_l.$$

After obtaining $z$, we use a fully-connected layer with softmax to predict $y \in \mathcal{C}$. The entire process $f_2(f_1(f_0(X)))$ is optimized by minimizing the cross-entropy loss between $y$ and $f_2(z)$.
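A minimal NumPy sketch of this label-guided attention follows. It is illustrative only: the real model learns the 1D convolution filter, whereas a fixed uniform smoothing filter stands in for it here.

```python
import numpy as np

def leam_attention(V, C, window=5):
    """LEAM-style label-guided attention (illustrative sketch).

    V: (L, D) word embeddings of a title, one word per row.
    C: (K, D) label embeddings.
    Returns z in R^D, the attention-weighted title representation.
    """
    Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-8)
    Cn = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-8)
    G = Vn @ Cn.T                                   # (L, K) cosine compatibility
    # (a) 1D convolution along the word axis; a uniform filter stands in
    # for the learned filter of the actual model
    kernel = np.ones(window) / window
    G_ctx = np.stack([np.convolve(G[:, k], kernel, mode="same")
                      for k in range(G.shape[1])], axis=1)
    m = G_ctx.max(axis=1)                           # (b) max-pool over labels
    beta = np.exp(m - m.max())                      # (c) softmax over words
    beta /= beta.sum()
    return beta @ V                                 # z = sum_l beta_l * v_l
```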

LSAN
The LSAN architecture is shown in Figure 3 (Xiao et al., 2019). As shown in Table 1, LSAN makes a few modifications compared to LEAM. First, a bi-directional long short-term memory (Bi-LSTM) encoder is used to better capture contextual semantic cues in the representation. The resulting contextual encoding is

$$H = [\overrightarrow{H}; \overleftarrow{H}],$$

where $\overrightarrow{H}$ and $\overleftarrow{H}$ represent the LSTM encoding outputs from the forward and backward directions, and $H \in \mathbb{R}^{L \times 2P}$, where $P$ is the dimension of the LSTM hidden state. For model consistency we typically set $P = D$.
Additional features of LSAN which extend LEAM include (a) using self-attention on the encoding $H$, (b) creating a label-attention component from $H$ and $C$, and (c) adaptively merging the self- and label-attention components.
More specifically, the self-attention score matrix $A^{(s)}$ is determined as

$$A^{(s)} = \mathrm{softmax}\left(W_2 \tanh\left(W_1 H^\top\right)\right),$$

where $W_1 \in \mathbb{R}^{d_a \times 2P}$ and $W_2 \in \mathbb{R}^{K \times d_a}$ are self-attention tensors to be trained, $d_a$ is a hyperparameter, $A^{(s)} \in \mathbb{R}^{K \times L}$, and each row $A^{(s)}_{j\cdot}$ is an $L$-dimensional vector representing the contributions of all $L$ words to label $j$. Therefore,

$$M^{(s)} = A^{(s)} H$$

is a representation of the input text weighted by self-attention, where $M^{(s)} \in \mathbb{R}^{K \times 2P}$.

From the title encoding $H$ and the label embedding $C$, compatibility scores between class labels and title words can be computed as the product

$$A^{(l)} = C^\top H^\top,$$

where $A^{(l)} \in \mathbb{R}^{K \times L}$ and each row $A^{(l)}_{j\cdot}$ is an $L$-dimensional vector representing the contributions of all $L$ words to label $j$ (here the label embedding dimension is matched to that of $H$ so that the product is well defined). The product title can then be represented using label attention as

$$M^{(l)} = A^{(l)} H,$$

where $M^{(l)} \in \mathbb{R}^{K \times 2P}$. The last procedure in the $f_1$ step of LSAN is to adaptively combine the self- and label-attention representations $M^{(s)}$ and $M^{(l)}$ as

$$M_{j\cdot} = \alpha_j M^{(s)}_{j\cdot} + \beta_j M^{(l)}_{j\cdot},$$

where the two interpolation weight factors $\alpha, \beta \in \mathbb{R}^K$ are computed as

$$\alpha_j = \sigma\left(M^{(s)}_{j\cdot} W_3\right), \qquad \beta_j = \sigma\left(M^{(l)}_{j\cdot} W_4\right),$$

normalized under the constraint $\alpha_j + \beta_j = 1$. Here $W_3, W_4 \in \mathbb{R}^{2P}$ are trainable parameters, $\sigma(x) \equiv 1/(1+e^{-x})$ is the element-wise sigmoid function, and $M \in \mathbb{R}^{K \times 2P}$.
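The following NumPy sketch ties these equations together. Whether the label-attention scores $A^{(l)}$ are softmax-normalized varies by implementation; we normalize here for symmetry with the self-attention branch, and the label embedding C is assumed dimension-matched to H.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lsan_f1(H, C, W1, W2, W3, W4):
    """LSAN f_1 step (sketch). H: (L, 2P) Bi-LSTM encoding;
    C: (2P, K) label embeddings, dimension-matched to H;
    W1: (da, 2P); W2: (K, da); W3, W4: (2P,)."""
    A_s = softmax(W2 @ np.tanh(W1 @ H.T))   # (K, L) self-attention scores
    M_s = A_s @ H                           # (K, 2P) self-attended text
    A_l = softmax(C.T @ H.T)                # (K, L) word-label compatibility
    M_l = A_l @ H                           # (K, 2P) label-attended text
    a, b = sigmoid(M_s @ W3), sigmoid(M_l @ W4)
    alpha, beta = a / (a + b), b / (a + b)  # normalized so alpha_j + beta_j = 1
    return alpha[:, None] * M_s + beta[:, None] * M_l  # M: (K, 2P)
```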
Although the original LSAN model proposed multiple additional layers in its $f_2$ step, in our implementation we performed average pooling along the label dimension and then applied a fully-connected layer with softmax output, similar to LEAM. Finally, the cross-entropy loss is minimized.
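A sketch of this simplified $f_2$ step, with hypothetical trainable parameters W_out and b_out:

```python
import numpy as np

def lsan_f2(M, W_out, b_out):
    """Average-pool M (K, 2P) over the label dimension, then apply a
    fully-connected layer with softmax over the K classes."""
    z = M.mean(axis=0)                # (2P,) pooled representation
    logits = z @ W_out + b_out        # W_out: (2P, K); b_out: (K,)
    p = np.exp(logits - logits.max())
    return p / p.sum()                # predicted class distribution
```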

Hyperbolic Embedding
In e-commerce item categorization, we tend to use a more complicated label structure than standard text classification data sets, with a large number of labels organized as a taxonomy tree. One immediate issue is that hundreds of labels can exist at the leaf level, some with very similar names like "Japanese tea" and "Chinese tea," and the difference between their label embedding vectors in Euclidean space can be too small to be distinguished by machine learning models. Such issues become more severe as the taxonomy tree gets deeper. Hyperbolic embedding is one technique that has been developed to address these issues (Nickel and Kiela, 2017; Chen et al., 2020a,b).
Hyperbolic space differs from Euclidean space by having a negative curvature. Consequently, the circumference and area of a circle grow exponentially with its radius, whereas in Euclidean space they grow only linearly and quadratically, respectively. For representing hierarchical structures like trees, this property ensures that leaf nodes, which lie closer to the boundary of the ball, can maintain sufficiently large distances from each other.
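To make this concrete, in the hyperbolic plane of constant curvature $-1$ (a standard geometric fact, not specific to our data), a circle of radius $r$ has

$$C(r) = 2\pi \sinh r \approx \pi e^{r}, \qquad A(r) = 2\pi(\cosh r - 1) \approx \pi e^{r} \quad (r \gg 1),$$

compared to $C(r) = 2\pi r$ and $A(r) = \pi r^2$ in the Euclidean plane, so the room available for separating points grows exponentially rather than polynomially with radius.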
As a specific application, Poincaré embedding uses the Poincaré ball model, which consists of points within the unit ball $\mathcal{B}^d$, where the distance between two points $u, v \in \mathcal{B}^d$ is defined as

$$d(u, v) = \operatorname{arcosh}\left(1 + 2\,\frac{\lVert u - v \rVert^2}{(1 - \lVert u \rVert^2)(1 - \lVert v \rVert^2)}\right). \tag{10}$$

The Poincaré embedding is obtained by minimizing a loss function depending only on $d(u, v)$ for all pairs of labels $(u, v)$ using Riemannian optimization methods. Figure 4 illustrates the differences between using a Euclidean space and a Poincaré ball model when representing nodes organized in a tree. Using a hyperbolic embedding has the potential to maintain large enough distances when our models aim to distinguish subtle differences among these labels.
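A direct NumPy implementation of Eq. 10, with a small epsilon added for numerical safety (our addition, not part of the original definition):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Distance between points u, v inside the unit ball B^d (Eq. 10)."""
    uu, vv = np.dot(u, u), np.dot(v, v)   # squared norms, < 1 inside the ball
    duv = np.dot(u - v, u - v)            # squared Euclidean distance
    arg = 1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv) + eps)
    return np.arccosh(arg)
```

Note that as either point approaches the boundary of the ball, the denominator shrinks and distances blow up, which is exactly the property that keeps leaf labels well separated.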

Experimental Setup
Data set: Our data set consisted of more than one million products in aggregate from Rakuten, a large e-commerce platform in Japan, focusing on four major product categories which we call root-level genres. Our task, a multi-class classification problem, was to predict the leaf-level product categories from their Japanese titles. Further details of our data set are shown in Table 2.
Evaluation metric: We used the macro-averaged F-score $F$ to evaluate model performance. This is defined in terms of the per-class F-score $F_k$ as

$$F = \frac{1}{K} \sum_{k=1}^{K} F_k, \qquad F_k = \frac{2 P_k R_k}{P_k + R_k},$$

where $K$ is the total number of classes, and $P_k$ and $R_k$ are the precision and recall for class $k$.
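A straightforward implementation of this metric (equivalent to scikit-learn's f1_score with average="macro"):

```python
import numpy as np

def macro_f_score(y_true, y_pred, K):
    """Macro-averaged F: the unweighted mean of per-class F_k."""
    f_scores = []
    for k in range(K):
        tp = np.sum((y_pred == k) & (y_true == k))
        p_k = tp / max(np.sum(y_pred == k), 1)  # precision P_k
        r_k = tp / max(np.sum(y_true == k), 1)  # recall R_k
        f_scores.append(0.0 if p_k + r_k == 0 else 2 * p_k * r_k / (p_k + r_k))
    return np.mean(f_scores)
```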
Pre-trained embedding methods: We tested the following three methods:

• All genre: Word embeddings pre-trained on all of the data across different root-level genres. For the label embedding, the average of the word embeddings of all word tokens in a label is used to initialize the label embedding $C$, which is then further updated during model training.
• Genre specific: Word embeddings pre-trained on data specific to each root-level genre; label embeddings were obtained in the same way as for the all-genre method.
• Poincaré: Label embeddings pre-trained on the Poincaré ball, taking into account the full hierarchical taxonomy tree.
Models: We compared a number of variants of LEAM and LSAN as described below.
• LSAN-Poincaré: LSAN using genre-specific pre-trained word embeddings for the titles and pre-trained Poincaré embeddings for the labels.
Experimental parameters: Our models were implemented in TensorFlow 2.3 using a GPU for training and evaluation. Since Japanese text does not have spaces to indicate individual words, tokenization was performed with MeCab, an open-source Japanese part-of-speech and morphological analyzer based on conditional random fields (CRF) (https://taku910.github.io/mecab/). Once the text was tokenized, we fixed our input length to $L = 60$ words by truncating the title if it was longer than $L$ and zero-padding the title if it was shorter than $L$. If a word appeared fewer than three times, it was discarded and replaced with an out-of-vocabulary token. Pre-trained word embeddings of dimension $D = 100$ were obtained from product titles alone with fastText, which uses a skip-gram model with bag-of-character n-grams (Bojanowski et al., 2016). No external pre-trained embeddings were used. After initialization of word and label embeddings with pre-trained values, they were jointly trained with the remaining parameters of the model.

Table 2: Data set statistics.

Root genre               Class size  Train size  Dev size  Test size  Mean words/title
Catalog Gifts & Tickets  29          11,369      1,281     559        31
Beverages                32          205,107     22,805    10,315     21
Appliances               286         399,584     44,529    18,478     20
Men's Fashion            71          593,126     65,939    43,243     23
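A sketch of this preprocessing pipeline, assuming the mecab-python3 and gensim packages; the sample titles are invented for illustration:

```python
import MeCab                       # mecab-python3 binding for the MeCab tokenizer
from gensim.models import FastText

tagger = MeCab.Tagger("-Owakati")  # "-Owakati" outputs space-separated tokens

def tokenize(title, max_len=60):
    tokens = tagger.parse(title).split()
    tokens = tokens[:max_len]                            # truncate to L = 60
    return tokens + ["<pad>"] * (max_len - len(tokens))  # pad short titles

titles = ["静岡産 緑茶 100g 缶入り", "メンズ ジャケット 秋冬 防寒"]  # toy examples
corpus = [tokenize(t) for t in titles]

# Skip-gram fastText embeddings (D = 100) trained on titles only;
# the paper's setting is min_count=3, relaxed here so the toy corpus trains.
ft = FastText(corpus, vector_size=100, sg=1, min_count=1)
vec = ft.wv[corpus[0][0]]  # 100-dim vector for the first token of the first title
```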
For Poincaré embedding of labels, we used an embedding dimension of 300. Pre-trained Poincaré embeddings of labels were obtained by representing the genre taxonomy tree as (child, parent) pairs and minimizing a loss function which depends only on inter-genre distances as defined in Eq. 10 (Nickel and Kiela, 2017). These pre-trained Poincaré label embeddings were used to initialize the label embeddings in LSAN but during training were allowed to vary according to the standard loss optimization process in Euclidean space.
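One way to obtain such pre-trained embeddings is gensim's PoincareModel, which accepts exactly this kind of relation list; the taxonomy fragment below is invented for illustration:

```python
from gensim.models.poincare import PoincareModel

# (child, parent) relations from a hypothetical taxonomy fragment
relations = [
    ("Japanese tea", "Tea"), ("Chinese tea", "Tea"), ("Herbal tea", "Tea"),
    ("Tea", "Beverages"), ("Coffee", "Beverages"), ("Water", "Beverages"),
]
model = PoincareModel(relations, size=300, negative=2)  # 300-dim ball, as in our setup
model.train(epochs=50)
label_vec = model.kv["Japanese tea"]  # embedding vector inside the unit ball
```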
For LEAM, we used a 1D convolution window size of 5. For LSAN, we set $d_a = 50$, and when we experimented with the Poincaré embedding we set the LSTM hidden state dimension $P = 300$ to match the Poincaré embedding dimension.
The models were trained by minimizing the cross-entropy loss function using the Adam optimizer with an initial learning rate of 0.001 (Kingma and Ba, 2015). We used early stopping with a patience of 10 to obtain the final models.

Results and Discussions
Impact of label attention: We examined the impact of label attention by comparing performance without and with label attention for LEAM and LSAN on each of the four root-level genres, using all-genre pre-trained word embeddings. The results are shown in Table 3. For LEAM, we do not observe consistent improvements from including the label attention component, contrary to what was previously reported on standard text classification data sets (Wang et al., 2018). For LSAN, on the other hand, we do observe consistent improvements across all root-level genres when the label attention component is included. Since label attention did not consistently help LEAM, for the remainder of this section we focus on variations of LSAN.
Impact of different pre-trained embeddings: We next evaluated the impact of using different pre-trained embeddings for the titles as well as the labels on each of the four root-level genres. This is shown in Table 4. We observed that the choice of pre-trained embeddings has a consistent and significant effect on model performance. In particular, genre-specific embeddings outperformed all-genre embeddings for all genres. This is particularly notable for the smallest genre, given that the all-genre embeddings were trained on more than ten times as much data.
We believe this is because words that occur in the same root-level genre tend to be embedded close to each other in the full embedding space, which makes it more difficult for the label attention to distinguish between different but similar labels such as "Japanese tea" and "Chinese tea." By using pre-trained embeddings obtained from specific genres, the embeddings become spaced farther apart, and the label attention is therefore better able to distinguish labels with similar names. Poincaré embeddings take this further by requiring the embedding-space distances between all leaf-genre labels to be large, and our results show that this leads to the best model performance. This supports our hypothesis that the distance between labels in the label embedding space is an important factor in ensuring that label attention improves model performance.
Compared to models using only the product titles, we see that models using label-guided learning can significantly improve the F-score. In particular, LSAN using a Poincaré label embedding shows the following F-score increases compared to LSAN-base: 19.7% for "Catalog Gifts & Tickets," 3.0% for "Beverages," 3.4% for "Appliances," and 3.7% for "Men's Fashion." Note that the largest increase was achieved on the genre with the fewest training instances.

Conclusions
Since 2018, there has been increasing interest in the NLP field in using the semantic information of class labels to further improve text classification performance. For the item categorization task in e-commerce, a taxonomy organized in a hierarchical structure already contains rich meaning and provides an ideal opportunity to evaluate the impact of label-guided learning. In this paper, we used real industry data from Rakuten, a leading Japanese e-commerce platform, to evaluate the benefits of label-guided learning.
Our experiments showed that LSAN is superior to LEAM because of its use of context encoding and its adaptive combination of both self- and label-attention. We also found that using genre-specific pre-trained embeddings led to better model performance than pre-trained embeddings obtained from all product genres. This is likely because pre-training on specific genres allows the embeddings to focus on differences between similar genres, which the label embeddings can take advantage of. Finally, we showed that using hyperbolic embedding, more specifically Poincaré embedding, can improve model performance further by ensuring that all class labels are sufficiently separated for label-guided learning to work well.
One possible limitation of our current work is that although the label embedding is initialized with a hyperbolic embedding, the rest of the training proceeds in Euclidean space. Future work could explore training the entire model in hyperbolic space. Another direction is to incorporate the label-attention mechanism into the BERT model (Devlin et al., 2019), which has proven to be a powerful approach to text encoding. In addition, more advanced approaches to representing labels, beyond our current use of the word tokens in label names, could be explored.