Recognizing Salient Entities in Shopping Queries

Over the past decade, e-commerce has grown rapidly, enabling customers to purchase products with the click of a button. To support this, a system has to understand the semantics of a user query and identify that in "digital lifestyle tv", "digital lifestyle" is a brand and "tv" is a product. In this paper, we develop a series of structured prediction algorithms for semantic tagging of shopping queries with the product, brand, model and product family types. We model a wide variety of features and show an alternative way to capture knowledge base information using embeddings. We conduct an extensive study over 37,000 manually annotated queries and report a performance of 90.92 F1, independent of the query length.


Introduction
A recent study shows that yearly e-commerce sales in the U.S. top 100 billion (Fulgoni, 2014). This has substantially increased interest in building semantic taggers that can accurately recognize product, brand, model and product family types in shopping queries, in order to better understand and match the needs of online shoppers.
Despite the need for semantic understanding, the most widely used approaches to product retrieval categorize the query and the offer (Kozareva, 2015) into a shopping taxonomy and use the predicted category as a proxy for retrieving the relevant products. Unfortunately, this procedure falls short and leads to inaccurate product retrieval. Recent efforts (Manshadi and Li, 2009; Li, 2010) focused on building CRF taggers that recognize basic entity types in shopping queries, such as brands, types and models. Li (2010) conducted a study over 4,000 shopping queries and showed promising results when huge knowledge bases are present. Paşca and Van Durme (2008), Kozareva et al. (2008) and Kozareva and Hovy (2010) focused on using Hearst patterns (Hearst, 1992) to learn semantic lexicons. While such methods are promising, they cannot be used to recognize all product entities in a query. In parallel to the semantic query understanding task, there have been semantic tagging efforts on the product offer side. Putthividhya and Hu (2011) recognize brand, size and color entities in eBay product offers, while Kannan et al. (2011) recognize similar fields in Bing product catalogs.
Despite these efforts, three important questions remain unanswered, and we address them in our work. (1) What is an alternative method when product knowledge bases are not present? (2) Is the performance of the semantic taggers agnostic to the query length? (3) Can we minimize manual feature engineering for shopping query log tagging using neural networks?
The main contributions of the paper are:
• Building a semantic tagging framework for shopping queries.
• Compensating for missing knowledge base entries through word embeddings learned on a large amount of unlabeled query logs.
• Annotating 37,000 shopping queries with product, brand, model and product family entity types.
• Conducting a comparative and efficiency study of multiple structured prediction algorithms and settings.
• Showing that long short-term memory networks reach the best performance of 90.92 F1 and are agnostic to query length.
Problem Formulation and Modeling

Task Definition
We define our task as follows: given a shopping query, identify and classify all segments that are a product, brand, product family or model, where:
- Product is a generic term (or terms) for goods not specific to a particular manufacturer (e.g. shirts).
- Brand is the actual name of the product manufacturer (e.g. Calvin Klein).
- Product Family is a brand-specific grouping of products sharing the same product (e.g. Samsung Galaxy).
- Model is used by the manufacturer to distinguish variations (e.g. the brand Lexus has the IS product family, which has the models 200t and 300 F Sport).
For modeling, we denote by $T = \{\perp, t_1, t_2, \ldots, t_K\}$ the whole label space, where $\perp$ indicates a word that is not part of an entity and $t_i$ stands for an entity category. The tagging models have to recognize the types product, brand, model, product family and $\perp$ (other) using the BIO schema (Tjong Kim Sang, 2002).
We denote by $x = (x_1, x_2, \ldots, x_M)$ a shopping query of length $M$. The objective is to find the best configuration $\hat{y}$ among the candidate segmentations $y = (y_1, y_2, \ldots, y_N)$ ($N \le M$), which are the shopping query segments labeled with their corresponding entity category. Each segment $y_i$ corresponds to a triple $\langle b_i, e_i, t_i \rangle$ indicating the start index $b_i$ and end index $e_i$ of the segment, followed by the entity category $t_i \in T$. When $t_i = \perp$, the segment contains only one word.
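The relation between the segment triples and the BIO schema can be illustrated with a minimal sketch. The function names, the use of 0-based inclusive indices, and the string "O" standing in for the other label are assumptions made for this illustration only.

```python
from typing import List, Tuple

# A segment is (start, end, label) with 0-based inclusive word indices.
# "O" stands in for the "other" label; single-word "O" segments are optional.
Segment = Tuple[int, int, str]


def segments_to_bio(num_words: int, segments: List[Segment]) -> List[str]:
    """Expand labeled segments into one BIO tag per word."""
    tags = ["O"] * num_words
    for start, end, label in segments:
        if label == "O":
            continue  # words outside any entity keep the "O" tag
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    return tags


def bio_to_segments(tags: List[str]) -> List[Segment]:
    """Collapse BIO tags back into segments; each "O" word is its own segment."""
    segments: List[Segment] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # Close the open segment unless this tag continues it.
        if start is not None and tag != f"I-{label}":
            segments.append((start, i - 1, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag == "O":
            segments.append((i, i, "O"))
        elif tag.startswith("I-") and start is None:
            # An I- tag without a matching open segment starts a new one.
            start, label = i, tag[2:]
    if start is not None:
        segments.append((start, len(tags) - 1, label))
    return segments


if __name__ == "__main__":
    query = "digital lifestyle tv".split()
    gold = [(0, 1, "brand"), (2, 2, "product")]
    print(segments_to_bio(len(query), gold))
    # ['B-brand', 'I-brand', 'B-product']
```

The round trip between the two views is lossless for well-formed tag sequences, which is why word-level taggers and segment-level models can be compared on the same annotations.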

Structured Prediction Models
To tackle the tagging problem for shopping query logs, we use Conditional Random Fields (Lafferty et al., 2001, CRF) 1, learning to search (Daumé III et al., 2009, SEARN) 2, structured perceptron (Collins, 2002, STRUCTPERCEPTRON) and long short-term memory networks extended by a CRF layer (Hochreiter and Schmidhuber, 1997; Graves, 2012, LSTM-CRF).
1 taku910.github.io/crfpp/
2 github.com/JohnLangford/vowpal_wabbit
CRF is a popular algorithm for sequence tagging tasks (Lafferty et al., 2001). The objective is to find the label sequence $y = (y_1, \ldots, y_M)$ that maximizes
$$p(y \mid x; \lambda) = \frac{\exp(\lambda \cdot f(y, x))}{Z(x)},$$
where $Z(x) = \sum_{y'} \exp(\lambda \cdot f(y', x))$ is the normalization factor, $\lambda$ is the weight vector and $f(y, x)$ is the extracted feature vector for the observed sequence $x$.
SEARN is a powerful structured prediction algorithm, which formulates the sequence labeling problem as a search process. The objective is to find the label sequence $y = (y_1, \ldots, y_M)$ that maximizes the score of a cost-sensitive multiclass classifier trained against the ground-truth labels $\hat{y}$.
STRUCTPERCEPTRON is an extension of the standard perceptron. In our setting we model a segment-based search algorithm, where each unit is a segment of $x$ (e.g., $\langle b_i, e_i \rangle$), rather than a single word (e.g., $x_i$). The objective is to find the label sequence $y = (y_1, \ldots, y_M)$ that maximizes $w \cdot f(x, y)$, where $f(x, y)$ represents the feature vector for instance $x$ along with the configuration $y$, and $w$ is updated with the perceptron rule of Collins (2002), $w \leftarrow w + f(x, \hat{y}) - f(x, y)$, with $\hat{y}$ the ground-truth configuration and $y$ the current prediction.
LSTM-CRF: The above algorithms rely heavily on manually crafted features to perform sequence tagging. We alleviate this by using long short-term memory networks with a CRF layer. Our model is similar to R-CRF (Mesnil et al., 2015), but for the hidden recurrent layer we use LSTM (Hochreiter and Schmidhuber, 1997; Graves, 2012). We denote by $h_i$ the hidden vector produced by the LSTM cell at the $i$-th token. Then the conditional probability of $y$ given a query $x$ becomes
$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i=1}^{M} \big(W^{h}_{y_i} \cdot h_i + W^{t}_{y_i, y_{i-1}}\big)\Big),$$
where $Z(x)$ normalizes over all label sequences, $W^{h}_{y_i}$ is the weight vector corresponding to label $y_i$, and $W^{t}_{y_i, y_{i-1}}$ is the transition score corresponding to $y_i$ and $y_{i-1}$. During training, the values of $W^{h}$, $W^{t}$, the LSTM layer and the input word embeddings are updated through standard back-propagation with the AdaGrad algorithm. We also concatenate the pre-trained word embedding of each token with a randomly initialized 50-dimensional embedding of its knowledge-base type and use the result in the input layer. In our experiments, we set the learning rate to 0.05, treat each query as a mini-batch and train for 5 epochs.
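To make the CRF layer concrete, the following is a minimal sketch of how a label sequence is scored, normalized and decoded once per-token label scores are available. It is not the authors' implementation: the names `emit` and `trans`, the toy numbers, and the choice to omit a transition into the first token are assumptions. Here `emit[i][k]` plays the role of the per-token score of label k at position i (which an LSTM would supply), and `trans[p][k]` is the transition score from label p to label k.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def _logsumexp(values: Sequence[float]) -> float:
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emit: Matrix, trans: Matrix, labels: Sequence[int]) -> float:
    """Unnormalized log-score: per-token scores plus transitions between neighbors."""
    score = emit[0][labels[0]]
    for i in range(1, len(labels)):
        score += trans[labels[i - 1]][labels[i]] + emit[i][labels[i]]
    return score


def log_partition(emit: Matrix, trans: Matrix) -> float:
    """log Z(x), summed over all label sequences with the forward algorithm."""
    num_labels = len(emit[0])
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [
            emit[i][cur]
            + _logsumexp([alpha[prev] + trans[prev][cur] for prev in range(num_labels)])
            for cur in range(num_labels)
        ]
    return _logsumexp(alpha)


def log_prob(emit: Matrix, trans: Matrix, labels: Sequence[int]) -> float:
    """log p(y | x) = score(y) - log Z(x)."""
    return sequence_score(emit, trans, labels) - log_partition(emit, trans)


def viterbi(emit: Matrix, trans: Matrix) -> List[int]:
    """Highest-scoring label sequence via dynamic programming."""
    num_labels = len(emit[0])
    best = list(emit[0])
    back: List[List[int]] = []
    for i in range(1, len(emit)):
        new_best, pointers = [], []
        for cur in range(num_labels):
            prev = max(range(num_labels), key=lambda p: best[p] + trans[p][cur])
            new_best.append(best[prev] + trans[prev][cur] + emit[i][cur])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    path = [max(range(num_labels), key=lambda k: best[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy scores for a 3-word query and 3 labels (illustrative values only).
    emit = [[1.0, 0.2, -0.5], [0.1, 1.5, 0.3], [0.0, -0.2, 2.0]]
    trans = [[0.5, -0.1, 0.0], [0.2, 0.3, -0.4], [-0.3, 0.1, 0.6]]
    path = viterbi(emit, trans)
    print(path, math.exp(log_prob(emit, trans, path)))
```

Training would maximize `log_prob` of the annotated sequence; the forward recursion keeps the normalizer tractable, since enumerating all label sequences grows exponentially with query length.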

Features
Lexical (LEX): widely used N-gram features. We use unigrams of the current $w_0$, previous $w_{-1}$ and next $w_{+1}$ words, and the bigrams $w_{-1} w_0$ and $w_0 w_{+1}$.
Orthographic (ORTO): binary, mutually non-exclusive features that check whether $w_0$, $w_{-1}$ and $w_{+1}$ are all-digits, contain any-digit, are start-with-digit-end-in-letter or are start-with-letter-end-in-digit. They are designed to capture model names like hero3 and m560.
Positional (PSTNL): discrete features modeling the position of the words in the query. They capture the way people tend to write products and brands in the query.
Part-of-Speech (POS): capture nouns and proper names to better recognize products and brands. We use the Stanford tagger (Toutanova et al., 2003).
Knowledgebase (KB): powerful semantic features (Tjong Kim Sang, 2002; Carreras et al., 2002; Passos et al., 2014). We automatically collected and manually validated 200K brand, product, model and product family items extracted from the Macy's and Amazon websites.
WordEmbeddings (WE): While external knowledge bases are a great resource, they are expensive to create and time-consuming to maintain. We use word embeddings (Mikolov et al., 2013) 3 as a cheap, low-maintenance alternative to knowledge base construction. We train the embeddings over 2.5M unlabeled shopping queries. For each token in the query, we use as features the 200-dimensional embeddings of the top 5 most similar terms according to cosine similarity.
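The lexical, orthographic and embedding-neighbor features described above could be computed roughly as follows. This is a sketch under assumptions: the feature names, the boundary symbols `<s>` and `</s>`, and the tiny two-dimensional embedding table are invented for illustration, whereas the paper uses 200-dimensional embeddings trained on query logs.

```python
import math
from typing import Dict, List, Sequence


def lexical_features(words: Sequence[str], i: int) -> Dict[str, str]:
    """Unigrams of w-1, w0, w+1 and the two bigrams around position i."""
    prev_w = words[i - 1] if i > 0 else "<s>"  # assumed boundary symbol
    next_w = words[i + 1] if i + 1 < len(words) else "</s>"  # assumed boundary symbol
    cur = words[i]
    return {
        "w0": cur,
        "w-1": prev_w,
        "w+1": next_w,
        "w-1_w0": f"{prev_w} {cur}",
        "w0_w+1": f"{cur} {next_w}",
    }


def orthographic_features(word: str) -> Dict[str, bool]:
    """Binary, mutually non-exclusive shape checks for one word."""
    return {
        "all_digits": word.isdigit(),
        "any_digit": any(c.isdigit() for c in word),
        "start_digit_end_letter": bool(word) and word[0].isdigit() and word[-1].isalpha(),
        "start_letter_end_digit": bool(word) and word[0].isalpha() and word[-1].isdigit(),
    }


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(token: str, embeddings: Dict[str, List[float]], k: int = 5) -> List[str]:
    """Vocabulary terms closest to `token` by cosine similarity (excluding itself)."""
    if token not in embeddings:
        return []
    query = embeddings[token]
    scored = [(cosine(query, vec), other) for other, vec in embeddings.items() if other != token]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


if __name__ == "__main__":
    print(orthographic_features("hero3"))
    print(lexical_features("digital lifestyle tv".split(), 1))
    toy = {"tv": [1.0, 0.0], "television": [0.9, 0.1], "shirt": [0.0, 1.0]}  # toy vectors
    print(top_k_similar("tv", toy, k=1))
```

In the setting described above, the vectors of the returned neighbors would then be concatenated as real-valued features for the token.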

Experiments and Results
Data Set. To the best of our knowledge, there is no publicly available shopping query data annotated with product, brand, model, product family and other categories. To conduct our experiments, we collect 2.5M shopping queries through click logs (Hua et al., 2013). We randomly sampled 37,000 unique queries from the head, torso and tail of a commercial web search engine and asked two independent annotators to tag the data. We measured the Kappa agreement of the annotators and found an agreement of 0.92, which indicates that the annotations are reliable.
3 https://code.google.com/p/word2vec/
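For readers unfamiliar with chance-corrected agreement, the sketch below computes Cohen's kappa for two annotators. The paper does not state which kappa variant or which unit of comparison was used, so both Cohen's kappa and the token-level comparison are assumptions; the toy labels are invented.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["B-brand", "O", "B-brand", "O"]
    annotator_2 = ["B-brand", "O", "O", "O"]
    print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

Raw agreement in this toy case is 0.75, but half of that is expected by chance given how often each annotator uses each label, which is why kappa comes out lower.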
We randomly split the data into 80% for training and 20% for testing. We tune all parameters on the training set using 5-fold cross-validation and report performance on the test set. All results are calculated with the CoNLL evaluation script.
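The CoNLL evaluation counts an entity as correct only when both its boundaries and its type match the gold annotation. The following sketch mirrors that exact-match scoring over BIO tag sequences; it is a simplified stand-in written for illustration, not the evaluation script itself, and the function names are assumptions.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end, label) entity spans from one BIO-tagged query."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and start is not None and tag[2:] == label
        if continues:
            continue
        if start is not None:
            spans.add((start, i - 1, label))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: Iterable[List[str]], pred: Iterable[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exactly matching spans."""
    true_pos = num_gold = num_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans, pred_spans = extract_spans(gold_tags), extract_spans(pred_tags)
        true_pos += len(gold_spans & pred_spans)
        num_gold += len(gold_spans)
        num_pred += len(pred_spans)
    precision = true_pos / num_pred if num_pred else 0.0
    recall = true_pos / num_gold if num_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["B-brand", "I-brand", "B-product"]]
    pred = [["B-brand", "O", "B-product"]]
    print(span_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

In the example the predicted brand covers only the first word, so it earns no credit even though its type is right; partial matches count as errors under this metric.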
Performance w.r.t. Features. Table 1 shows the performance of the different models and feature combinations. We use the individual features as a baseline. The obtained results show that, on their own, these are insufficient to solve such a complex task. We compared the performance of the KB and WE features when combined with the LEX+ORTO+PSTNL information. Both KB and WE reach comparable performance. This shows that training embeddings on large in-domain data of shopping queries is a reliable and cheap substitute for knowledge base construction when such information is not present. In our study the best performance is reached when all features are combined. Among all machine learning classifiers for which we manually designed features, STRUCTPERCEPTRON reaches the best performance of 88.13 F1. In addition to the feature combination and model comparison, Figure 1 plots the training time of each model, in log scale, against its F1 score. SEARN is the fastest algorithm to train, while CRF takes the longest. Among all models, STRUCTPERCEPTRON offers the best balance between efficiency and performance in a real-time setting.
Performance w.r.t. Entity Category. Table 3 compares the performance of the algorithms with manually designed features against the automatically induced features of LSTM-CRF. We show the performance for each individual product entity category. Compared to all models and settings, LSTM-CRF reaches the best performance of 90.92 F1. The most challenging entity types are product family and model, due to their "wild" and irregular nature.
Performance w.r.t. Query Length. Finally, we study the performance of our approach with respect to query length. Figure 2 shows the F1 score of the two best performing algorithms, LSTM-CRF and STRUCTPERCEPTRON, against the different query lengths in the test set. Around 83% of the queries are between 2 and 5 words long; the rest are either very short or very long. As can be seen in Figure 2, our models reach the same performance for short and long queries. This shows that the models are robust and agnostic to the query length.

Conclusions and Future Work
In this work, we have defined the task of product entity recognition in shopping queries. We have studied the performance of multiple structured prediction algorithms to automatically recognize product, brand, model and product family entities. Our comprehensive experimental study and analysis showed that combining lexical, positional, orthographic, POS, knowledge base and word embedding features leads to the best performance. We showed that word embeddings trained on a large amount of unlabeled queries can substitute for knowledge bases when these are missing for specialized domains. Among all classifiers with manually designed features, STRUCTPERCEPTRON reached the best performance, while among all algorithms LSTM-CRF achieved the highest performance of 90.92 F1. Our analysis showed that our models reach robust performance independent of the query length. In the future we plan to tackle attribute identification to better understand queries like "diamond shape emerald ring", where diamond shape is a cut and emerald is a gemstone type. Such fine-grained information could further enrich the online shopping experience.