A Label-Informative Wide & Deep Classifier for Patents and Papers

In this paper, we provide a simple and effective baseline for classifying both patents and papers into the well-established Cooperative Patent Classification (CPC). We propose a label-informative classifier based on the Wide & Deep structure, where the Wide part encodes string-level similarities between texts and labels, and the Deep part captures semantic-level similarities via non-linear transformations. Our model is trained on millions of patents and transfers to papers through a distantly supervised training set and domain-specific features. Extensive experiments show that our model achieves performance comparable to the state-of-the-art model used in industry on both patents and papers. The output of this work should facilitate the searching, granting and filing of innovative ideas for patent examiners, attorneys and researchers.


1 Introduction
Classifying patents and papers into a technology taxonomy is a crucial step in organizing the massive body of knowledge and discovering innovative ideas. Patent examiners rely on the taxonomy to search for similar documents when granting or invalidating a patent application; attorneys use it to check whether the innovation points of an invention have been covered in previous literature; researchers use the taxonomy to monitor technology trends in certain fields; and companies use it to outline the intellectual property landscape of their own or of their competitors. The most commonly used taxonomies are the International Patent Classification (IPC) 1 and its newer version, the Cooperative Patent Classification (CPC) 2 . Figure 1 illustrates the CPC hierarchy and the discriminative descriptions attached to each node. The categorization is currently done mostly by hand by experts in patent offices. Due to the growing number of patents and the limited number of domain experts, there is an urgent need to automate the classification process. Also, as more and more technology innovations are published in patents, researchers with no background in patent classification may want to know which patents are most relevant to an academic paper. To this end, we aim to classify both patents and papers to the CPC subclass level, with more than 600 labels.
Classifying patents and papers to the CPC subclass level is challenging because (1) there is a large number of labels covering almost all technology domains, and the differences between labels are often subtle; and (2) although a massive amount of labelled data is available for patents, annotated data for paper-to-CPC is very limited. Since labelling papers with CPC labels requires expert knowledge, large-scale human annotation is very expensive.
In this paper, we leverage the CPC label descriptions and use the Wide-and-Deep network to integrate label information with semantic information from the input texts. We also construct a distantly supervised dataset for papers. Our contributions are:
• We demonstrate the effectiveness of the label features, integrated through the Wide-and-Deep structure, in CPC classification with more than 600 labels.
• We achieve comparable performance to the state-of-the-art on classifying both patents and papers to a widely-used technology taxonomy. Our model can serve as a simple and effective baseline for CPC classification tasks.
2 Related Work

2.1 Patent Classification
Most previous patent classification systems focus on developing features derived from patent structures and feeding them to traditional learning algorithms (Verberne and D'hondt, 2011). Other work goes beyond document-level features (Cai and Hofmann, 2007; Qiu et al., 2011). There are also some efforts on mapping papers to IPC in the context of patent retrieval combined with K-nearest-neighbour (KNN) classification: one line of work adopts a query-expansion approach to retrieve relevant patents and uses a KNN classifier to label the research paper. Xiao et al. (2008) combine different scoring methods to rerank the retrieved IPC labels and achieved the best performance in the NTCIR-7 workshop for classifying research papers into the IPC system (Nanba et al., 2008). To the best of our knowledge, there has been no attempt to tackle both tasks in one model.

2.2 Label Information
Leveraging label information is not new and is mostly accomplished by embedding labels and texts in the same space to measure their correlations (Yogatama et al., 2015). Ma et al. (2016) add prototypes to the label representation. Other work proposes to transform classification into a matching problem between texts and labels for multi-task learning, or further weights text features by the compatibilities between text and label embeddings via an attention mechanism.

Figure 2: Example of the label description (left boxes) and classification cues (shadowed texts). The true label for the input text (right box) is B01D. Although the given text discusses neural networks, which is semantically closer to G06N, the classification cue "ultrafiltration" decides for B01D eventually.

Our model differs from previous studies in that we use the string-level similarity between label descriptions and input text instead of label embeddings. Based on our analysis of the CPC classification system, we believe that string-level similarity can compensate for what semantic-level similarity cannot capture in patent classification tasks.

3.1 The Label-text Feature
We discover that label descriptions can provide precise cues for classifying a document that contains multiple semantic aspects. Figure 2 provides an example of the necessity of integrating label descriptions. It should be noted that when patent examiners classify documents into the CPC system, they are also advised to use cue words and to search among label descriptions 3 .
We integrate the label information through the label-text feature, which captures the string-level relatedness between label descriptions and texts. Here we use the BM25 score:

BM25(x, D_k) = Σ_i idf(x_i) · tf(x_i, D_k) · (k_1 + 1) / ( tf(x_i, D_k) + k_1 · (1 − b + b · |D_k| / avgdl) )

where x_i is the ith term in text x; D_k is the label description for class k; idf and tf are the inverse document frequency of x_i and the term frequency of x_i in D_k, respectively; avgdl is the average description length; |D_k| is the length of description D_k; and k_1 and b are hyperparameters.
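To make the feature concrete, here is a minimal Python sketch of the BM25 label-text feature. This is an illustration, not the authors' implementation: the function name `bm25_features`, the defaults k1 = 1.2 and b = 0.75 (common BM25 defaults), and the choice of computing idf over the set of label descriptions are all assumptions.

```python
import math
from collections import Counter

def bm25_features(text_tokens, label_descriptions, k1=1.2, b=0.75):
    """One BM25 score per label: string-level relatedness between
    the input text and each label description.
    label_descriptions: dict mapping label -> list of tokens."""
    n_labels = len(label_descriptions)
    avgdl = sum(len(d) for d in label_descriptions.values()) / n_labels
    # document frequency of each term over the label descriptions,
    # which play the role of the document collection here
    df = Counter()
    for desc in label_descriptions.values():
        df.update(set(desc))
    scores = {}
    for label, desc in label_descriptions.items():
        tf = Counter(desc)
        score = 0.0
        for term in text_tokens:
            if tf[term] == 0:
                continue
            idf = math.log((n_labels - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(desc) / avgdl))
        scores[label] = score
    return scores
```

A text mentioning "ultrafiltration" then gets a positive score for a B01D-like description and zero for descriptions sharing no terms with it, mirroring the cue-word behaviour in Figure 2.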

3.2 Wide and Deep Structure
We adopt the Wide and Deep (WnD) neural network (Cheng et al., 2016) for text classification. Given a training set (X_n, D, y_n), n = 1..N, where X_n is the input document text, D is the set of input label descriptions and y_n is the vector of true labels, the model outputs the probabilities for each of the K classes, ŷ ∈ R^K. The training target is to minimize the binary cross-entropy loss:

L = −(1/N) Σ_n Σ_k [ y_n^k log ŷ_n^k + (1 − y_n^k) log(1 − ŷ_n^k) ]

where y_n^k is the kth element of vector y_n.
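The loss above can be written out directly; this is a toy illustration (the function name `multilabel_bce` and the numerical clamping with `eps` are our own choices, not from the paper):

```python
import math

def multilabel_bce(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy summed over the K labels of each example,
    averaged over the N examples. y_true: 0/1 matrix as nested lists;
    y_pred: matrix of per-label probabilities in [0, 1]."""
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        for t, p in zip(yt, yp):
            p = min(max(p, eps), 1 - eps)  # avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)
```

Confident correct predictions yield a loss near zero; uncertain ones are penalized more.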
An overview of the WnD classifier is shown in Figure 3. The model has two parts: the wide part and the deep part. The wide part takes in the label-text feature to capture string-level relatedness between the label descriptions and the text; the deep part maps the input text to word embeddings and applies a non-linear transformation to capture semantic-level relatedness between the text and the label.
The Wide part: The Wide part is a regression model of the form ŷ_wide = σ(W_wide^T z_wide + b), where z_wide is the vector of label-text interaction features described in Section 3.1: z_wide = BM25(D, X_n).
The Deep part: The Deep part is a non-linear transformation of the input text that aims to capture its semantics. It can be any classic neural network for text encoding, such as an RNN, a CNN, or a simple fully connected network. In this paper, we use textCNN (Kim, 2014) for the transformation because it is a simple baseline that works reasonably well. The Deep part transforms the text into a fixed-length representation z_deep, which is then mapped to K classes using a sigmoid activation: ŷ_deep = σ(W_deep^T z_deep + b). The Wide and Deep parts are concatenated at the top and jointly trained through ŷ = σ(W_deep^T z_deep + W_wide^T z_wide + b), so that the semantic-level and string-level relatedness complement each other when making the decision. In this way, the model simulates the behaviour of patent examiners classifying a document: when they are uncertain which labels to assign (i.e. when semantic knowledge cannot provide a certain answer), examiners resort to searching for cue words in the label descriptions.
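The joint prediction can be illustrated with a small dependency-free sketch. The Deep part is reduced to a given encoding vector standing in for textCNN, and all names and dimensions here are illustrative assumptions rather than the paper's configuration:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def wide_and_deep_scores(z_wide, z_deep, w_wide, w_deep, bias):
    """Joint WnD output for K classes:
    y_hat[k] = sigmoid(w_deep[k] . z_deep + w_wide[k] . z_wide + bias[k]).
    z_wide: label-text (BM25) features; z_deep: fixed-length text
    encoding (a stand-in for the textCNN output)."""
    out = []
    for k in range(len(bias)):
        logit = bias[k]
        logit += sum(w * z for w, z in zip(w_deep[k], z_deep))
        logit += sum(w * z for w, z in zip(w_wide[k], z_wide))
        out.append(sigmoid(logit))
    return out
```

Because the two parts share one sigmoid layer, a strong string-level cue in z_wide can raise a class's probability even when the semantic encoding z_deep is uninformative, which is exactly the examiner behaviour the paragraph above describes.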

4 Experiments
Implementation details: We remove stopwords and punctuation and keep the first 120 words per document. The word embeddings are 300-dimensional and randomly initialized. Kernel sizes for textCNN are 2, 3, 4 and 5, and the number of filters is 1024. For each CPC subclass, we use its own description and those of all its child labels. We train the model with the Adam optimizer.
Datasets: From the USPTO patent set, we randomly sample 6.7 million abstracts as the patent training set and 60k as the testing set. For the paper testing set, a gold standard is hard to obtain. We discover that some papers cited by patents are assigned CPC labels by the European Patent Office; we collect those from the website 4 and derive 4,956 testing instances for paper-to-CPC classification. The datasets are described in Table 1.
Evaluation metrics: As each patent/paper has one or more CPC labels, we measure our model from both the classification and the ranking perspectives with 3 metrics: (1) example-based precision/recall: the average precision/recall per instance. We measure precision and recall on the top-1 and top-3 predictions, and precision on all predictions with probability score ≥ 0.5. (2) macro precision/recall: the average precision/recall per class. (3) mean average precision (MAP): a ranking-based metric that measures whether the right labels are placed before the wrong ones.
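For reference, the example-based and ranking metrics can be computed as follows; this is a sketch with our own function names and a set-based label representation, not the evaluation code used in the paper:

```python
def precision_recall_at_k(true_labels, ranked_preds, k):
    """Example-based precision/recall at k, averaged over instances.
    true_labels: list of sets of gold labels; ranked_preds: list of
    label lists sorted by descending model score."""
    p = r = 0.0
    for gold, ranked in zip(true_labels, ranked_preds):
        hits = len(gold & set(ranked[:k]))
        p += hits / k
        r += hits / len(gold)
    n = len(true_labels)
    return p / n, r / n

def mean_average_precision(true_labels, ranked_preds):
    """MAP: rewards placing the correct labels before the wrong ones."""
    total = 0.0
    for gold, ranked in zip(true_labels, ranked_preds):
        hits, ap = 0, 0.0
        for i, label in enumerate(ranked, start=1):
            if label in gold:
                hits += 1
                ap += hits / i
        total += ap / len(gold)
    return total / len(true_labels)
```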

4.1 Classifying Patents to CPC
We compare our WnD classifier on patents with two baselines: traditional textCNN and attention-textCNN. By comparing WnD with textCNN, we want to know whether the label-text feature can complement semantic information for classification; by comparing with attention-textCNN, we want to compare our label integration method with other label-embedding-based methods. For the attention model, we borrow the idea of the label-embedding attentive model. The attention is a T-dimensional vector, where T is the text length; it calculates the importance of each word for the classification task.
The WnD achieves significant gains from the label information (see Table 2). This suggests that the complementary string-level relatedness between labels and texts indeed benefits the final classification decision. Our model also outperforms attention-textCNN. Although label embeddings are helpful for small label sets (around 10 labels), they are less effective on hundreds of labels. We suspect the reason is that the attention is not discriminative between classes: when the label set is large, many non-stopwords may be important for classification, but their weights should vary across classes, which can hardly be captured by a single attention vector. Also, attention-textCNN has many more parameters than textCNN and tends to overfit the training data.
The best reported number on patent-to-IPC classification at the subclass level is 0.74 precision (Verberne and D'hondt, 2011). Although the results cannot be compared directly, our sufficiently large testing set gives us confidence that our model is comparable to the state-of-the-art system while being more scalable.

4.2 Classifying Papers to CPC
There is not enough labelled data for the paper-to-CPC task. We could directly apply the model trained on patents to papers, but the performance would degrade significantly due to domain differences; for example, what papers call a camera is commonly referred to as a photo capturing apparatus in patents. To deal with the domain-adaptation issue, we propose two approaches:
Distant supervision: We auto-label papers using patent-paper citations. For each paper, we label it with the CPCs of the patents that cite it, assuming that papers cited by a patent are relevant to that patent in terms of background and technology domain. We obtain a total of 1.7 million papers with abstracts that are cited by the patents in the USPTO patent set, and each paper receives on average 2.8 labels. We then fine-tune the model originally trained on patents on these auto-labelled papers.
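The distant-supervision step amounts to propagating CPC labels across the patent-paper citation graph. A minimal sketch, with hypothetical names (`distant_labels`, `patent_cpcs`) that do not come from the paper:

```python
from collections import defaultdict

def distant_labels(citations, patent_cpcs):
    """Auto-label papers with the CPC subclasses of the patents
    that cite them.
    citations: iterable of (patent_id, paper_id) pairs;
    patent_cpcs: dict patent_id -> set of CPC subclass labels."""
    paper_labels = defaultdict(set)
    for patent_id, paper_id in citations:
        # a paper cited by several patents accumulates all their labels
        paper_labels[paper_id] |= patent_cpcs.get(patent_id, set())
    return dict(paper_labels)
```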
Domain-adapted features: The WnD structure enables us to incorporate domain-adapted features into the Wide part. Here we propose two ways to add such features:
• prototyping: We pick the top 20 terms from the papers for each of the K classes according to their tf-idf scores. These representative terms are used as the label descriptions for papers;
• label expansion: We train word embeddings with skip-gram (Mikolov et al., 2013) on papers and expand the original label descriptions with the 10 nearest words in the embedding space according to cosine distance.
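The prototyping step can be sketched as below. This is an assumption-laden illustration: we treat all papers of a class as one pseudo-document and score terms by tf × idf across classes; the paper does not specify the exact tf-idf variant it uses.

```python
import math
from collections import Counter

def class_prototypes(class_docs, top_n=20):
    """Pick the top-n tf-idf terms per class from in-domain (paper)
    text, to stand in as label descriptions for the target domain.
    class_docs: dict label -> list of token lists (papers of that class)."""
    n_classes = len(class_docs)
    df = Counter()       # in how many classes each term appears
    class_tf = {}
    for label, docs in class_docs.items():
        tf = Counter()
        for doc in docs:
            tf.update(doc)
        class_tf[label] = tf
        df.update(tf.keys())
    protos = {}
    for label, tf in class_tf.items():
        # terms shared by every class get idf = log(1) = 0 and drop out
        scored = {t: c * math.log(n_classes / df[t]) for t, c in tf.items()}
        protos[label] = [t for t, _ in sorted(scored.items(),
                         key=lambda kv: -kv[1])[:top_n]]
    return protos
```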
We compare our paper-to-CPC model with the best-performing KNN+reranking model (Xiao et al., 2008) introduced in Section 2.1. We also want to compare our model with large-scale classification systems used in industry. To do so, we crawl from Google Patent the machine-classified CPC labels of the scholar papers in our testing set 5 , and we assume that the labels are ranked according to their order on the web page. Google classifies papers at the finest level; in order to compare our subclass results with it, we use only the subclass part of the first label, which is supposed to be the most confident one.
The comparison results are shown in Table 3. WnD benefits from both transfer learning and the domain-adapted features. Since the prototypes are collected automatically, it is possible to apply WnD to classification tasks where detailed label descriptions are not available.
Google Patent scores better on precision/recall@1, but performs less well on macro precision/recall.
Since Google classifies the papers at a finer-grained level, its classifier may receive more information during training and thus perform better at coarser-grained levels.

Table 3: Test results on papers. textCNN and WnD are models trained on patents and directly applied to papers. WnD+transfer refers to WnD fine-tuned on auto-labelled papers. *The training set and granularity of the Google Patent model may differ from the other models; we include it for convenience of comparison and discussion.

To investigate the effect of classification granularity on performance, we mapped the WnD subclass predictions to the class level (128 labels) and trained another WnD at the class level. The p@1 and r@1 are 86.94% and 79.15% for WnD subclass-to-class, and 82.43% and 74.72% for WnD class. The gap indicates a possible positive effect of fine-granularity training.
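The subclass-to-class mapping above exploits the structure of CPC codes: a subclass code (e.g. B01D) is the three-character class code (B01) plus a letter, so mapping predictions to the class level is a truncation followed by rank-preserving deduplication. A minimal sketch (`subclass_to_class` is a hypothetical helper name):

```python
def subclass_to_class(predictions):
    """Map ranked CPC subclass predictions (e.g. 'B01D') to class
    level ('B01') by dropping the final letter, deduplicating while
    keeping the original rank order."""
    seen, out = set(), []
    for label in predictions:
        cls = label[:3]  # section letter + 2-digit class number
        if cls not in seen:
            seen.add(cls)
            out.append(cls)
    return out
```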

5 Conclusions and Future Work
In this paper, we propose a WnD classifier to map both patents and papers to CPC subclasses. The model captures both the string-level and semantic-level relatedness between labels and texts. We achieve performance comparable to the state-of-the-art models for both paper-to-CPC and patent-to-CPC tasks. We hope to contribute an intuitive, simple yet practically effective baseline for categorizing scientific publications.
Although the CPC subclass level already has over 600 labels, it is still a relatively coarse granularity in the taxonomy. The finest level (subgroup) consists of over 200 thousand labels and provides much more detailed classification information; at the same time, with this explosion of labels the task becomes much more challenging. In the future, we will go deeper into the taxonomy, exploring the hierarchical relations between labels and improving the scalability of models for finer-grained label sets.