Embedding Meta-Textual Information for Improved Learning to Rank

Neural approaches to learning term embeddings have led to improved computation of similarity and ranking in information retrieval (IR). So far neural representation learning has not been extended to meta-textual information that is readily available for many IR tasks, for example, patent classes in prior-art retrieval, topical information in Wikipedia articles, or product categories in e-commerce data. We present a framework that learns embeddings for meta-textual categories, and optimizes a pairwise ranking objective for improved matching based on combined embeddings of textual and meta-textual information. We show considerable gains in an experimental evaluation on cross-lingual retrieval in the Wikipedia domain for three language pairs, and in the Patent domain for one language pair. Our results emphasize that the mode of combining different types of information is crucial for model improvement.


Introduction
The recent success of neural methods in IR rests on bridging the gap between query and document vocabulary by learning term embeddings that allow for improved computation of similarity and ranking (Mitra and Craswell, 2018). So far vector representations in neural IR have been confined to textual information, neglecting information beyond the raw text that is potentially useful for retrieval and is readily available in many data situations. For example, meta-textual information is available in the form of patent classes in prior-art search, topic classes in search over Wikipedia articles, or product classes in retrieval in the e-commerce domain. A straightforward approach to using such meta-textual information for retrieval would require an exact match between the meta-textual categories of queries and relevant documents. However, in the majority of cases the sets of categories assigned to queries and documents overlap only to a small degree, for both relevant and irrelevant documents. Thus, more sophisticated techniques are needed to aid similarity computation and ranking with meta-textual category information.
In this paper, we show how to apply neural embedding methods to meta-textual categories. We show that meta-textual information is an incomplete representation of queries and documents and does not suffice for a standalone computation of similarity and ranking. However, incorporating pre-trained embeddings of meta-textual categories into a neural learning-to-rank approach yields significant improvements over a learning-to-rank approach that uses text-only embeddings (Sasaki et al., 2018). In our approach, enhanced embeddings are created by concatenating text embeddings with meta-textual category embeddings, and a deep multilayer perceptron with a fully connected weight layer on top of the concatenated enhanced embeddings of a query-document pair is optimized for pairwise ranking. We present experiments on three different language pairs for cross-lingual retrieval in the Wikipedia domain, and show improvements of up to 2 NDCG points by incorporating meta-textual embeddings in a learning-to-rank framework. Additional experiments on a single language pair in the patent domain also show improvements of up to 1.3 NDCG points by incorporating embeddings for patent classifications, and up to 6.3 NDCG points if separate models for text and meta information are combined in an ensemble.
Learning-to-Rank via Text and Meta-Text Embeddings

Convolutional Embeddings of Textual Information
Similar to the approach of Sasaki et al. (2018), which functions as the baseline in our work, we employ a neural learning-to-rank model that learns a relevance score S(c_q, c_d) for vector representations of an English query c_q and a foreign-language document c_d. These vector representations are computed by a convolutional feature map over a "sentence matrix" whose rows are vector representations of the words in a query or document (pre-trained using word2vec (Mikolov et al., 2013) on the corpora described below) and whose number of columns equals the length of the query or document. This choice of embeddings might seem simplistic compared to recent approaches to contextual word embeddings (Devlin et al., 2019; Peters et al., 2018); however, it is motivated by the goal of understanding the relative benefit of meta-text embeddings based on a manageable architecture for learning-to-rank with text-only embeddings. Let x_{1:n} = [x_1; x_2; ...; x_n] be the concatenation of word vectors for a query or a document of n words, where each word vector is of dimensionality k, and let x_{i:i+h-1} = [x_i; x_{i+1}; ...; x_{i+h-1}] denote the concatenation of word vectors in a window of width h starting from position i. The parameters of a convolution comprise a filter W ∈ R^{hk×m} that is applied to a window of h words to extract feature vectors p_i = f(x_{i:i+h-1} · W + b), where p_i ∈ R^m, b ∈ R^m is a bias term, and f is a non-linear function such as the hyperbolic tangent. The final feature representation is an m-dimensional vector c obtained by average pooling over time, c = avg_{1≤i≤n-h+1} p_i. Applying this procedure to the "sentence matrices" of queries and documents yields our representations c_q and c_d. A further step learns a multi-layer perceptron with a fully connected layer on top of the concatenation [c_q; c_d], defining a relevance score S(c_q, c_d) = O · f(U · [c_q; c_d]), where O ∈ R^{1×s}, U ∈ R^{s×2m}, and s is the dimensionality of the hidden state.
Using the shorthand Θ = {W, b, O, U} for the parameters to be learned, we can write the function to be optimized as the pairwise ranking objective

min_Θ Σ_{(q, d+, d−)} max(0, 1 − S(c_q, c_{d+}) + S(c_q, c_{d−})),

where d+ and d− are relevant and non-relevant documents, respectively. During training, the model learns the convolutional filters and the similarity function. An overview of the overall network architecture is given in Figure 1.

Figure 1: Neural learning-to-rank architecture with embeddings of text (q, d) and categories (q^cat, d^cat) of queries and documents. CNN components generate compressed representations of texts (c_q, c_d) and categories (c_q^cat, c_d^cat); the MLP component learns a relevance score S between queries and documents.
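The forward pass of this architecture can be sketched in a few lines of NumPy. All dimensions and parameter values below are illustrative stand-ins (the real model is trained by backpropagation on pre-trained word2vec rows), and the hinge form of the pairwise objective is one standard instantiation of the margin-based ranking loss described above:

```python
import numpy as np

rng = np.random.default_rng(0)

k, m, h, s = 100, 64, 4, 32  # embedding dim, filters, window width, hidden size

# Illustrative parameters: W, b for the convolution; U, O for the MLP score.
W = rng.normal(0, 0.1, (h * k, m))
b = np.zeros(m)
U = rng.normal(0, 0.1, (s, 2 * m))
O = rng.normal(0, 0.1, (1, s))

def embed(x):
    """Convolve a sentence matrix x (n words x k dims) and average-pool
    over time to obtain an m-dimensional representation c."""
    n = x.shape[0]
    # p_i = f(x_{i:i+h-1} . W + b), with f = tanh, for each window of h words
    p = np.stack([np.tanh(x[i:i + h].reshape(-1) @ W + b)
                  for i in range(n - h + 1)])
    return p.mean(axis=0)  # average pooling over time

def score(c_q, c_d):
    """Relevance score S(c_q, c_d) = O . tanh(U [c_q; c_d])."""
    return float(O @ np.tanh(U @ np.concatenate([c_q, c_d])))

def hinge_loss(c_q, c_pos, c_neg, margin=1.0):
    """Pairwise ranking objective: prefer d+ over d- by a margin."""
    return max(0.0, margin - score(c_q, c_pos) + score(c_q, c_neg))

# Toy query/document sentence matrices (random stand-ins for word2vec rows).
q, d_pos, d_neg = (rng.normal(size=(12, k)) for _ in range(3))
loss = hinge_loss(embed(q), embed(d_pos), embed(d_neg))
```

In training, the gradient of this loss would update the convolutional filters and the MLP while the word embeddings stay fixed, mirroring the setup described above.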

Incorporating Convolutional Embeddings of Meta-Textual Data
To enrich the representation of queries and documents, we further generate dense vectors, i.e. embeddings, for meta-textual information. In the case of Wikipedia, we make use of the category graph to extract embeddings that potentially add useful information to the retrieval task. We opt for a machine learning method to generate graph embeddings for each node in our category graph. From the several available methods for learning graph embeddings, we follow the DeepWalk idea described by Perozzi et al. (2014). Our DeepWalk-based strategy consists of 3 steps:

1. For all categories in the category graph, apply a random walk of predefined length to generate context sequences.
2. Apply word2vec's (Mikolov et al., 2013) skip-gram with negative sampling method to learn a model that predicts the context given a category.
3. Use the trained model to calculate an embedding for each category.
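The random-walk step of this recipe can be sketched with the standard library alone. The toy category graph and the `random_walks` helper below are hypothetical; in practice the skip-gram step (steps 2-3) would be delegated to a library such as gensim:

```python
import random

def random_walks(graph, walk_len=40, seed=0):
    """Generate one fixed-length random walk per category node
    (step 1 of the DeepWalk recipe); walks stop early at dead ends."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        walk = [start]
        while len(walk) < walk_len:
            neighbors = graph.get(walk[-1], [])
            if not neighbors:
                break
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return walks

# Tiny hypothetical category graph as adjacency lists.
graph = {
    "Physics": ["Science", "Mechanics"],
    "Science": ["Physics", "Knowledge"],
    "Mechanics": ["Physics"],
    "Knowledge": ["Science"],
}
walks = random_walks(graph, walk_len=40)
# The walks then serve as "sentences" for skip-gram training, e.g.
# gensim.models.Word2Vec(walks, sg=1, negative=5).
```

Treating each walk as a sentence lets an off-the-shelf word2vec implementation produce one embedding per category node, exactly as for words in a corpus.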
More computationally advanced algorithms such as struc2vec (Ribeiro et al., 2017) or GraphGAN (Wang et al., 2018) show better performance on the Wikipedia node classification task; however, the improvements over DeepWalk are small: 1-2% in accuracy, 0.01 in Macro-F1. 1 Moreover, the Wikipedia category graph is noisy and contains oddities such as loops, missing connections, and wrong connections. We thus chose DeepWalk for its simplicity and robustness. We did not apply any weighting of node connections as suggested in node2vec (Grover and Leskovec, 2016). A single random walk of length 40 was generated for each category instance. All hyperparameters were tuned on the dev set.
Each query's and document's category information is now encoded as a set of embeddings. Typically, the number of elements in these sets is much smaller than the number of words in the associated texts, but we nevertheless reuse the convolutional architecture for words to obtain compressed category representations. The concatenation of n categories, x^cat_{1:n} = [x^cat_1; x^cat_2; ...; x^cat_n], is then processed via convolutions. In contrast to the processing of word sequences, we eliminate order-specific information for categories by fixing the window size to h = 1, effectively treating the ordered category sequence as an unordered set. Training a convolution p^cat_i = f(x^cat_i · W^cat + b^cat) followed by average pooling yields our compressed category representations c^cat_q and c^cat_d. As before, we fix the embeddings and train only the convolutional parameters and the similarity function S to obtain meaningful representations of categories for the given queries and documents.
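A minimal NumPy sketch makes the effect of fixing h = 1 concrete: the resulting representation is invariant to the order in which categories are listed. Dimensions and parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
k_cat, m_cat = 30, 16  # category embedding dim, number of filters

# Illustrative convolution parameters W_cat, b_cat.
W_cat = rng.normal(0, 0.1, (k_cat, m_cat))
b_cat = np.zeros(m_cat)

def embed_categories(cats):
    """Window size h = 1 convolution plus average pooling over a
    category sequence: p_i = tanh(x_i . W_cat + b_cat), c = mean_i p_i."""
    p = np.tanh(cats @ W_cat + b_cat)   # one window per category
    return p.mean(axis=0)

cats = rng.normal(size=(5, k_cat))      # five category embeddings
shuffled = cats[[3, 1, 4, 0, 2]]        # same set, different order

# With h = 1, the representation treats the sequence as an unordered set:
assert np.allclose(embed_categories(cats), embed_categories(shuffled))
```

With any window width h > 1, the stacked windows would mix adjacent categories and the assertion above would no longer hold, which is why the window is fixed to 1 for set-valued meta information.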

Data Sets
We conduct CLIR experiments on two different domains, namely Wikipedia articles and patents. For the Wikipedia retrieval task, we extend data from Sasaki et al. (2018) with category information from Wikipedia. For the patent retrieval task, we extend data previously published by Sokolov et al. (2013) with information from the International Patent Classification (IPC) system. Table 1 lists the statistics of the extended datasets. Both data extensions are publicly available. 2

Cross-Lingual Retrieval in Wikipedia
The task of cross-lingual retrieval on Wikipedia data is as follows: given a query in a source language, identify the corresponding article in the target language and all other articles that are relevant to this target article, i.e. articles having incoming and outgoing links to the article in the target language. Pages that connect articles which are irrelevant to each other, e.g. disambiguation pages, were removed from the dataset. Removal of seemingly useless categories such as stub, tracking, or maintenance categories had no effect on retrieval performance in the English-Japanese case; we therefore did not apply any category filtering on any language pair in our final experiments.

Multiple Relevance Levels
The textual data we use is taken from Sasaki et al. (2018). In their approach, they explicitly consider only the highly relevant articles, i.e. articles that have a direct inter-language link (relevance level r = 2) in Wikipedia. However, the data set they published contains additional relevance judgements for articles that are considered less relevant, i.e. articles that are linked from the most relevant article and contain a link referring back to that article (relevance level r = 1). Our approach makes use of this additional information by extending the learning-to-rank strategy to not only compare relevant (d+) and irrelevant (d−) documents, but also to compare highly relevant (d_{r=2}) and less relevant (d_{r=1}) documents. To keep the data size manageable during training, we sample a subset of pairs from the set of all possible pairings.
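The extended pairing strategy can be sketched as follows. The helper name, the per-document sample size, and the toy data are illustrative assumptions; the point is that every pair orders a higher relevance level above a lower one, and that only a sampled subset of all pairings is kept:

```python
import random

def sample_training_pairs(docs_by_level, n_pairs=4, seed=0):
    """Build (preferred, dispreferred) document pairs from graded
    relevance levels: r=2 beats r=1, and both beat r=0. Only a sampled
    subset of all possible pairings is kept to bound the data size."""
    rng = random.Random(seed)
    pairs = []
    for hi, lo in [(2, 1), (2, 0), (1, 0)]:
        for d_hi in docs_by_level.get(hi, []):
            lows = docs_by_level.get(lo, [])
            for d_lo in rng.sample(lows, min(n_pairs, len(lows))):
                pairs.append((d_hi, d_lo))
    return pairs

# Toy query with one r=2 article, two r=1 articles, ten irrelevant ones.
docs = {2: ["d_a"], 1: ["d_b", "d_c"], 0: ["d_%d" % i for i in range(10)]}
pairs = sample_training_pairs(docs)
```

Each pair then feeds the pairwise ranking objective, with the first element playing the role of d+ and the second that of d−.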

Category Overlap
To estimate the effectiveness of the random walk strategy for generating category embeddings and to investigate the overall utility of such embedded categories, we computed the category overlap between queries and relevant documents. This number is calculated as the intersection of translated query categories (using the Wikipedia interlanguage links) and the categories of the candidate document. As shown in Figure 2, the category overlap aggregated over documents of relevance levels r = 1 and r = 2 is 0 for over 70% of cases across all language pairs. A deeper analysis shows that the category overlap between a query and a highly relevant document (r = 2) is in general higher. For example, on the English-Japanese retrieval task, we observe about 14% of all query-document pairs with a category match between 90-100%. However, in 43% of the cases, there is no category overlap even for highly relevant documents. The majority of cases with 0 category overlap is due to the overlap between a query and its less relevant documents (r = 1). For English-Japanese, no overlap is found in 76% of the cases. For the English-German and the English-French retrieval task, the overlap is even lower, with 86-87% of pairs having no category overlap. These results emphasize the necessity for inexact matching methods such as our learned similarity metric over category embeddings.
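The overlap statistic can be sketched as a simple set computation. The normalization by the document's category count is an assumption for illustration, as is the toy interlanguage-link mapping:

```python
def category_overlap(query_cats, doc_cats, translate):
    """Fraction of a document's categories matched by the translated
    query categories; `translate` maps source-language to target-language
    categories, e.g. via Wikipedia interlanguage links."""
    translated = {translate[c] for c in query_cats if c in translate}
    if not doc_cats:
        return 0.0
    return len(translated & set(doc_cats)) / len(doc_cats)

# Hypothetical example: two of four document categories match.
translate = {"Physics": "Physik", "Optics": "Optik"}
overlap = category_overlap(
    {"Physics", "Optics", "History"},
    {"Physik", "Optik", "Mechanik", "Astronomie"},
    translate,
)
```

A value of 0 for most query-document pairs, as observed in Figure 2, is what rules out exact category matching as a standalone retrieval signal.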

Patent Prior Art Search
Cross-lingual patent retrieval is a classic task in CLIR and of great economic relevance. If a company wants to file a patent application, it is important that the new patent cites all previous patents that are relevant to the claim of its originality. The task of identifying relevant patents is called "patent prior art search". In practice, the patent applicant adds all citations that are relevant to the best of their knowledge, and this list is then refined by patent examiners specifically trained on certain areas of technology.

Patent Data
Relevance levels were taken from the publicly available BoostCLIR dataset (Sokolov et al., 2013). This is a bilingual Japanese-English corpus extracted from the MAREC 3 patent data and the data from the NTCIR PatentMT 4 workshop. We make use of all three levels of relevance available: the most relevant patents are family patents (relevance level r = 3), very relevant patents are the ones cited in search reports by patent examiners (r = 2), and the least relevant patents are the citations added by applicants (r = 1). While we use the same distribution of patents for train/dev/test as in BoostCLIR, we changed the content of queries and documents. In the original dataset, both queries and documents consist of patent abstracts. However, a patent's technical terms, its scope, and its extent of protection are defined by its claims. We thus evaluate a scenario that we consider more realistic, in which the textual part of queries is represented by the patent's title and abstract, and the textual part of documents is represented by the patent's first claim.

International Patent Classification
Meta information for our patent retrieval task consists of International Patent Classification (IPC) codes. 5 More precisely, we use the sub-group level of the IPC-based domestic classifications ECLA for English and FI for Japanese. Our dataset contains 50,329 unique IPC subdivisions for English and 23,020 for Japanese. We employ the same DeepWalk training strategy to learn classification embeddings as for Wikipedia categories. For strictly hierarchical graphs like the IPC tree, there exist algorithms that better capture hierarchical structures (Nickel and Kiela (2017), Alsuhaibani et al. (2019), Li et al. (2016), inter alia). However, we chose our previous method for its simplicity and for comparability to previous results. Evaluating truly hierarchical embeddings for patent retrieval is planned as future work.

Experiments
To evaluate the efficacy of our model, we conduct experiments on Wikipedia data for three language pairs, and on patent data for a single language pair. We conduct additional experiments in the patent domain to evaluate alternative models that combine information differently. Finally, we integrate standard tf-idf and evaluate if our models scale up to realistic retrieval scenarios in prior art search.

Experimental Setup for Wikipedia Retrieval
We evaluate our model on a cross-lingual retrieval task where we enrich the data provided by Sasaki et al. (2018) with category information from Wikipedia. The textual part of their published data of artificially generated source queries (English) and shortened target documents (Japanese, German, French) remains untouched. However, our model integrates the less relevant documents (r = 1) and compares them to the most relevant documents (r = 2) and irrelevant documents (r = 0). We remove all punctuation, and tokenize and lowercase both queries and documents. The same procedure was applied to the Wikipedia corpus of complete articles that we used for unsupervised pre-training of language-dependent word embeddings with gensim (Řehůřek and Sojka, 2010). The statistics of our data are as follows. During training, we use 14.8k documents with r = 2 and 15.7k with r = 1 for En-Ja, 32.6k and 34.4k for En-De, and 38.0k and 29.8k for En-Fr, respectively. For practical reasons we do not make use of all available documents with r = 1 but sample from the set. We combine relevant documents with 4 irrelevant documents (r = 0) for pairwise ranking during training. Depending on the language pair, we evaluate on average 5-8 times more irrelevant than relevant documents per query during testing. The numbers of training instances per language pair are 899k, 1.976M, and 2.196M for En-Ja, En-De, and En-Fr, respectively. The numbers of instances in the dev and test sets are similar, because there we take more irrelevant documents into account (10-15 times more irrelevant than relevant documents).
Our model uses word embeddings of 100 and category embeddings of 30 dimensions. Kernel size for word convolutions is 4, for category convolutions 1. All hyperparameters were tuned on the dev set. The similarity function's MLP implements a 4-layer architecture with 1,600 hidden dimensions where we apply a dropout (Srivastava et al., 2014) rate of 0.5 during training. We trained for 20 epochs with Adam (Kingma and Ba, 2015) and selected the best model on the dev set (early stopping). All embeddings were pre-trained and kept fixed during optimization of our ranking model.
We evaluate our models on three different runs with different random seeds, such that the irrelevant (r = 0) and less relevant articles (r = 1) vary across runs in the training and dev sets. For the test set, which always contains all relevant articles, we varied the number of irrelevant articles from 40 to 1,000 to increase the difficulty of the task. Table 2 shows NDCG (Järvelin and Kekäläinen, 2002) results for learning-to-rank on the Wikipedia cross-lingual retrieval task for the language pairs English-Japanese (En-Ja), English-German (En-De), and English-French (En-Fr). We use the original trec_eval script 6 version 9.07 for evaluation and report standard NDCG without cut-off. The column labeled "text only" reports results for applying the model of Sasaki et al. (2018), which learns textual convolutions and the similarity function as part of the learning-to-rank task, to our multi-level dataset. The best results of our experiments are listed in column "text+meta". They are obtained by jointly learning the convolutional matrices and the deep layer of the ranking score model (parameters O, U). As shown in Table 2, we achieve significant gains of up to 2 NDCG points averaged over three runs, depending on the language pair and test set sizes. Gains for individual runs were up to 2.6 NDCG points.

Table 3: NDCG results on the Japanese-English patent retrieval task for ranking with text and text+meta embeddings, and against different numbers of irrelevant documents. Scores are averaged over six runs. For clarity, we list only the first three runs. The model type "joint" learns convolutional filters and the similarity function jointly for text and meta data, while the model type "stacked" combines the individual scores of "text only" and "meta only" models linearly. Significance levels are calculated for corresponding runs (1/2/3), where † denotes p < 0.0001. Baseline is the "text only" model.
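For reference, the evaluation metric can be sketched as follows. This uses one common graded-gain formulation of DCG (Järvelin and Kekäläinen, 2002); the exact gain and discount variant implemented by trec_eval may differ in detail:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with graded relevance:
    sum over ranks i of (2^r - 1) / log2(i + 1), with ranks from 1."""
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(relevances))

def ndcg(ranked_relevances):
    """NDCG without cut-off: DCG of the system ranking divided by the
    DCG of the ideal (relevance-sorted) ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# A ranking that places the r=2 document on top is ideal; demoting it
# below an r=1 document is penalized by the rank discount.
perfect = ndcg([2, 1, 0, 0])
demoted = ndcg([1, 2, 0, 0])
```

The graded gains are what make the multiple relevance levels (r = 2 vs. r = 1) of our dataset visible in the metric, rather than collapsing them into binary relevance.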
Significance levels were obtained by running a paired randomization test on corresponding runs as described by Smucker et al. (2007).
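A paired randomization test of the kind described by Smucker et al. (2007) can be sketched with the standard library; the trial count and helper name are illustrative:

```python
import random

def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test: randomly flip the sign of
    each per-query score difference and count how often the permuted
    mean difference is at least as extreme as the observed one."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted / len(diffs)) >= observed:
            hits += 1
    return hits / trials

# Identical per-query scores can never look significant (p = 1.0).
p_same = paired_randomization_test([0.5, 0.6, 0.7], [0.5, 0.6, 0.7])
```

Because the test operates on per-query score differences, it is applied to corresponding runs, matching the run-wise significance levels reported in Tables 2-4.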

Results on Wikipedia
Across all language pairs, the integration of category embeddings shows significant improvements over the "text only" baseline. The absolute improvement is even higher when the difficulty of the task is increased by adding more irrelevant documents to the test set, emphasizing the utility of our pre-trained embeddings for the cross-lingual retrieval task. At the same time, we observe noticeably smaller gains for the English-Japanese pair, which can be explained by the considerably smaller amount of category information in the Japanese version of Wikipedia (172k categories) compared to the French and German versions (332k and 282k categories, respectively).

Experimental Setup for Patent Search
For cross-lingual prior art search, we trained a model similar to the Wikipedia model, i.e., embedding dimensions of 100 for text and 30 for meta-information, and a 2-layer similarity network with 3,200 units in the hidden layer. IPC codes were extracted from the MAREC corpus, and we apply the same train/dev/test splits as Sokolov et al. (2013). During training, we again pair all relevant patents with 4 sampled irrelevant patents to jointly learn the convolutional filters and the similarity function. As before, we evaluate our models on multiple runs with different random seeds. For the test set, which always contains all relevant patents, we varied the number of irrelevant patents from 40 to 1,000 to increase the difficulty of the task.
Our first experiment evaluates an architecture similar to the one applied to the Wikipedia data, while our second experiment combines text and meta-textual models in an ensemble. Following standard practice in CLIR, we conducted a third experiment that selects irrelevant documents based on highest tf-idf scores (Jones, 1972) between the Google-translated query and the target document (selection criterion "tf-idf"). After selecting the top-n (n = 40, 200, 400, 1000) irrelevant documents with highest tf-idf score, all relevant documents are added to the list. We then applied a standard reranking strategy as well as a weighted reranking strategy where the final score is a linear combination of tf-idf and ranking score.

Table 4: NDCG results on the Japanese-English patent retrieval task for ranking with tf-idf, text, and text+meta embeddings, and against different numbers of irrelevant documents. Average scores are calculated over six runs. For clarity, we list only the first three runs. Here, irrelevant documents are preselected per query based on high tf-idf score. The "reranking" strategy reports numbers obtained by applying the ranking models to the pre-selected list, while the "weighted reranking" strategy reports results based on linear combinations of tf-idf and ranking models. The penultimate column lists the difference to the "text avg." and "text+tf-idf" baselines for "reranking" and "weighted reranking", respectively. The last column lists the difference to plain tf-idf ranking. Significance levels are calculated for corresponding runs (1/2/3), where † denotes p < 10^−6.

Ranking Models
Our results listed in the upper half of Table 3 show gains for cross-lingual prior art search similar to our experiments integrating meta-information in CLIR on Wikipedia. The gains on our simplest test set with only 40 irrelevant documents, however, are negligible. This is confirmed by a paired randomization test (Smucker et al., 2007) to determine significance levels, where none of the three models achieved a p-value below 0.01 when compared to their corresponding "text only" systems. Increasing the number of irrelevant documents makes the task harder, and with more than 400 irrelevant documents two out of three systems are significantly better than their baseline systems. One noticeable difference to the Wikipedia results is the performance of the "meta only" system: models based solely on IPC-class embeddings ("meta average") give much higher retrieval performance than the ones based on Wikipedia category embeddings. We thus evaluated an alternative model we call "stacked" (inspired by Wolpert (1992)) that linearly combines the scores of individual "text only" and "meta only" models. Our weights were determined by grid search, but they could in principle be optimized using machine learning techniques.
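The stacked combination and its grid search can be sketched as follows. The helper names, the grid granularity, and the toy dev set are illustrative assumptions; in our experiments the quality figure would be dev-set NDCG rather than the squared error used here for compactness:

```python
def stacked_score(text_score, meta_score, alpha):
    """Linear "stacked" combination of individual model scores."""
    return alpha * text_score + (1.0 - alpha) * meta_score

def grid_search_alpha(dev_examples, evaluate, steps=21):
    """Pick the interpolation weight maximizing a dev-set quality figure;
    `evaluate` maps a scoring function and examples to that figure."""
    best_alpha, best_val = 0.0, float("-inf")
    for i in range(steps):
        alpha = i / (steps - 1)
        val = evaluate(lambda t, m: stacked_score(t, m, alpha), dev_examples)
        if val > best_val:
            best_alpha, best_val = alpha, val
    return best_alpha

# Toy dev set (text_score, meta_score, gold) where the meta score alone
# matches the gold target, so the search should drive alpha toward 0.
dev = [(1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]

def neg_sq_error(scorer, examples):
    return -sum((scorer(t, m) - gold) ** 2 for t, m, gold in examples)

best_alpha = grid_search_alpha(dev, neg_sq_error)
```

The same interpolation scheme extends to three components (tf-idf, text, and meta scores) for the weighted reranking experiments reported in Table 4.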
Interestingly, the combination of text and meta information works far better for the stacked model than for the joint model. We ported the stacking idea back to the Wikipedia retrieval task but observed only minimal changes there. We hypothesize that orthogonal information sources might not always be captured optimally in a joint model, especially in cases where each single information type already provides a strong signal. In such cases, a similarity metric that learns how to optimally compare such information across languages is beneficial. A joint similarity function learned on concatenated representations, as in the "joint" model, might blur useful properties of orthogonal information types.

Table 4 lists the results of the experiments that integrate tf-idf into the retrieval task. In these experiments, tf-idf is applied to generate subsets of the 40, 200, 400, or 1,000 highest scoring documents; the documents are not randomly sampled. The ranking models are then applied to these subsets. Due to the structure of the original BoostCLIR dataset, we could not run a ranking experiment on the full dataset, but we calculated tf-idf for all 100k+100k patents listed in the provided dev and test document files of the corpus. We then evaluated two retrieval scenarios: in our "reranking" approach, we rerank the subset of documents that were selected via tf-idf, while in our "weighted reranking" approach a linear combination of tf-idf scores and model scores is applied to this subset. In comparison to the previous results, the resulting NDCG scores are considerably lower but reflect a realistic scenario: tf-idf is used to select subsets of the top-k (k = 40, 200, 400, or 1,000) highest scoring documents, and the ranking models are then applied to these sets. Combining this strategy with standard IR techniques such as inverted indexing makes our ranking models applicable to large-scale datasets.

Scaling-up Retrieval with Weighted Reranking
The benefit of including meta information is clearly visible across all subset sizes ("∆ to text" in Table 4). It is important to note that the tf-idf score between the documents and the Google-translated queries is a strong baseline as illustrated in column "∆ to tf-idf" of Table 4. Our experiments that implement a standard reranking strategy, i.e. rerank a pre-selected subset, were only able to surpass plain tf-idf in the one case where the number of documents is rather small, i.e. 40 documents.
Thus, we reapplied the stacking idea described in the previous section and evaluated combinations of model scores and tf-idf scores. For all combinations, the weights were determined via grid search on the dev set (Figure 3). Again, we observe remarkable gains in retrieval performance when models that each provide strong signals are "stacked" together. This time, the tf-idf baseline is outperformed in all cases by a large margin of 5.3-7.9 NDCG points. The positive contribution of integrating meta information is not as large as in the standard reranking experiment, but it is still clearly evident at this relatively high performance level.

Conclusion
We presented an approach to incorporate retrieval-relevant meta-textual information into learning-to-rank models. Such information is readily available in many retrieval tasks and can be integrated by learning dense embeddings that allow for inexact matching. Our main motivation was to investigate the relative benefit of incorporating meta information into a manageable neural architecture presented in previous work (Sasaki et al., 2018). Whether more advanced and potentially better performing embedding models like BERT (Devlin et al., 2019) or ELMo (Peters et al., 2018) can be further enhanced by meta-textual information is an open question for future work.
Our results in the Wikipedia domain show that adding category information yields significant gains across several language pairs. The contribution of meta information increases with the size of the documents' lists, but is also dependent on the amount of meta information. In the patent domain, adding patent classification embeddings shows significant improvements over the textual models. Existing models that make use of the tf-idf metric can be further improved if the system components are combined properly. Integrating the tf-idf metric also makes the ranking models applicable to large scale retrieval if a machine translation system is available to translate queries into the target language.
Finally, our experiments on two different data sets and up to three language pairs showed that combining different types of information in retrieval is not straightforward. Joint learning worked particularly well for integrating noisy Wikipedia categories, while for the cleaner hierarchical IPC patent classifications we observed the largest gains when individual models are combined in an ensemble setup. On patent data, joint models were significantly better than single models, but weighted ensembles outperformed joint models by a large margin.