Effective distributed representations for academic expert search

Abstract

Expert search aims to find and rank experts based on a user's query. In academia, retrieving experts is an efficient way to navigate through a large amount of academic knowledge. Here, we study how different distributed representations of academic papers (i.e. embeddings) impact academic expert retrieval. We use the Microsoft Academic Graph dataset and experiment with different configurations of a document-centric voting model for retrieval. In particular, we explore the impact of the use of contextualized embeddings on search performance. We also present results for paper embeddings that incorporate citation information through retrofitting. Additionally, experiments are conducted using different techniques for assigning author weights based on author order. We observe that using contextual embeddings produced by a transformer model trained for sentence similarity tasks produces the most effective paper representations for document-centric expert retrieval. However, retrofitting the paper embeddings and using elaborate author contribution weighting strategies did not improve retrieval performance.


Introduction
To help navigate a large body of academic knowledge, it can be useful to identify expert individuals. Identifying such individuals may be useful to find collaborators (Zhan et al., 2011; Schleyer et al., 2012; Sziklai, 2018), to find paper reviewers (Silva, 2014; Price and Flach, 2017), to find supervisors (Alarfaj et al., 2012a), or to investigate literature in a certain domain. This process of identifying experts given a particular topic is called expert finding (Balog et al., 2009), expertise retrieval (Gonçalves and Dorneles, 2019), or expert search. Expert search systems are information retrieval systems that can automatically rank candidate experts based on their expertise on a certain subject (Husain et al., 2019). In this study, we target the domain of retrieving academic experts based on papers they authored.
Given the central role of papers in defining expertise in this domain, we focus on document-centric expert search systems (Balog et al., 2006). These systems largely rely on statistical language modeling, topic modeling, or term frequency-based approaches to represent documents (Gonçalves and Dorneles, 2019; Husain et al., 2019). Surprisingly, given the rapid advances in the field of contextualized text embeddings (Wang et al., 2020b), little work has been done in applying these approaches to document representation for this task. We hypothesize that considering single words, which is common in the bag-of-words and probabilistic term-based approaches, may significantly reduce the system's "understanding" of the underlying academic documents. To achieve a potentially deeper understanding of these papers, contextualized text embeddings could be used.
Thus, in this paper, we explore the impact of contextualized text embeddings on the performance of expert search. Specifically, we make the following contributions:
• a comparison of expert search performance using token-based (i.e. BERT (Devlin et al., 2018)) and sentence-based (Sentence-BERT (Reimers and Gurevych, 2019)) contextualized embeddings, non-contextualized embeddings (e.g. GloVe (Pennington et al., 2014)), and classic term frequency representations;
• a measurement of the impact on performance when incorporating citation information into contextualized representations through retrofitting (Faruqui et al., 2015; Zhang, 2019); and
• a comparison of two different strategies for combining embeddings of the title and abstract of papers.
Additionally, all experiments are conducted using different techniques for assigning author weightings based on author order. Overall, this paper provides evidence for the efficacy of contextualized embeddings for the task of academic expert search. Note that this paper primarily focuses on investigating the performance of various contextualized embeddings and expert ranking aggregation methods within expert retrieval, and not on the entire retrieval process. Therefore, some aspects of neural information retrieval systems such as query understanding, query expansion, or reranking are out of the scope of this study.
Source code for the methods and data processing used in this paper can be found at https://github.com/mabergerx/SDP500_expert_search. The processed data used by our methods is available at (Berger, 2020).
The rest of this paper is organized as follows. We begin with a discussion of related work. Afterwards, the data used in this study is described. This is followed by a description of the various embeddings used and our approach to author ranking. Section 6 defines the evaluation and Section 7 details its results. We then briefly describe a prototype implementation using these representations. Finally, we discuss the limitations of the work and potential future work, and conclude.

Related work
In this section, we introduce the primary paradigm for expert search. We then discuss work on voting models, document representations, and the use of text embedding techniques within expert search.
Probabilistic models A driving force behind expertise retrieval research was the launch of the TREC Enterprise Track in 2005 (Craswell et al., 2005). This evaluation campaign led to the emergence of probabilistic models, in particular in the form of language models, as the primary paradigm for expertise retrieval. The core idea behind these approaches is to estimate a language model for each document and then rank the documents by the likelihood of the user query according to the language models (Balog et al., 2009).

Voting models
We can see documents authored by experts as evidence for their expertise. A particular type of model, based on data fusion methods that aggregate document scores into expert rankings, is the voting model (Husain et al., 2019; Balog et al., 2012).
Given a query, the retrieved documents are assumed to provide evidence about a possible ranking of the authors. The aggregation of the final author list can then be modelled as a voting process, where the document scores are aggregated into author scores (Macdonald, 2009; Macdonald and Ounis, 2006a,b, 2008).
Paper embeddings Document-centric expert search systems rely on the documents to aggregate an expert ranking. However, effectively embedding longer documents is still an open research problem (Beltagy et al., 2020; Zhang et al., 2016; Liu and Lapata, 2017).
Unsupervised document embedding techniques include Sent2Vec (Pagliardini et al., 2018) and Doc2VecC (Chen, 2017), while supervised document embedding techniques include the Universal Sentence Encoder (Cer et al., 2018) and InferSent (Conneau et al., 2018). Recently, the Longformer (Beltagy et al., 2020) was proposed to embed even longer sequences of text than sentences. One study evaluated various sentence encoding techniques for re-ranking BM25-based research paper recommendations and found that sentence encoding can be beneficial in addition to BM25 retrieval, but not on its own (Hassan et al., 2019). Adding the BERT [CLS] token embedding to the signal of other ranking models has also been proposed and shown to improve the underlying neural ranking architecture (MacAvaney et al., 2019).
As for the embedding of academic papers, most of the research focuses on learning the paper embeddings using linkage information and considers this a graph problem (Wang et al., 2016; Zhang et al., 2019; Mai et al., 2018).
Embedding expertise Given the amount of research on document embedding techniques, there has been surprisingly little attention given to the application of contextualized embedding techniques in the field of expertise retrieval. Three recent surveys and reviews of the field of expertise retrieval (Gonçalves and Dorneles, 2019; Husain et al., 2019; Lin et al., 2017) contained little to no information about the application of embedding techniques.
One of the first works to introduce this concept into expertise retrieval was Author2Vec (J et al., 2016), which uses two models, the content-info model and the link-info model, within the context of the co-authorship network. In the content-info model, the text of the written papers is represented using Paragraph2Vec (Le and Mikolov, 2014).
As briefly mentioned in the introduction, authors that cite each other can be considered to have similar interests (Tho et al., 2007; Shibata et al., 2008). Zhang (Zhang, 2019) suggested using retrofitting in the domain of academic papers as a means of introducing this network information into the representation of a paper. Retrofitting is a concept introduced by Faruqui et al. (Faruqui et al., 2015) which proposes the incorporation of information from semantic lexicons such as WordNet into word embeddings.

Data description
The Microsoft Academic Graph (MAG) (Wang et al., 2020a) was used as the primary data source. The data consists of over 200 million papers (titles and abstracts) as well as a variety of metadata. We accessed the November 2018 snapshot of the MAG data through the Open Academic Graph initiative, in particular the OAG v2 release.
Due to the very large size of the MAG, we created a custom subset of the data that mainly consisted of Computer Science (CS) related papers.This domain allows us to interpret results better than other science domains.
Our approach to extracting Computer Science (CS) related papers was to take the 113,864 paper titles obtained from arXiv - a widely used preprint server - and perform exhaustive title matching on the full MAG dataset. This search resulted in 29,237 exact title matches, which corresponds to 26.6% of the arXiv data. This set provided us with a substantial initial seed of papers from which to extract more CS papers from the MAG data.
To allow retrofitting later in the process and create a larger dataset, we expanded this set with the references of all 29,237 papers, which resulted in a set of 221,347 papers. These references were retrieved by accessing the references field of each of the 29,237 papers in the MAG data. Note that these references are not necessarily complete: some cited articles may not be present in our data due to incompleteness of the source MAG data.
From these 221,347 papers, we then performed bounded stratified sampling over the authors to retrieve a subset of 5,000 authors who are representative of highly, medium, and less prolific author populations. The full sampling method is described in Algorithm 5 in the Algorithms appendix.
This set of 5,000 authors served as a starting point for a second, final round of data retrieval. For these authors, we retrieved all their papers and references, resulting in a set of 127,716 papers, which introduced additional authors (the authors of the referenced papers). For all these new authors, we collected the metadata from the MAG authors dataset and aggregated this information into a single final authors dataset. The reason for expanding the set of authors beyond the 5,000 sampled authors is that a larger pool of papers is beneficial for retrieval due to the larger search space.
For all titles and abstracts in our dataset, we performed data cleaning. Specifically, (corpus-specific) stopwords were removed, redundant whitespace and Unicode characters were normalized, and URLs and e-mail addresses were removed.
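These cleaning steps can be sketched as follows; the stopword list and the exact regular expressions are illustrative assumptions rather than the precise ones used in our pipeline:

```python
import re
import unicodedata

# Illustrative corpus-specific stopword list (an assumption).
STOPWORDS = {"the", "a", "an", "of", "and"}

def clean_text(text: str) -> str:
    # Remove URLs and e-mail addresses.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)
    # Normalize Unicode characters and collapse redundant whitespace.
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Drop stopwords.
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)

# clean_text("See  https://example.org and mail me@x.com about the model")
# → "See mail about model"
```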

Paper embedding methodology
In this section, we describe the paper embedding techniques employed and discuss our approach to embedding indexing and search, as well as retrofitting embeddings.

Embedding techniques
Various approaches have been used to embed the papers. We divide our approaches into the custom contextual approach and the baseline approaches. In all our baseline approaches, we use the concatenation of the title and the abstract to represent the paper.
Custom contextual approach We use the title and the abstract as representative texts for a paper. Although the title and abstract of a paper are both relevant representations, they may contain information that differs in importance and granularity. In order to capture the possible semantic weight differences between the title and the abstract, we deploy two different embedding combination strategies: the merge strategy and the separate strategy.
In the merge strategy, we assume that the semantic weights of the title sentence and the abstract sentences are equal. That means we take the average over the title and abstract sentence embeddings without assigning extra weight to either. A detailed specification is given in Algorithm 1 in the appendix.
In the separate strategy, we do want to differentiate between the title and the abstract. In particular, we want to assign more weight to the title than to the individual abstract sentences. We achieve this by first computing the average abstract embedding and then taking the average of that and the title embedding. A detailed specification is given in Algorithm 3 in the appendix.
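As an illustration, the two strategies can be sketched as follows, assuming the sentence embeddings have already been computed (the function names are ours, not from our implementation):

```python
import numpy as np

def merge_strategy(title_emb: np.ndarray, abstract_embs: np.ndarray) -> np.ndarray:
    """Average the title embedding together with every abstract sentence
    embedding, giving all sentences (title included) equal weight."""
    return np.vstack([abstract_embs, title_emb]).mean(axis=0)

def separate_strategy(title_emb: np.ndarray, abstract_embs: np.ndarray) -> np.ndarray:
    """First pool the abstract into one embedding, then average it with the
    title embedding, so the title effectively receives more weight."""
    abstract_mean = abstract_embs.mean(axis=0)
    return (title_emb + abstract_mean) / 2.0
```

With a title embedding of [1, 1] and three zero-vector abstract sentences, the merge strategy yields [0.25, 0.25] while the separate strategy yields [0.5, 0.5], showing the extra weight the title receives in the latter.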
We refer to the actual text embedding model as the embedder. The embedder of choice is Sentence-BERT (Reimers and Gurevych, 2019), which is specifically designed for producing meaningful sentence-level embeddings, suited for Semantic Textual Similarity (STS). Specifically, we make use of the RoBERTa-base model fine-tuned on a combination of NLI datasets, and then further fine-tuned on the STS benchmark training set.
Baseline approach: Latent Semantic Indexing (LSI) LSI (Deerwester et al., 1990) applies singular value decomposition to TF-IDF vectors. For our experiments, we set the dimensionality of the LSI vectors to 768, matching the Sentence-BERT embedding dimensionality.
Baseline approach: BERT and GloVe pooling To compare Sentence-BERT, which is specifically tuned for sentence-level representations, with conventional pooling-based document embedding techniques, we produced paper embeddings by averaging BERT and GloVe token embeddings. In both averaging operations, we perform double pooling: first, all tokens within each sentence are embedded and averaged into a single sentence embedding, and then these sentence embeddings are once again averaged into a single paper embedding. The details of this embedding process are shown in Algorithm 2 in the appendix; the process is identical for both BERT and GloVe embeddings. For both BERT (bert-base-uncased) and GloVe embedding calculations, we used the Flair (Akbik et al., 2018) Python library.
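The double pooling operation can be sketched as follows, assuming per-sentence token embeddings have already been produced by the embedder:

```python
import numpy as np

def double_pool(sentences_tokens: list) -> np.ndarray:
    """sentences_tokens: list of (n_tokens, dim) arrays of token embeddings,
    one array per sentence. Tokens are pooled into sentence embeddings, which
    are then pooled into a single paper embedding."""
    sent_embs = [tokens.mean(axis=0) for tokens in sentences_tokens]
    return np.mean(sent_embs, axis=0)
```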

Retrofitting
Authors that cite each other can be considered as having similar interests. In the context of having a semantic representation of expertise, it could be helpful to "expand" a paper embedding to broaden the expertise scope of the author beyond a particular paper. To achieve this broadening, we use a technique called retrofitting, which introduces network information into the embeddings.
Inspired by (Zhang, 2019), we adapt the original implementation of retrofitting (Faruqui et al., 2015) to work with academic papers that have been contextually embedded. The retrofitting process is performed for ten iterations. Algorithm 4 in the appendix shows the details of the algorithm.
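A sketch of the retrofitting update, following the general form of Faruqui et al. (2015): each vector is pulled towards the average of its citation neighbours while staying anchored to its original embedding. The uniform neighbour weights are an assumption; the exact weighting in our adapted implementation may differ.

```python
import numpy as np

def retrofit(embs: dict, neighbours: dict, iterations: int = 10, alpha: float = 1.0) -> dict:
    """embs: paper_id -> embedding vector.
    neighbours: paper_id -> list of cited/citing paper ids.
    Each iteration moves a paper's vector towards the mean of its neighbours
    while alpha keeps it anchored to the original embedding."""
    new = {p: v.copy() for p, v in embs.items()}
    for _ in range(iterations):
        for p, nbrs in neighbours.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            beta = 1.0 / len(nbrs)  # uniform neighbour weights (assumption)
            numerator = alpha * embs[p] + beta * sum(new[n] for n in nbrs)
            new[p] = numerator / (alpha + beta * len(nbrs))
    return new
```

Papers that cite each other are thereby drawn closer in embedding space, broadening each paper's semantic scope towards its citation context.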

Embedding storage and search
We chose the FAISS (Johnson et al., 2017) library by Facebook for our indexing purposes. It is optimized for memory usage and speed and can handle a large number of vectors.
For our embeddings, we chose the IndexHNSWFlat index. We use cosine similarity as the measure of similarity between the query embedding Q and any of the indexed embeddings V.
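The retrieval step can be illustrated with a brute-force NumPy equivalent of the cosine-similarity search that the FAISS index approximates at scale (a conceptual sketch, not the actual FAISS code):

```python
import numpy as np

def top_n(query: np.ndarray, index: np.ndarray, n: int = 5):
    """Return the indices and cosine similarities of the n most similar
    indexed vectors. Vectors are L2-normalized so that the inner product
    equals the cosine similarity, as in a FAISS inner-product index."""
    q = query / np.linalg.norm(query)
    V = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = V @ q
    order = np.argsort(-sims)[:n]
    return order, sims[order]
```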

Author ranking via voting
From the FAISS index we can, given a query, retrieve the top N most similar papers. To produce a final author ranking, we adopt a voting-model-based approach.
We can consider the retrieved paper results as the "expertise evidence" for the authors of these papers. A range of different voting approaches based on data fusion techniques has been proposed (Macdonald and Ounis, 2008; Afzal and Maurer, 2011; Alarfaj et al., 2012b) to produce an author ranking given the documents.
Each retrieved document d from the set of retrieved documents R(Q) has an associated similarity score s(d, Q) with regard to the query Q. We can then combine these document scores into aggregated author scores using the ExpCombSUM (eCS) data fusion function (Macdonald, 2009):

score(C, Q) = Σ_{d ∈ R(Q) ∩ D_C} exp(s(d, Q))

where C is a candidate expert and D_C is the set of documents associated with candidate C. This algorithm (Macdonald and Ounis, 2008; Macdonald, 2009) assumes that each document produces a static score per related author. In the case of academic papers, that is not the case, as most papers have multiple authors. These authors usually have different levels of involvement in a particular paper and, therefore, may warrant a different vote from the document, depending on their authorship role. Recent research has shown that because research is increasingly interdisciplinary, evaluating authors based on their rank within the author order is becoming increasingly difficult (Júnior et al., 2017). Therefore, it could be valuable to assign different weights to different authors of the same document. To the best of our knowledge, no previous work has exactly defined weights on the authorship scores within a voting model. We define four different weighting strategies:

1. Binary weighting. Each author gets the full score for a document. This strategy assumes that each author contributed equally.
2. Uniform weighting. Each author gets fullScore / #authors for a document. This strategy also assumes that each author contributed equally, but normalizes the score by the number of authors.
3. Descending weighting. The first author gets the full document score. Each following author gets fullScore * decayFactor, where decayFactor starts at 0.8 and decreases by 0.2 for each consecutive author. This strategy assumes that the authors are listed in descending order of involvement.
4. Parabolic weighting. The first and last author get the full document score. All authors in between follow the descending weighting. This strategy treats the first author as in the descending weighting, but also accounts for the possible importance of the last author as the project supervisor.
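A minimal sketch of the ExpCombSUM aggregation combined with the four weighting strategies; the function names and the clamping of the decay factor at zero are our assumptions, not necessarily the exact implementation:

```python
import math
from collections import defaultdict

def author_weight(position: int, n_authors: int, strategy: str = "binary") -> float:
    """Weight for the author at zero-based `position` in the author list."""
    if strategy == "binary":
        return 1.0
    if strategy == "uniform":
        return 1.0 / n_authors
    if strategy == "descending":
        # First author: full score; decay starts at 0.8 and drops by 0.2.
        return 1.0 if position == 0 else max(0.8 - 0.2 * (position - 1), 0.0)
    if strategy == "parabolic":
        # First and last author: full score; authors in between decay.
        if position == 0 or position == n_authors - 1:
            return 1.0
        return max(0.8 - 0.2 * (position - 1), 0.0)
    raise ValueError(f"unknown strategy: {strategy}")

def exp_comb_sum(retrieved, strategy: str = "binary"):
    """retrieved: list of (similarity_score, [author1, author2, ...]) pairs,
    one per retrieved document. Returns authors ranked by aggregated score."""
    scores = defaultdict(float)
    for sim, authors in retrieved:
        for pos, author in enumerate(authors):
            scores[author] += math.exp(sim) * author_weight(pos, len(authors), strategy)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

For example, with documents [(1.0, ["A", "B"]), (0.5, ["B"])] and binary weighting, B accumulates exp(1.0) + exp(0.5) and outranks A's exp(1.0).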
Using these data fusion approaches, a fairness problem may occur: highly prolific authors, who may be associated with many documents (for instance, because they are the head of a lab), may receive an unfairly large number of votes, which does not necessarily indicate their expertise. Candidate length normalization has been proposed to deal with this unfairness, just as document length normalization is often performed in document retrieval systems (Macdonald, 2009).
The use of a classical document normalization technique based on the Divergence From Randomness framework (Amati, 2003) has been proposed (Macdonald, 2009):

score_norm(C, Q) = score(C, Q) * log2(1 + α * aL / lP)

where α is a hyperparameter controlling the amount of normalization, aL is the average number of publications, and lP is the length of the profile of the candidate C. The lower the α parameter, the more less prolific authors are boosted, and the more highly prolific authors are suppressed.
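As a sketch, assuming the DFR Normalisation 2 form from Macdonald (2009), the candidate length normalization and the effect of α can be illustrated as:

```python
import math

def normalize(score: float, profile_len: int, avg_len: float, alpha: float) -> float:
    """Candidate length normalization in the DFR Normalisation 2 style:
    short profiles receive a larger multiplier than long ones, and a lower
    alpha makes that contrast (and thus the normalization) stronger."""
    return score * math.log2(1 + alpha * avg_len / profile_len)

# The boost ratio between a short (2-paper) and long (50-paper) profile
# grows as alpha shrinks, i.e. low alpha suppresses prolific authors more.
ratio_low_alpha = normalize(1.0, 2, 10, 0.1) / normalize(1.0, 50, 10, 0.1)
ratio_high_alpha = normalize(1.0, 2, 10, 10.0) / normalize(1.0, 50, 10, 10.0)
```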
In practice, we discovered that because of the nature of our dataset, applying the above normalization technique retrieves many authors from the long tail, even with high α values. Therefore, we experimented with introducing an additional term, β, into the normalization, which serves as a profile length "booster". While it introduces bias and eases the normalization, in our case it provided an extra parameter to tune and resulted in a better mix of well- and lesser-known authors.

Evaluation methods
To evaluate the different retrieval strategies, we developed a method which uses the field of work tags present in the MAG dataset for the authors as a proxy for evaluating the relevance of an author. Because we use a document-centric retrieval strategy which strictly uses only the embeddings of the paper's title and abstract, the field of work tags are not used in the retrieval process, allowing us to use these tags in the evaluation process.
The query test set for our research was selected from the full distribution of author tags. The final query test set contained one hundred Computer Science-related queries. The set is available in Table 3 in the appendix.

Relevance metrics
Before we can use any of the existing binary information retrieval metrics, we need to define a notion of relevance of an author given a query.
Based on the field of work tags, we define two relevance checks:

1. Exact topic query evaluation (Brochier et al., 2018). This approach takes a description of a topic and uses it directly as a query.
The experts associated with that topic then form the ground truth list of candidates to be retrieved. In our case, the field of work tags of the authors are used as queries, and retrieved authors are labelled relevant if they have that tag.
2. Approximate topic query evaluation. Sometimes, a query retrieves authors who have an incomplete field of work tag list or have tags which are very similar to the query but do not exactly match it. For instance, author A may have "automatic summarization" in their tag list, while the query was "automatic text summarization". This author is clearly highly relevant to the query but would be labelled as irrelevant by the exact topic query evaluation method. Therefore, we introduce a fuzzy relevance checking method which, given a query, calculates the cosine similarity between the query embedding and each of the author's tags. If any of the similarities are higher than a chosen threshold, then we deem the author relevant.
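The approximate relevance check can be sketched as follows; the query and tag embeddings are assumed to be precomputed, and the 0.8 threshold is an illustrative assumption, not the value used in our experiments:

```python
import numpy as np

def is_relevant(query_emb: np.ndarray, tag_embs: list, threshold: float = 0.8) -> bool:
    """An author counts as relevant if any of their field of work tag
    embeddings is cosine-similar enough to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    for tag in tag_embs:
        if float(np.dot(q, tag / np.linalg.norm(tag))) >= threshold:
            return True
    return False
```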
Once we can label each retrieved author as relevant or not relevant, we can use different evaluation metrics. We evaluate our system using three binary relevance metrics, all measured @N and using both the exact and approximate topic query evaluation: Mean Reciprocal Rank (MRR@N_exact, MRR@N_approx), Mean Precision @ N (MP@N_exact, MP@N_approx), and Mean Average Precision (MAP@N_exact, MAP@N_approx). Some authors are more relevant than others. To incorporate this into our evaluation, we also use the Normalized Discounted Cumulative Gain (nDCG@N) score, which is sensitive to the position of the relevant items in the produced ranking.
To produce the nDCG scores, we first need the score for the ideal ranking given a query, IDCG. To calculate the IDCG scores, for each query in our test set, we created a mapping between the query and the corresponding top authors.
Each author in such a mapping received a relevance label in relation to the query. In our implementation of author relevance, we use the citations of the relevant papers of an author as a proxy for expertise. Although no expertise measure for an author is fully objective, and many factors appear to affect citation activity (Yan et al., 2011), we chose the citation counts of the papers as a proxy for expertise for the following reasons:
• The measure is explainable.
• It prevents highly prolific but rarely cited authors from being labeled as highly relevant for multiple topics based solely on their output.

Given this final query-to-expert mapping, we precalculated the IDCG@10 score for each of the test queries in our dataset, so we could later calculate the nDCG@10 score per query. Once all the scores for our test set are calculated, we take their average to obtain a single nDCG@10 score for the current system.
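The nDCG@10 computation described above can be sketched as follows, with graded relevance labels derived from citation counts:

```python
import math

def dcg(relevances: list) -> float:
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_10(ranked_relevances: list, ideal_relevances: list) -> float:
    """DCG@10 of the produced ranking divided by the precomputed IDCG@10
    of the ideal (citation-based) ranking for the same query."""
    idcg = dcg(sorted(ideal_relevances, reverse=True)[:10])
    return dcg(ranked_relevances[:10]) / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; swapping relevant and irrelevant positions lowers the score.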

Results
In this section, we first discuss the overall quantitative evaluation results and then zoom in on the performance of retrofitting and author contribution weighting.

Voting model results
The results of the voting model approach are presented in Table 1. Here we present only the exact topic query evaluation results, as we observe that the approximate topic query evaluation results correspond to the exact results but are overall higher. In particular, for MRR, the approximate results are 0.06 higher on average; for MAP, the approximate results are 0.16 higher; for MP@10, the approximate results are 0.15 higher; and finally, the MP@5 approximate results are, again, 0.15 higher. The full results table, with the approximate query evaluation results included, is presented in Table 2 in the Appendix.
We can observe that the LSI baseline produces strong results, outperforming both embedding pooling baselines. Of the four author contribution weighting schemes used, binary score weighting performs best. However, the overall performance difference between the weightings is quite small. The best performing configuration is the separate embedding strategy with binary distributed paper scores. We see that normalizing the ExpCombSUM function with a low α and β = 0 leads to a steep decrease in performance. With higher α and β, the performance goes up, but this can be explained by the normalization effect being "cancelled out". One reason for this poor performance could be that the dataset contains many authors from the long tail; some of these authors naturally have worse metadata, resulting in missing expertise tags. Moreover, for lesser-known authors in the MAG, we encountered the problem that profiles get deleted or are assigned different author ids, which also corrupts the metadata and the results.
Retrofitting the embeddings did not improve the results, except for two evaluation metrics: MRR@10_exact and MAP@10_exact in the case of the retrofitted separate embeddings.

Performance of retrofitting
Retrofitting the embeddings did not improve our retrieval performance, with the exception of two metrics. One explanation could be that the relatively small size of our dataset hurts the retrofitting process. Combined with the variance in the number of neighbours per paper, some of the resulting retrofitted embeddings might be driven too far away from the original embedding, while other embeddings are not modified enough.

Effect of different author contribution weightings
Throughout the experiments, we observed that binary author contribution weighting performs best. Therefore, we conclude that, for our voting model configuration, introducing elaborate author contribution weightings does not improve retrieval performance.

Prototype implementation
We implemented our model in a prototype, which consists of a REST API where users can, given a search query, look up the N most relevant experts. Each individual expert representation is contained within a JSON object which contains not only the author's name and MAG id, but also the author's affiliation information (retrieved using GRID), the list of papers which voted for the author with their corresponding document scores in relation to the query, and additional author information from WikiData (Vrandečić and Krötzsch, 2014).

Discussion & Future work
This study faced multiple challenges regarding data quality, data freshness, embedding strategy considerations, and retrieval based on embeddings. In this section, we discuss these issues.
Data completeness and variability. The field of work tags used in the evaluation are not always complete and are subject to the specific format used by the MAG. In addition, for some authors, no tags are present in our snapshot of the MAG data. This affects the evaluation performance by introducing both false positives and false negatives into the relevance determination process. Overall, our system would profit significantly from a larger pool of papers and authors.
Shifting expertise. Many authors have varying interests during their academic career. Our system is inherently a snapshot of their academic activity: we only search and aggregate within a bounded set of papers. Introducing a temporal aspect into the author search, which would intelligently account for shifting expertise, could improve the retrieval results.
Representing expertise domains. Many authors are not experts in just one niche field, but rather are knowledgeable about a fairly broad field of science, with more in-depth knowledge of a few specific sub-fields. Clustering within the author's expertise to find expertise sub-clusters might help to create more nuanced author representations by taking the cluster centroids as the individual author embeddings. This approach, while interesting, requires more data, as clustering within the papers of one author requires a significant number of papers per author.
Performance of individual retrieval strategies. Different types of embeddings may perform well in different scenarios. For example, retrofitting the paper embeddings "widens" the semantic representation of a paper in the direction of its neighbours. As a result, retrofitted embeddings may perform better on broader, more general queries and worse on more specific queries. The same applies to normalization in the voting model: retrieving less prolific authors may be beneficial for some users' needs, while it technically hurts the quantitative performance of the system.

Document pooling strategies. In the two paper embedding strategies, we perform pooling over multiple sentence embeddings. While these strategies seem to work well for representing the papers in our task, it would still be interesting to use new, state-of-the-art approaches for embedding longer texts, such as the Longformer (Beltagy et al., 2020), to avoid using any pooling strategies and losing semantic value. Techniques like the recently introduced SPECTER (Cohan et al., 2020) could also be used to produce better citation-informed document embeddings. Finally, we could employ the SciBERT (Beltagy et al., 2019) model in our baseline pooling strategies, or even fine-tune a SciBERT model on longer sequences, similarly to Sentence-BERT, as it is better suited for academic texts.
Graph embeddings. In our approach, we used retrofitting to introduce citation network information into our initial paper embeddings. However, we could also go a step further and use graph embedding techniques to create native graph-based embeddings for our papers (Mai et al., 2018; Wang et al., 2016; Zhang et al., 2019).

Conclusion
In this study, we investigated different approaches for embedding academic papers and using the embeddings in an expert search task.
Overall, we found that Transformer-based contextual text embeddings work well in the domain of academic papers. By using the Sentence-BERT model trained on NLI and STS tasks, we outperformed the strong LSI baseline often employed in information retrieval systems on all ten evaluation metrics. We also outperformed the baseline strategies of average-pooled BERT and GloVe embeddings.
Employing a weighted embedding combination strategy to represent a paper can, however, be valuable, as we found that the separate embedding combination strategy outperformed the "default" merge strategy on nine out of ten metrics.
We hypothesized that enriching the paper embeddings with citation information, in a process called retrofitting, could improve retrieval performance. Our experiments did not confirm this hypothesis, as non-retrofitted embeddings performed better in our task on all but two evaluation metrics.
Finally, we employed various data fusion techniques to convert the top N retrieved papers given a query into a ranking of authors. A voting model was used, where each document served as evidence for the corresponding author's expertise. We investigated whether using author contribution weighting strategies within the voting process would improve expertise retrieval. We observed no performance gain over the "default" binary strategy.
Given the direction of this study, we think that the most suitable application areas for the methodology we proposed are reviewer finding, supervisor finding and investigating literature on a topic.The reasoning behind this is that finding a collaborator may require and involve more sophisticated information about the institution, availability and current field of interest of the found experts.The proposed three areas, however, allow for less specificity and can better benefit from the improved retrieval.
While research in the field of expertise retrieval was not as active in the second half of the 2010s as it was in the first half, the area of text representations and retrieval has seen dramatic improvements. This study was an effort to apply these new techniques to the field of expertise retrieval and has shown that substantial improvements can be made over existing retrieval algorithms. We hope that this study can contribute to a new research wave within the field of (academic) expertise retrieval.

Algorithm 1: Paper embedding creation following the merge strategy.
Result: Paper embeddings ME following the merge strategy
Input: A set of paper titles T, a set of batches of abstract sentences A (with |T| = |A|), and an embedder model Embedder
mergedEmbeddings ← []
abstractEmbeddingBatches ← []
titleEmbeddings ← Embedder.embed(T)
for aBatch ∈ A do
    batchEmbeddings ← Embedder.embed(aBatch)
    abstractEmbeddingBatches.append(batchEmbeddings)
end
assert |titleEmbeddings| = |abstractEmbeddingBatches|
for tE ∈ titleEmbeddings, aEB ∈ abstractEmbeddingBatches do
    aEB.append(tE)
    N ← |aEB|
    mergedEmbedding ← (1/N) Σ_{i ∈ aEB} i   // element-wise average of all the embeddings, resulting in one embedding
    mergedEmbeddings.append(mergedEmbedding)
end
return mergedEmbeddings

Algorithm 2: Paper embedding creation following the baseline BERT or GloVe strategy.
Result: Baseline paper embeddings ME created by either the BERT or GloVe embedding model
Input: A set of batches of abstract sentences A with the corresponding paper titles T appended element-wise (with |T| = |A|), and an embedder model Embedder (either BERT or GloVe)
embeddings ← []
abstractEmbeddingBatches ← []
for aBatch ∈ A do
    batchEmbeddings ← Embedder.embed(aBatch)
    abstractEmbeddingBatches.append(batchEmbeddings)
end
for aEB ∈ abstractEmbeddingBatches do
    N ← |aEB|
    pooledEmbedding ← (1/N) Σ_{i ∈ aEB} i   // element-wise average of all the embeddings, resulting in one embedding
    embeddings.append(pooledEmbedding)
end
return embeddings

Table 1: Results for the voting model author retrieval strategy. The best results are formatted in bold.