PACRR: A Position-Aware Neural IR Model for Relevance Matching

In order to adopt deep learning for information retrieval, models are needed that can capture all relevant information required to assess the relevance of a document to a given user query. While previous works have successfully captured unigram term matches, how to fully employ position-dependent information such as proximity and term dependencies has been insufficiently explored. In this work, we propose a novel neural IR model named PACRR aiming at better modeling position-dependent interactions between a query and a document. Extensive experiments on six years’ TREC Web Track data confirm that the proposed model yields better results under multiple benchmarks.


Introduction
Despite the widespread use of deep neural models across a range of linguistic tasks, to what extent such models can improve information retrieval (IR) and which components a deep neural model for IR should include remain open questions. In ad-hoc IR, the goal is to produce a ranking of relevant documents given an open-domain ("ad hoc") query and a document collection. A ranking model thus aims at evaluating the interactions between different documents and a query, assigning higher scores to documents that better match the query. Learning-to-rank models, like the recent IRGAN model (Wang et al., 2017), rely on handcrafted features to encode query-document interactions, e.g., the relevance scores from unsupervised ranking models. Neural IR models differ in that they extract interactions directly from the queries and documents. Many early neural IR models can be categorized as semantic matching models, as they embed both queries and documents into a low-dimensional space and then assess their similarity based on these dense representations. Examples in this regard include DSSM (Huang et al., 2013) and DESM (Mitra et al., 2016). The notion of relevance is inherently asymmetric, however, making it different from well-studied semantic matching tasks such as semantic relatedness and paraphrase detection. Instead, relevance matching models such as MatchPyramid (Pang et al., 2016), DRMM (Guo et al., 2016) and the recent K-NRM (Xiong et al., 2017) resemble traditional IR retrieval measures in that they directly consider the relevance of documents' contents with respect to the query. The DUET model (Mitra et al., 2017) is a hybrid approach that combines signals from a local model for relevance matching and a distributed model for semantic matching. The two classes of models are fairly distinct. In this work, we focus on relevance matching models.
Given that relevance matching approaches mirror ideas from traditional retrieval models, the decades of research on ad-hoc IR can guide us with regard to the specific kinds of relevance signals a model ought to capture. Unigram matches are the most obvious signals to be modeled, as a counterpart to the term frequencies that appear in almost all traditional retrieval models. Beyond this, positional information, including where query terms occur and how they depend on each other, can also be exploited, as demonstrated in retrieval models that are aware of term proximity (Tao and Zhai, 2007) and term dependencies (Huston and Croft, 2014; Metzler and Croft, 2005). Query coverage is another factor that can be used to ensure that, for queries with multiple terms, top-ranked documents contain multiple query terms rather than emphasizing only one. For example, given the query "dog adoption requirements", unigram matching signals correspond to occurrences of the individual terms "dog", "adoption", or "requirements". When considering positional information, text passages with "dog adoption" or "requirements for dog adoption" are highlighted, distinguishing them from text that only includes individual terms. Query coverage, meanwhile, further emphasizes that matching signals for "dog", "adoption", and "requirements" should all be present in a document.
Similarity signals from unigram matches are taken as input by DRMM (Guo et al., 2016) after being summarized as histograms, whereas K-NRM (Xiong et al., 2017) directly digests a query-document similarity matrix and summarizes it with multiple kernel functions. As for positional information, both the MatchPyramid (Pang et al., 2016) and local DUET (Mitra et al., 2017) models account for it by incorporating convolutional layers over similarity matrices between queries and documents. Although this leads to more complex models, both have difficulty significantly outperforming the DRMM model (Guo et al., 2016; Mitra et al., 2017). This indicates that it is non-trivial to go beyond unigrams by utilizing positional information in deep neural IR models. Intuitively, unlike in standard sequence-based models, the interactions between a query and a document are sequential along the query axis as well as along the document axis, making the problem multi-dimensional in nature. In addition, this makes it non-trivial to combine matching signals from different parts of the documents and over different query terms. In fact, we argue that both the MatchPyramid and local DUET models fail to fully account for one or more of the aforementioned factors. For example, as a pioneering work, MatchPyramid is mainly motivated by models developed in computer vision, and thus disregards certain IR-specific considerations in the design of its components, such as pooling sizes that ignore the query and document dimensions. Meanwhile, local DUET's CNN filters match entire documents against individual query terms, neglecting proximity and possible dependencies among different query terms.
We conjecture that a suitable combination of convolutional kernels and recurrent layers can lead to a model that better accounts for these factors. In particular, we present a novel re-ranking model called PACRR (Position-Aware Convolutional-Recurrent Relevance Matching). Our approach first produces similarity matrices that record the semantic similarity between each query term and each individual term occurring in a document. These matrices are then fed through a series of convolutional, k-max pooling, and recurrent layers so as to capture interactions corresponding to, for instance, bigram and trigram matches, and finally to aggregate the signals into global relevance assessments. In our model, the convolutional layers are designed to capture both unigram matching and positional information over text windows of different lengths; the k-max pooling layers operate along the query dimension, preserving matching signals for each query term; and the recurrent layer combines signals from different query terms to produce a query-document relevance score.

Organization. The rest of this paper unfolds as follows. Section 2 describes our approach for computing similarity matrices and the architecture of our deep learning model. The setup and results of our extensive experimental evaluation can be found in Section 3, before we conclude in Section 4.

The PACRR Model
We now describe our proposed PACRR approach, which consists of two main parts: a relevance matching component that converts each query-document pair into a similarity matrix sim_{|q|×|d|}, and a deep architecture that takes a given query-document similarity matrix as input and produces a query-document relevance score rel(q, d). Note that in principle the proposed model can be trained end to end by backpropagating through the word embeddings, as in (Xiong et al., 2017). In this work, however, we focus on highlighting the building blocks that capture positional information, and freeze the word embedding layer for better efficiency. The pipeline is summarized in Figure 1.

Relevance Matching
We first encode query-document relevance matching via similarity matrices sim_{|q|×|d|} that encode the similarity between terms from a query q and a document d, where sim_ij corresponds to the similarity between the i-th term of q and the j-th term of d. When using cosine similarity, we have sim ∈ [−1, 1]^{|q|×|d|}. As suggested in (Hui et al., 2017), query-document similarity matrices preserve a rich signal that can be used to perform relevance matching beyond unigram matches. In particular, n-gram matching corresponds to consecutive document terms that are highly similar to at least one of the query terms. Query coverage is reflected in the number of rows in sim that include at least one cell with high similarity. The similarity between a query term and a document term is calculated as the cosine similarity of their pre-trained word2vec (Mikolov et al., 2013) embeddings. The subsequent processing in PACRR's convolutional layers requires that each query-document similarity matrix have the same dimensionality. Given that the lengths of queries and documents vary, we first transform the raw similarity matrices sim_{|q|×|d|} into matrices sim_{l_q×l_d} with uniform numbers of rows l_q and columns l_d. We unify the query dimension by zero padding it to the maximum query length l_q. With regard to the document dimension l_d, we describe two strategies: firstk and kwindow.
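As a minimal sketch, the similarity matrix construction described above can be written with NumPy as follows; the per-term embedding vectors are placeholder inputs standing in for the pre-trained word2vec embeddings.

```python
import numpy as np

def cosine_sim_matrix(query_vecs, doc_vecs):
    """Build the |q| x |d| similarity matrix sim, where sim[i, j] is the
    cosine similarity between the i-th query term and the j-th document
    term. Inputs are per-term embedding arrays of shape (n_terms, dim)."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return q @ d.T  # entries lie in [-1, 1]
```

In practice the embedding lookup would map each term to its word2vec vector before calling this function.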
PACRR-firstk. Akin to (Mitra et al., 2017), the firstk distillation method simply keeps the first k columns of the matrix, which correspond to the first k terms of the document. If k > |d|, the remaining columns are zero padded.
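The firstk strategy is straightforward to sketch:

```python
import numpy as np

def firstk(sim, l_d):
    """firstk distillation: keep the first l_d document columns of the
    raw similarity matrix; zero-pad on the right when the document has
    fewer than l_d terms."""
    n_q, n_d = sim.shape
    out = np.zeros((n_q, l_d), dtype=sim.dtype)
    out[:, :min(n_d, l_d)] = sim[:, :l_d]
    return out
```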

PACRR-kwindow.
As suggested in (Guo et al., 2016), relevance matching is local: document terms with low similarity to the query relative to a document's other terms cannot contribute substantially to the document's relevance score. Relevance signals can thus be extracted from the pieces of text that contain relevant information. That is, one can segment documents according to their relevance relative to the given query and retain only the text that is highly relevant to it. Given this observation, we prune query-document similarity cells with low similarity scores. In the case of unigrams, we simply choose the top l_d terms with the highest similarity to the query terms. For text snippets of length n beyond unigrams, we produce a similarity matrix sim^n_{l_q×l_d} for each query-document pair and each n, because the n consecutive terms must later be considered together. For each length-n text snippet in the document, kwindow computes the maximum similarity between each term and the query terms, and then averages these maxima over the n-term window. It then selects the top k = l_d/n windows by average similarity and discards all other terms in the document. The document dimension is zero padded if fewer than k windows exist. When a convolutional layer later operates on a similarity matrix produced by kwindow, its stride along the document dimension is set to n, so that it only considers at most n consecutive terms that were adjacent in the original document. This variant's output is a similarity matrix sim^n_{l_q×l_d} for each size n.
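The window scoring and selection can be sketched as below. This is a simplified reading of the description above: windows are scored by the average, over the window, of each term's maximum similarity to any query term, and the top k windows are kept in document order. Tie-breaking and overlap handling may differ from the original implementation.

```python
import numpy as np

def kwindow(sim, l_d, n):
    """kwindow distillation sketch: keep the top k = l_d // n windows of
    n consecutive document terms, ranked by average max query similarity,
    and zero-pad the remainder of the l_d columns."""
    n_q, n_d = sim.shape
    k = l_d // n
    term_max = sim.max(axis=0)  # best query-term match per document term
    scores = [term_max[s:s + n].mean() for s in range(n_d - n + 1)]
    best = sorted(sorted(range(len(scores)), key=lambda s: -scores[s])[:k])
    out = np.zeros((n_q, l_d), dtype=sim.dtype)
    for i, s in enumerate(best):
        out[:, i * n:(i + 1) * n] = sim[:, s:s + n]
    return out
```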

Deep Retrieval Model
Given a query-document similarity matrix sim_{l_q×l_d} as input, our deep architecture relies on convolutional layers to match text snippets of length n from the query and the document, producing similarity signals for each n. Subsequently, two consecutive max pooling layers extract the document's strongest similarity cues for each n. Finally, a recurrent layer aggregates these salient signals to predict a global query-document relevance score rel(q, d).
Convolutional relevance matching over local text snippets. The purpose of this step is to match text snippets of different lengths from the query and the document, given their query-document similarity matrix as input. This is accomplished by applying multiple two-dimensional convolutional layers with different kernel sizes to the input similarity matrix. Each convolutional layer is responsible for a specific n; by applying its kernel to n×n windows, it produces a similarity signal for each window. When the firstk method is used, each convolutional layer receives the same similarity matrix sim_{l_q×l_d} as input, because firstk produces the same similarity matrix regardless of n. When the kwindow method is used, the convolutional layer with an n×n kernel receives the corresponding similarity matrix sim^n_{l_q×l_d}. We use l_g − 1 different convolutional layers with kernel sizes 2×2, 3×3, ..., l_g×l_g, corresponding to bigram, trigram, ..., l_g-gram matching, respectively, where the length of the longest text snippet to consider is governed by a hyper-parameter l_g. The original similarity matrix covers unigram matching, while a convolutional layer with kernel size n×n is responsible for capturing matching signals on n-term text snippets. Each convolutional layer applies n_f different filters to its input, where n_f is another hyper-parameter. We use a stride of (1, 1) for the firstk distillation method, meaning that the convolutional kernel advances one step at a time in both the query and document dimensions. For the kwindow distillation method, we use a stride of (1, n) to move the convolutional kernel one step at a time in the query dimension, but n steps at a time in the document dimension. This ensures that the convolutional kernel only operates over consecutive terms that existed in the original document. We thus end up with l_g − 1 tensors C^n_{l_q×l_d×n_f}, and the original similarity matrix is directly employed to handle the signals over unigrams.
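The following toy NumPy sketch illustrates the shape of this step for the firstk case (stride (1, 1), zero padding so outputs keep l_q × l_d). It uses random untrained filters purely to show the data flow; PACRR's actual layers are learned convolutions in Keras.

```python
import numpy as np

def ngram_match_signals(sim, l_g, n_f, rng=None):
    """Apply n_f random n x n filters for each n = 2..l_g over the
    similarity matrix with zero padding, yielding one (l_q, l_d, n_f)
    tensor per n-gram size. The filters here are random placeholders
    for the trained convolutional kernels."""
    rng = rng or np.random.default_rng(0)
    l_q, l_d = sim.shape
    outputs = {}
    for n in range(2, l_g + 1):
        filters = rng.standard_normal((n_f, n, n))
        padded = np.zeros((l_q + n - 1, l_d + n - 1))
        padded[:l_q, :l_d] = sim
        C = np.zeros((l_q, l_d, n_f))
        for f in range(n_f):
            for i in range(l_q):
                for j in range(l_d):
                    C[i, j, f] = np.sum(padded[i:i + n, j:j + n] * filters[f])
        outputs[n] = C
    return outputs
```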
Two max pooling layers. The purpose of this step is to capture the n_s strongest similarity signals for each query term. Measuring the similarity signals separately for each query term allows the model to consider query term coverage, while keeping the n_s strongest signals per query term allows the model to consider signals from different kinds of relevance matching patterns, e.g., n-gram matching and non-contiguous matching. In practice, we use a small n_s to prevent the model from being biased by document length: while each similarity matrix contains the same number of document term scores, longer documents have more opportunity to contain terms that are similar to query terms. To capture the strongest n_s similarity signals for each query term, we first perform max pooling over the filter dimension n_f to keep only the strongest signal from the n_f different filters, under the assumption that only one true matching pattern exists in a given n×n window; this differs in purpose from pooling in other tasks, such as sub-sampling in computer vision. We then perform k-max pooling (Kalchbrenner et al., 2014) over the query dimension to keep the strongest n_s similarity signals for each query term. Both pooling steps are performed on each of the l_g − 1 tensors C^n from the convolutional layers and on the original similarity matrix, which captures unigram matching, producing the three-dimensional tensor P_{l_q×l_g×n_s}. This tensor contains the n_s strongest signals for each query term and each n-gram size across all n_f filters.
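The two pooling steps can be sketched as follows: a max over the filter dimension, then a per-query-term k-max, stacked into the tensor P. This is a shape-level sketch, not the Keras implementation.

```python
import numpy as np

def pool_signals(conv_outputs, sim, n_s):
    """Filter-max pooling followed by row-wise k-max pooling, assembling
    the tensor P of shape (l_q, l_g, n_s). conv_outputs maps n-gram size
    n -> array (l_q, l_d, n_f); the raw similarity matrix sim covers the
    unigram case (assumes l_d >= n_s)."""
    per_size = [np.sort(sim, axis=1)[:, ::-1][:, :n_s]]   # unigrams: k-max per row
    for n in sorted(conv_outputs):
        strongest = conv_outputs[n].max(axis=2)           # max over the n_f filters
        per_size.append(np.sort(strongest, axis=1)[:, ::-1][:, :n_s])
    return np.stack(per_size, axis=1)                     # (l_q, l_g, n_s)
```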
Recurrent layer for global relevance. Finally, our model transforms the query term similarity signals in P_{l_q×l_g×n_s} into a single document relevance score rel(q, d). It achieves this by applying a recurrent layer to P, taking a sequence of vectors as input and learning weights to transform them into the final relevance score. More precisely, akin to (Guo et al., 2016), the IDF of each query term q_i is passed through a softmax layer for normalization. Thereafter, we split up the query term dimension to obtain a matrix P_{l_g×n_s} for each query term q_i, and form the recurrent layer's input by flattening each such matrix into a vector (concatenating its rows) and appending query term q_i's normalized IDF to the end of the vector. This sequence of vectors, one per query term, is passed into a Long Short-Term Memory (LSTM) recurrent layer (Hochreiter and Schmidhuber, 1997) with an output dimensionality of one. That is, the LSTM's input is a sequence of query term vectors, where each vector is composed of the aforementioned salient signals for the query term over the different kernel sizes and the query term's normalized IDF. The LSTM's output is then used as our document relevance score rel(q, d).
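The construction of the recurrent layer's input sequence can be sketched as below; the LSTM itself is omitted, since only the per-step vector layout is being illustrated.

```python
import numpy as np

def build_lstm_inputs(P, idf):
    """Form the recurrent layer's input sequence: for each query term,
    flatten its (l_g, n_s) slice of P row by row and append the term's
    softmax-normalized IDF. Returns an array of shape (l_q, l_g*n_s + 1),
    one input vector per LSTM step."""
    idf = np.asarray(idf, dtype=float)
    norm_idf = np.exp(idf) / np.exp(idf).sum()   # softmax over query terms
    flat = P.reshape(P.shape[0], -1)             # concatenate each slice's rows
    return np.hstack([flat, norm_idf[:, None]])
```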
Training objective. Our model is trained on triples consisting of a query q, a relevant document d^+, and a non-relevant document d^-, minimizing a standard pairwise max margin loss as in Eq. 1:

loss(q, d^+, d^-) = max(0, 1 − rel(q, d^+) + rel(q, d^-)).  (1)
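The standard pairwise max margin (hinge) loss referenced here is, per triple:

```python
def pairwise_hinge_loss(rel_pos, rel_neg, margin=1.0):
    """Pairwise max margin loss: max(0, margin - rel(q, d+) + rel(q, d-)).
    The loss is zero once the relevant document outscores the non-relevant
    one by at least the margin."""
    return max(0.0, margin - rel_pos + rel_neg)
```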

Evaluation
In this section, we empirically evaluate the PACRR models using manual relevance judgments from the standard TREC Web Track. We compare them against several state-of-the-art neural IR models, including DRMM (Guo et al., 2016), DUET (Mitra et al., 2017), MatchPyramid (Pang et al., 2016), and K-NRM (Xiong et al., 2017). The comparisons cover three task settings: re-ranking search results from a simple initial ranker (RERANKSIMPLE); re-ranking all runs from the TREC Web Track (RERANKALL); and examining the neural IR models' classification accuracy on document pairs (PAIRACCURACY).
Training. At each step, we perform stochastic gradient descent (SGD) with a mini-batch of 32 triples. To choose the triples, we consider all documents judged with a label more relevant than Rel as highly relevant, and put the remaining relevant documents into a relevant group. To pick each triple, we sample a relevance group with probability proportional to the number of documents in the group within the training set, and then randomly sample a document with the chosen label to serve as the positive document d^+. If the chosen group is the highly relevant group, we randomly sample a document from the relevant group to serve as the negative document d^-. If the chosen group is the relevant group, we randomly sample a non-relevant document as d^-. This sampling procedure ensures that we differentiate between highly relevant documents (i.e., those with a relevance label of HRel, Key or Nav) and relevant documents (i.e., those labeled as Rel). Training continues until a given number of iterations is reached, and the model is saved at every iteration. We use the model with the best ERR@20 on the validation set to make predictions. Proceeding in a round-robin manner, we report test results on one year while using the remaining five years (250 queries) for training. From these 250 queries, we reserve 50 random queries as a held-out set for validation and hyper-parameter tuning, while the remaining 200 queries serve as the actual training set.
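The triple sampling scheme above can be sketched as follows; the document identifiers are placeholders, and the two-group structure (highly relevant vs. relevant, each paired with the next lower group for d^-) follows the description in the text.

```python
import random

def sample_triple(highly_rel, rel, nonrel, rng=random):
    """Sample one (d+, d-) pair: choose between the highly relevant and
    relevant groups with probability proportional to group size, draw d+
    from the chosen group, and draw d- from the next lower group."""
    groups = [highly_rel, rel]
    g = rng.choices([0, 1], weights=[len(highly_rel), len(rel)])[0]
    d_pos = rng.choice(groups[g])
    d_neg = rng.choice(rel if g == 0 else nonrel)
    return d_pos, d_neg
```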
As mentioned, model parameters and training iterations are chosen by maximizing the ERR@20 on the validation set. The selected model is then used to make predictions on the test data. An example of this training procedure is shown in Figure 2. Four hyper-parameters govern the behavior of the proposed PACRR-kwindow and PACRR-firstk: the unified length of the document dimension l_d, the k-max pooling size n_s, the maximum n-gram size l_g, and the number of filters used in the convolutional layers n_f. Due to limited computational resources, we determine the range of hyper-parameters to consider based on pilot experiments and domain insights. In particular, we evaluate l_d ∈ {256, 384, 512, 640, 768}, n_s ∈ {1, 2, 3, 4}, and l_g ∈ {2, 3, 4}. Due to the limited number of possible matching patterns for small kernel sizes (e.g., l_g = 3), n_f is fixed to 32. For PACRR-firstk, we intuitively wish to retain as much information as possible from the input, and thus l_d is always set to 768. DRMM (DRMM_{LCH×IDF}), DUET, MatchPyramid and K-NRM are trained under the same settings using the hyper-parameters described in their respective papers. In particular, as our focus is on deep relevance matching models as discussed in Section 1, we only compare against DUET's local model, denoted as DUETL. In addition, K-NRM is trained slightly differently from the description in (Xiong et al., 2017), namely, with a frozen word embedding layer. This guarantees a fair comparison with the other models, given that most of the compared models could be enhanced by co-training the embedding layers, whereas the focus here is the strength of the model architecture. A fully connected middle layer with 30 neurons is added to compensate for the reduction of trainable parameters in K-NRM, mirroring the size of DRMM's first fully connected layer.
All models are implemented with Keras (Chollet et al., 2015) using TensorFlow as the backend, and are trained on servers with multiple CPU cores. In particular, training PACRR takes 35 seconds per iteration on average, and each model variant is trained for at most 150 iterations.

RERANKSIMPLE.
We first examine the proposed model by re-ranking the search results from the QL baseline on Web Track 2012-14. The results are summarized in Table 1. It can be seen that DRMM significantly improves over QL on WT12 and WT14, whereas MatchPyramid fails on WT12 under ERR@20. While DUETL and K-NRM consistently outperform QL, the two variants of PACRR are the only models that achieve significant improvements at a 95% significance level on all years under both ERR@20 and nDCG@20. More remarkably, by solely re-ranking the search results from QL, PACRR-firstk already ranks within the top-3 participating systems in all three years as measured by both ERR and nDCG. The re-ranked search results from PACRR-kwindow also rank within the top-5 based on nDCG@20. On average, both PACRR-kwindow and PACRR-firstk achieve 60% improvements over QL.
Table 1: ERR@20 and nDCG@20 on TREC Web Track 2012-14 when re-ranking search results from QL. The comparisons are conducted between two variants of PACRR and DRMM (D/d), DUETL (L/l), MatchPyramid (M/m) and K-NRM (K/k). All methods are compared against the QL (Q/q) baseline. The upper/lower-case characters in brackets indicate a significant difference under two-tailed paired Student's t-tests at 95% or 90% confidence levels, respectively, relative to the corresponding approach. In addition, the relative ranks among all runs within the respective years according to ERR@20 and nDCG@20 are reported directly after the absolute scores.

RERANKALL. In this part, we further examine the performance of the proposed models when re-ranking different sets of search results. To this end, we extend our analysis to re-rank the search results from all runs submitted to six years of the TREC Web Track ad-hoc task. In particular, we only consider the judged documents from TREC, which loosely correspond to the top-20 documents in each run. The tested models make predictions for individual documents, which are used to re-rank the documents within each submitted run. Given that there are about 50 runs for each year, it is no longer feasible to list the scores for each re-ranked run. Instead, we summarize the results by comparing the performance of each run before and after re-ranking, and provide statistics over each year in Table 2. In the top portion of Table 2, we report the relative changes in metrics before and after re-ranking as percentages ("average Δ measure score"). In the bottom portion, we report the percentage of systems whose results improved after re-ranking.

Table 2: The average statistics when re-ranking all runs from the TREC Web Track 2009-14 based on ERR@20 and nDCG@20. The average differences of the scores for individual runs are reported in the top portion. The comparisons are conducted between two variants of PACRR and DRMM (D/d), DUETL (L/l), MatchPyramid (M/m) and K-NRM (K/k). The upper/lower-case characters in parentheses indicate a significant difference under two-tailed paired Student's t-tests at 95% or 90% confidence levels, respectively, relative to the corresponding approach. The percentage of runs that show improvements in terms of a measure is summarized in the bottom portion.

Note that these results assess two different aspects: the average Δ measure score in Table 2 captures the degree to which a model can improve an initial run, while the percentage of improved runs indicates to what extent an improvement can be achieved over runs from different systems. In other words, the former measures the strength of the models, while the latter measures their adaptability. Both PACRR variants improve upon existing rankings by at least 10% across different years. Remarkably, in terms of nDCG@20, at least 80% of the submitted runs are improved after re-ranking by the proposed models on individual years, and on 2010-12, all submitted runs are consistently improved by PACRR-firstk. Moreover, both variants of PACRR significantly outperform all baseline models on at least three out of the six years in terms of average improvement. However, it is clear that none of the tested models makes consistent improvements over all submitted runs across all six years. In other words, there still exist document pairs whose predicted order contradicts the judgments from TREC. Thus, in the next part, we further investigate performance in terms of predictions over document pairs.

Table 3: Comparison among the tested methods in terms of accuracy when comparing document pairs with different labels. The "volume" column indicates the percentage of occurrences of each label combination out of the total pairs. The "# Queries" column records the number of queries that include a particular label combination. The comparisons are conducted between two variants of PACRR and DRMM (D/d), DUETL (L/l), MatchPyramid (M/m) and K-NRM (K/k). The upper/lower-case characters in parentheses indicate a significant difference under two-tailed paired Student's t-tests at 95% or 90% confidence levels, respectively, relative to the corresponding approach. In the last row, the average accuracy over the different label combinations is computed, weighted by their corresponding volume.

PAIRACCURACY. The ranking of documents can be decomposed into rankings of document pairs, as suggested in (Radinsky and Ailon, 2011). Specifically, a model's retrieval quality can be examined across a range of individual document pairs, namely, how likely the model is to assign a higher score to the more relevant document. This makes it possible to compare different models over the same set of complete judgments, removing the issue of differing initial runs. Moreover, although ranking is our ultimate target, a direct inspection of pairwise prediction results can indicate which kinds of document pairs a model succeeds at or fails on. We first convert the graded judgments from TREC into ranked document pairs by comparing their labels. Document pairs are created among documents that have different labels.
A prediction is counted as correct if it assigns a higher score to the document in the pair that is labeled with a higher degree of relevance. The judgments from TREC contain at most six relevance levels, and we merge and unify the original levels from the six years into four grades, namely Nav, HRel, Rel and NRel. We compute the accuracy for each pair of labels. The statistics are summarized in Table 3. The "volume" column lists the percentage of a given label combination out of all document pairs, and the "# Queries" column provides the number of queries for which the label combination exists. In Table 3, we observe that both PACRR models always perform better than all baselines on the label combinations HRel vs. NRel, Rel vs. NRel and Nav vs. NRel, which together cover 90% of all document pairs. Meanwhile, apart from Nav vs. Rel, there is no significant difference when distinguishing Nav from the other types. K-NRM and DRMM perform better than the other two baseline models.
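The pairwise accuracy measure can be sketched as follows; how the original evaluation handles score ties is not specified in the text, so ties simply count as incorrect here.

```python
def pair_accuracy(scores, labels):
    """Fraction of document pairs with different relevance labels where
    the higher-labeled document receives the higher model score.
    scores: dict doc_id -> model score; labels: dict doc_id -> grade
    (higher grade means more relevant)."""
    docs = list(scores)
    correct = total = 0
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            a, b = docs[i], docs[j]
            if labels[a] == labels[b]:
                continue  # pairs with equal labels are skipped
            hi, lo = (a, b) if labels[a] > labels[b] else (b, a)
            total += 1
            correct += scores[hi] > scores[lo]
    return correct / total if total else 0.0
```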

Discussion
Hyper-parameters. As mentioned, models are selected based on ERR@20 over validation data. Hence, it is sufficient to use a reasonable and representative validation dataset, rather than hand-picking a specific set of parameter settings. However, to gain a better understanding of the influence of different hyper-parameters, we explore PACRR-kwindow's effectiveness when several hyper-parameters are varied. The results when re-ranking QL search results are given in Figure 3. The results are reported based on the models with the highest validation scores after fixing certain hyper-parameters. For example, the ERR@20 in the leftmost plot is obtained when fixing l_d to the values shown. The crosses in Figure 3 correspond to the models that were selected for use on the test data, based on their validation set scores. It can be seen that the selected models are not necessarily the best models on the test data, as evidenced by the differences between validation and test results, but we consistently obtain scores within a reasonable margin. Owing to space constraints, we omit the plots for PACRR-firstk.
Choice between kwindow and firstk approaches.
As mentioned, both PACRR-kwindow and PACRR-firstk serve to address the variable-length challenge for documents and queries, and to make training feasible and more efficient. In general, if both training and test documents are known to be short enough to fit in memory, then PACRR-firstk can be used directly. Otherwise, PACRR-kwindow is a reasonable choice that provides comparable results. Alternatively, one can regard this choice as another hyper-parameter and make a selection based on held-out validation data.

Figure 3: The ERR@20 of re-ranked QL with PACRR-kwindow when applying different hyper-parameters: l_d, n_s and l_g. The x-axis reflects the settings for the hyper-parameters, and the y-axis is the ERR@20. Crosses correspond to the selected models.
Accuracy in PAIRACCURACY. Beyond the observations in Section 3.2, we further examine the methods' accuracy over binary judgments by merging the Nav, HRel and Rel labels. The accuracies become 73.5%, 74.1% and 67.4% for PACRR-kwindow, PACRR-firstk, and DRMM, respectively. Note that the manual judgments indicating whether a document is relevant or non-relevant to a given query contain disagreements (Carterette et al., 2008; Voorhees, 2000) and errors (Alonso and Mizzaro, 2012). In particular, a 64% agreement (cf. Table 2(b) therein) is observed over the inferred relative order among document pairs based on graded judgments from six trained judges (Carterette et al., 2008). When reproducing TREC judgments, Al-Maskari et al. (2008) reported a 74% agreement (cf. Table 1 therein) with the original judgments from TREC when a group of users re-judged 56 queries on the TREC-8 document collections. Meanwhile, Alonso and Mizzaro (2012) observed a 77% agreement relative to judgments from TREC when collecting judgments via crowdsourcing. Therefore, the more than 73% agreement achieved by both PACRR methods is close to the aforementioned agreement levels among different human assessors. However, when distinguishing Nav, HRel, and Rel, the tested models still fall significantly short of the human judges' agreement levels. These distinctions are important for a successful ranker, especially when measuring with graded metrics such as ERR@20 and nDCG@20. Hence, further research is needed to better discriminate among relevant documents with different degrees of relevance. In addition, as for the distinction between Nav documents and Rel or HRel documents, we argue that since Nav indicates that a document mainly satisfies a navigational intent, such documents are qualitatively different from Rel and HRel documents. Specifically, a Nav document is more relevant for a user with navigational intent, whereas for other users it may in some cases be less useful than a document that directly includes highly pertinent information content. Therefore, we hypothesize that further improvements can be obtained by introducing a classifier for user intents, e.g., navigational pages, before employing neural IR models.

Conclusion
In this work, we have demonstrated the importance of preserving positional information in neural IR models by incorporating domain insights into the proposed PACRR model. In particular, PACRR captures term dependencies and proximity through multiple convolutional layers with different kernel sizes. Thereafter, following two max pooling layers, it combines salient signals over different query terms with a recurrent layer. Extensive experiments show that PACRR substantially outperforms four state-of-the-art neural IR models on TREC Web Track ad-hoc datasets and can dramatically improve search results when used as a re-ranking model.

Figure 1: The pipeline of PACRR. Each query q and document d is first converted into a query-document similarity matrix sim_{|q|×|d|}. Thereafter, a distillation method (firstk is displayed) transforms the raw similarity matrix into unified dimensions, namely sim_{l_q×l_d}. Next, l_g − 1 convolutional layers (CNN) are applied to the distilled similarity matrices; as l_g = 3 is shown, layers with kernel sizes 2 and 3 are applied. Max pooling then leads to l_g matrices C^1 ... C^{l_g}. Following this, n_s-max pooling captures the strongest n_s signals for each query term and n-gram size; the case n_s = 2 is shown here. Finally, the similarity signals from different n-gram sizes are concatenated, the query terms' normalized IDFs are appended, and a recurrent layer combines these signals for each query term into a query-document relevance score rel(q, d).

Figure 2: The training loss, ERR@20 and nDCG@20 per iteration on validation data when training on Web Track 2010-14. The x-axis denotes the iteration. The y-axis indicates ERR@20/nDCG@20 (left) and the loss (right). The best performance appears at the 109th iteration with ERR@20 = 0.242. The lowest training loss (0.767) occurs after 118 iterations.
