Retrieval of Research-level Mathematical Information Needs: A Test Collection and Technical Terminology Experiment

In this paper, we present a test collection for mathematical information retrieval composed of real-life, research-level mathematical information needs. Topics and relevance judgements have been procured from the on-line collaboration website MathOverflow by delegating domain-specific decisions to experts on-line. With our test collection, we construct a baseline using Lucene's vector-space model implementation and conduct an experiment to investigate how prior extraction of technical terms from mathematical text can affect retrieval efficiency. We show that by boosting the importance of technical terms, statistically significant improvements in retrieval performance can be obtained over the baseline.

Recent interest in mathematical information retrieval (MIR) has prompted the construction of the NTCIR Math IR test collection (Aizawa et al., 2013). Like many general-purpose, domain-specific IR test collections, the NTCIR collection is composed of broad queries intended to test systems over a wide spectrum of query complexity.
In this paper we present a test collection composed of real-life, research-level mathematical topics and associated relevance judgements procured from the online collaboration website MathOverflow (http://mathoverflow.net/). The resulting test collection contains 160 atomic questions derived from 120 MathOverflow discussion threads.
Topics in our test collection capture specialised information needs that are complex to resolve and often demand collective effort from multiple domain experts. For example:
The "most symmetric" Mukai-Umemura 3-fold with automorphism group PGL(2, C) admits ...
Due to their specialised nature, our topics have a relatively small number of relevant documents. Fortunately, there is precedent for this in IR tasks such as QA (Ishikawa et al., 2010) and known-item search (Craswell et al., 2003).
With our test collection, we construct a baseline using Lucene's default implementation of the vector space model (VSM). Additionally, we conduct an experiment designed to investigate the hypothesis that technical terms in mathematics have elevated retrieval significance.
Information in mathematics is communicated by defining, manipulating and otherwise operating on mathematical structures and objects which can be instantiated in the mathematical discourse. In this sense, technical terminology in mathematics has an elevated role. This hypothesis stems from the observation that the mathematical discourse is dense with named mathematical objects, structures, properties and results.
In the next section, we present our test collection and discuss the procedure for its construction from crowd-sourced expertise on MathOverflow. In section 3, we discuss related material in the literature and compare it to our work. Our experimental setup and results are discussed in section 4, with a brief summary of our work presented in section 5.

The Test Collection
The main motivation behind this work comes from our long-term goal to develop and evaluate MIR models intended to satisfy research-level mathematical information needs. Evaluation is an important final step in the development of IR models and is preconditioned on the availability of a test collection.
A test collection is a resource composed of (1) a document collection (or corpus) with uniquely identifiable documents (e.g., scientific papers, news articles), (2) a set of topics from which search queries can be produced and (3) a set of relevance judgements: pairs connecting individual topics to documents (in the corpus) known to satisfy the corresponding information need.
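The three components listed above can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names and identifiers are hypothetical and do not describe the format in which our collection is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str   # unique identifier within the corpus (hypothetical field name)
    text: str


@dataclass(frozen=True)
class Topic:
    topic_id: str
    text: str     # statement of the information need


@dataclass(frozen=True)
class RelevanceJudgement:
    topic_id: str  # the topic whose information need is satisfied ...
    doc_id: str    # ... by this document


@dataclass
class IRTestCollection:
    documents: dict   # doc_id -> Document
    topics: dict      # topic_id -> Topic
    judgements: list  # list of RelevanceJudgement

    def validate(self):
        """Check that every judgement refers to a known topic and document."""
        return all(
            j.topic_id in self.topics and j.doc_id in self.documents
            for j in self.judgements
        )

    def relevant_docs(self, topic_id):
        """Return the identifiers of documents judged relevant to a topic."""
        return {j.doc_id for j in self.judgements if j.topic_id == topic_id}


# Illustrative identifiers only; none of these come from the actual collection.
collection = IRTestCollection(
    documents={"d1": Document("d1", "first paper"), "d2": Document("d2", "second paper")},
    topics={"t1": Topic("t1", "example information need")},
    judgements=[RelevanceJudgement("t1", "d2")],
)
```

Keeping judgements as explicit (topic, document) pairs makes it straightforward to look up the relevant set for a topic when scoring a ranked list.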
General-purpose MIR test collections, such as the one produced for NTCIR-10 (Aizawa et al., 2013), are expected to contain both broad and narrow topics capturing a wide range of retrieval complexity. In contrast, we require a collection of topics characterised by a higher lower bound on topic complexity with individual topics capturing highly-specialised, real-world information needs.
Unfortunately, research-level mathematical information needs are hard to source from documents in a way that would not render them artificial. Furthermore, manual construction of topics and relevance judgements is unrealistic due to the large number of experts required to cover the various specialised sub-fields of mathematics. This, coupled with limited access to numerous MIR systems, makes TREC-like pooling (Harman, 1993; Voorhees and Harman, 2005) impractical.
We propose that topics and relevance judgements be procured from the on-line collaboration website MathOverflow (MO), an online QA site for research mathematicians. A user (information seeker) can post a question on the site, usually relating to a small niche field in mathematics. Colleagues can post a candidate answer, comment on the question, or comment on and/or up-vote existing answers. Ultimately, the information seeker decides which answer satisfies the underlying information need by marking it as "accepted". Material on MO is closely aligned with our requirements. Specifically, Tausczik et al. (2014) and Martin and Pease (2013) agree that MO questions (information needs) arise from doing mathematics research and are novel to the mathematician involved. The authors conclude that, having been produced by experts, MO answers are authoritative and partially credit the website's reward system for their strong reliability.
Table 1: MO post 14655, prelude and micro-topics
Prelude: 1) Apparently, physicist can calculate the GW invariants of quintic CY 3-fold up to genus 51. 2) For each genus g, there is a lower bound d(g) such that for every d < d(g), all genus g degree d invariants of quintic are zero.
MT-1: I am looking for a reference that has a table of these number for some low degrees (say up to degree 5) and low genera (at least until g = 3).
MT-2: Where can I found this lower bound?
MO questions often have multiple sub-parts, which we refer to as micro-topics since they encode atomic information needs. Furthermore, information in MO questions is carried by two types of sentences: prelude sentences, which are used to set the mathematical context (introduce mathematical constructs and results) and query sentences, which transcribe the information need itself and are semantically bound to the accompanying prelude.
As the underlying document collection, we have used the Mathematical Retrieval Corpus (MREC) (Líška et al., 2011), which contains more than 439,000 mathematical publications, complete with mathematical formulae converted to machine-readable MathML. Similarly, we have made mathematical expressions in our topics accessible to MIR systems by converting all LaTeX embedded in MO questions into MathML using the LaTeXML tool-kit.
For the purpose of constructing our test collection we have adopted a multi-step process. All steps in the process are systematically applicable regardless of the subject material of the topic being considered for inclusion. As such, our test collection can be as diverse, in terms of mathematical subject and sub-fields, as MathOverflow.
Decisions relating to relevance of material to a given topic (MO question) are delegated to experts on the website. However, the information seeker (MO user posting the question) remains the ultimate judge of relevance. This authority is typically exercised either by accepting an answer directly or by explicitly commenting on the relevance of posted material.
In the first step, all MO discussion threads with at least one citation to the MREC in their accepted answer were collected. Each identified thread was examined by one of the authors for conformance to two ideal-standard criteria: (1) Useful MO questions should not be too broad or vague but rather express an information need that is clear and can be satisfied by describing objects or properties, stating conditions and/or producing examples or counter-examples. (2) MREC documents cited in MO accepted answers should address all sub-parts of the question in a manner that requires minimal deduction and should not synthesise mathematical results from multiple resources.
Subsequently, relevance of documents for each micro-topic is decided using two criteria: totality and directness. A cited resource is total if it contains all necessary information to derive the answer for the micro-topic and partial if it only addresses a special case. A cited resource is also said to be direct, if the answer can be derived with little intellectual effort from its text, or indirect if the same information requires considerable effort (such as mathematical deduction or reasoning) for the information seeker to reproduce.
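The two criteria can be recorded as independent labels on each judgement. The sketch below is one possible encoding, given purely for illustration: the type names, field names and identifiers are hypothetical, and it is not the representation used in our collection.

```python
from dataclasses import dataclass
from enum import Enum


class Totality(Enum):
    TOTAL = "total"      # contains all information needed to derive the answer
    PARTIAL = "partial"  # only addresses a special case


class Directness(Enum):
    DIRECT = "direct"      # answer follows with little intellectual effort
    INDIRECT = "indirect"  # answer requires considerable deduction or reasoning


@dataclass(frozen=True)
class GradedJudgement:
    micro_topic_id: str  # hypothetical identifier scheme
    doc_id: str
    totality: Totality
    directness: Directness


def select(judgements, totality=None, directness=None):
    """Keep the judgements matching the requested labels (None matches any)."""
    return [
        j for j in judgements
        if (totality is None or j.totality is totality)
        and (directness is None or j.directness is directness)
    ]


# Illustrative data only.
judged = [
    GradedJudgement("mt-1", "d1", Totality.TOTAL, Directness.DIRECT),
    GradedJudgement("mt-1", "d2", Totality.PARTIAL, Directness.DIRECT),
    GradedJudgement("mt-2", "d3", Totality.TOTAL, Directness.INDIRECT),
]
strict = select(judged, totality=Totality.TOTAL, directness=Directness.DIRECT)
```

Treating the two criteria as separate dimensions allows an evaluation to be restricted to, say, total and direct resources without discarding the remaining judgements.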
Making these determinations involves matching the language of arguments and the symbolic context of the answer to the cited resource. As part of this step, we also examine the post-answer (PA) comments for expressions of confirmation of the usefulness of a cited resource from the information seeker.
The completed test collection contains 160 micro-topics with 184 associated relevance judgements (involving 224 unique MREC documents) organised in 120 topics. Topic text in our test collection is sentence tokenised, with relevance judgements being represented conceptually as tuples of the form:

Related Work
Test collections over scientific publications were first introduced for the Cranfield experiments (Cleverdon, 1960; Cleverdon, 1962; Cleverdon et al., 1966a; Cleverdon et al., 1966b). Despite criticism for sourcing queries from collection documents, the Cranfield experiments highlighted the importance of jointly reporting recall and precision, pioneered the practice of using authors and citations for augmenting relevance judgements and established the test collection paradigm.
Expert citations have already been exploited for procuring relevance judgements. For example, Ritchie et al. (2006) elicited relevance judgements for citations in papers accepted in a scientific conference from their authors and used these judgements as part of their test collection of scientific publications.
In terms of domain, our work is related to the NTCIR-10 Math IR test collection (Aizawa et al., 2013). Furthermore, the topics in our collection are analogous to those in the NTCIR full-text search, in the sense that they take the form of coherent text interspersed with mathematical expressions. Rather than being focused on accommodating information needs of varying complexity, however, our test collection has been designed to facilitate retrieval of highly specialised, mathematical information needs of uniformly high complexity.
Similar use of crowd-sourced expertise has been proposed in the context of QA. For example, Gyongyi et al. (2008) examined 10 months' worth of "Yahoo! Answers" material as part of an investigation of QA data, which was later used for the NTCIR-8 Community QA pilot task (Ishikawa et al., 2010; Sakai et al., 2011). Characterisation of crowd-sourced answers in terms of totality (section 2) has also been considered in the context of QA. In particular, Sakai et al. (2011) describe a relevance grading scheme for crowd-sourced answers based on the total/partial/irrelevant scale, but highlight that answers on "Yahoo! Answers" vary in quality (e.g., due to instances of bias or obscenity).
Finally, the idea of sourcing relevance judgements from expert citations is an established practice in IR. In the context of patent search, for example, Graf and Azzopardi (2008) utilised citations in patent office expert reports as relevance judgements, while Fujii et al. (2006) automatically extracted patent office expert citations used to reject patent applications.

Experiments
In this section we conduct an experiment to demonstrate the usefulness of our test collection by investigating the impact of terminology boosting on MIR effectiveness. An important assumption of this experiment is that the retrieval of each micro-topic is dependent only on the attached prelude.

Experimental Setup
We first produced a Lucene index over all documents in the MREC. In order to normalise processing of XHTML+MathML, topics and MREC documents were passed through the Tika framework (https://tika.apache.org/). Lucene's StandardAnalyzer was modified to preserve stop-words since frequent words such as the preposition "of" can be important parts of technical terms (e.g., "set of vectors"). The analyzer was also modified to preserve dashes, which are common in technical terms (e.g., "Calabi-Yau manifold"). This analyzer is used during both indexing and query processing for consistency.

Building Queries
For each micro-topic in a given topic, we emit a query string by concatenating all sentences in the prelude with the sentences associated with the micro-topic. For example, the query string for micro-topic MT-1 in Table 1 is generated by concatenating its text with that of the prelude. This strategy is consistent with the assumption outlined at the beginning of the section, since no overlap beyond the prelude is introduced between queries generated for micro-topics attached to a given topic.
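The query construction just described amounts to a simple concatenation. The following minimal sketch illustrates it. The class names, field names and placeholder sentences are hypothetical, and it assumes that topics are already sentence tokenised.

```python
from dataclasses import dataclass


@dataclass
class MicroTopic:
    label: str       # e.g. "MT-1"
    sentences: list  # query sentences for this micro-topic


@dataclass
class Topic:
    prelude: list       # prelude sentences shared by all micro-topics
    micro_topics: list  # list of MicroTopic


def build_queries(topic):
    """Emit one query string per micro-topic: prelude text followed by its own text."""
    queries = {}
    for mt in topic.micro_topics:
        parts = list(topic.prelude) + list(mt.sentences)
        queries[mt.label] = " ".join(parts)
    return queries


# Placeholder sentences, standing in for the text of an MO question.
topic = Topic(
    prelude=["Prelude sentence one.", "Prelude sentence two."],
    micro_topics=[
        MicroTopic("MT-1", ["First question?"]),
        MicroTopic("MT-2", ["Second question?"]),
    ],
)
queries = build_queries(topic)
```

Because each query receives only the prelude and its own sentences, the queries for MT-1 and MT-2 share the prelude text and nothing else.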

Systems
Using Lucene as the indexing and searching back-end, we compare the performance of two retrieval methods. Underpinning both methods is Lucene's default similarity (project, 2013), which is based on cosine similarity:
\[ \text{cosine-similarity}(q, d) = \frac{V(q) \cdot V(d)}{|V(q)| \, |V(d)|} \]
where V(q) and V(d) are weighted vectors for the query and candidate document respectively. As a performance measure, we use mean average precision (MAP).
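For concreteness, the sketch below computes cosine similarity over sparse term-weight vectors, and MAP in its conventional form: the mean, over queries, of the average of the precision values observed at the rank of each relevant document. It is an illustration under stated assumptions rather than a description of Lucene's internals, whose practical scoring includes further factors. All function names and sample data are hypothetical.

```python
import math


def cosine_similarity(vq, vd):
    """Cosine of the angle between two sparse vectors given as term -> weight dicts."""
    dot = sum(w * vd.get(term, 0.0) for term, w in vq.items())
    norm_q = math.sqrt(sum(w * w for w in vq.values()))
    norm_d = math.sqrt(sum(w * w for w in vd.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


def average_precision(ranking, relevant):
    """Average of precision@k over the ranks k at which a relevant document appears.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(rankings, judgements):
    """Mean of average precision over all judged queries."""
    if not judgements:
        return 0.0
    scores = [
        average_precision(rankings.get(query_id, []), relevant)
        for query_id, relevant in judgements.items()
    ]
    return sum(scores) / len(scores)


# Illustrative data only.
rankings = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["x", "y"]}
judgements = {"q1": {"d1", "d3"}, "q2": {"y"}}
map_score = mean_average_precision(rankings, judgements)
```

With few relevant documents per query, as in our collection, a single relevant document moving up or down the ranking changes the average precision of that query considerably.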

Baseline
Lucene's VSM implementation with default TF-IDF weighting and scoring is used as the baseline. This is intended to emulate a general-purpose information retrieval scenario, which is the motivation behind the design of Lucene's default configuration.

Boosted Technical Terms
The alternative model is designed to give more weight to technical terminology common to both documents and queries. In order to construct this model, all technical terms are extracted from the document collection using an implementation of the C-Value multi-word technical term extraction method (Frantzi et al., 1996; Frantzi et al., 1998). Given an input corpus, the C-Value method extracts multi-word terms by making use of a linguistic and a statistical component. The linguistic component is responsible for eliminating multi-term strings that are unlikely to be technical terms through the use of a stop-list (composed of high-frequency corpus terms) and linguistic filters (regular expressions) applied on sequences of part-of-speech tags. The statistical component assigns a "termhood" score to a candidate term sequence based on corpus-wide statistical characteristics of the sequence itself and those of term sequences that contain it. The output of the algorithm is a list of candidate technical terms in the corpus, ordered by their C-Value termhood score. As shown in Table 3, each entry in the resulting list represents a single technical term (the class) and enumerates all forms of the candidate term as observed in the input corpus. In total, 3 million classes of technical terms have been detected in the MREC. Using Lucene's positional indexing mechanism, we retrieved the position of each technical term (all forms), recorded its term frequency (TF) and produced a new technical term index.
Table 4: Example of re-attribution and delta index
Original term vector: (a,2), (Riemannian,1), (manifold,2), (is,1), (smooth,1)
Technical terms: Riemannian manifold, smooth manifold
Re-attributed term vector: (a,2), (Riemannian manifold,1), (is,1), (smooth manifold,1)
Re-generated delta index text: a a a a Riemannian manifold Riemannian manifold is is smooth manifold smooth manifold
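The statistical component of C-Value can be illustrated with a simplified sketch. It assumes the commonly cited form of the score from Frantzi et al.: a candidate that is not nested in longer candidates scores log2(length) times its frequency, while a nested candidate has the mean frequency of the longer candidates containing it subtracted first. The linguistic filtering step is omitted, the candidate frequencies are invented for illustration, and the implementation used in our experiments may differ in detail.

```python
import math


def is_nested(shorter, longer):
    """True if `shorter` occurs as a contiguous sub-sequence of the strictly longer tuple."""
    n, m = len(longer), len(shorter)
    return n > m and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(frequencies):
    """Score multi-word candidates given as {tuple_of_words: corpus_frequency}."""
    scores = {}
    for candidate, freq in frequencies.items():
        containing = [f for other, f in frequencies.items() if is_nested(candidate, other)]
        if containing:
            # Discount occurrences explained by longer candidate terms.
            adjusted = freq - sum(containing) / len(containing)
        else:
            adjusted = freq
        scores[candidate] = math.log2(len(candidate)) * adjusted
    return scores


# Hypothetical candidate frequencies, not taken from the MREC.
candidates = {
    ("smooth", "manifold"): 10,
    ("compact", "smooth", "manifold"): 4,
    ("Riemannian", "manifold"): 7,
}
scores = c_value(candidates)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy input, "smooth manifold" is penalised because four of its ten occurrences are accounted for by the longer candidate "compact smooth manifold".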
This technical term index contains 426 million tuple entries. The same re-indexing process is repeated for the queries and the result is stored in a separate query table (10,433 entries). Subsequently, the indexed document and query term vectors were modified by (1) adding new tokens to represent technical term phrases and (2) re-attributing the TF of component terms to the term vector of the phrase.
Finally, the text for each MREC document and query is re-generated from the term vectors and stored in a "delta index". At this stage, the number of technical term instances emitted is twice that recorded by the original term vector. This has the effect of boosting the significance of technical terms and phrases. An example of the application of this process, from original text to delta index generation, is presented in Table 4. Rankings for the alternative model can be obtained by searching the delta index using the re-generated query. Although the choice of a boosting factor of 2 is arbitrary, our intention is to demonstrate the presence of a difference in retrieval efficiency, rather than to optimise the effect of boosting.
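The re-attribution step can be sketched on the term vector from Table 4. This is an illustration only: it works from term frequencies alone, whereas our pipeline locates technical terms through positional information, and the function and variable names are hypothetical.

```python
from collections import Counter


def reattribute(term_vector, phrase_tf):
    """Add phrase tokens and move the TF of their component words onto them.

    term_vector: {word: TF} for single-word tokens.
    phrase_tf:   {technical term phrase: TF of that phrase in the same text}.
    """
    vector = Counter(term_vector)
    for phrase, tf in phrase_tf.items():
        vector[phrase] += tf          # (1) new token representing the phrase
        for word in phrase.split():
            vector[word] -= tf        # (2) TF re-attributed away from components
    # Components fully absorbed by phrases drop out of the vector.
    return {term: tf for term, tf in vector.items() if tf > 0}


original = {"a": 2, "Riemannian": 1, "manifold": 2, "is": 1, "smooth": 1}
technical_terms = {"Riemannian manifold": 1, "smooth manifold": 1}
reattributed = reattribute(original, technical_terms)
```

Here both occurrences of "manifold" are absorbed, one by each phrase, so the resulting vector holds the same entries as the re-attributed term vector in Table 4.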

Results
The MAP scores obtained for the models are presented in Table 5. We observe that the difference in MAP is in favour of the alternative model. This difference is statistically significant at α = 0.05 using the Wilcoxon signed-rank test (p < 0.05). Therefore, we have sufficient evidence to conclude that, in the context of the VSM, boosting technical terms improves retrieval efficiency of research mathematics.
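As an illustration of the significance test, the sketch below computes the Wilcoxon signed-rank statistic for paired per-query scores, for instance the average precision of each query under the two models. It makes several assumptions: zero differences are discarded, tied absolute differences receive average ranks, and the two-sided p-value uses the normal approximation without continuity or tie correction, which is unreliable for small samples. The sample scores are invented and are not our experimental data. A library routine such as `scipy.stats.wilcoxon` would normally be preferable.

```python
import math


def wilcoxon_signed_rank(baseline, alternative):
    """Return (W+, W-, approximate two-sided p-value) for paired samples."""
    diffs = [b - a for a, b in zip(baseline, alternative) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Normal approximation for the smaller rank sum.
    statistic = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (statistic - mean) / sd
    p_value = min(1.0, 1.0 + math.erf(z / math.sqrt(2)))  # equals 2 * Phi(z) for z <= 0
    return w_plus, w_minus, p_value


# Invented paired scores for five queries.
w_plus, w_minus, p_value = wilcoxon_signed_rank([1, 2, 3, 4, 5], [3, 1, 6, 8, 0])
```

The difference between the models would be declared significant at α = 0.05 when the returned p-value falls below 0.05.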
When compared to MAP scores produced by the same systems in more traditional IR tasks, the scores in Table 5 may seem poor. We attribute this phenomenon to the fact that sense in written mathematics is communicated via a complex interaction of text and mathematical expressions and is thus hard to extract using shallow methods.

Conclusions and Further Work
We have constructed a Math IR test collection for real-life, research-level mathematical information needs. As part of the work of constructing our test collection, we have developed a methodology for compiling domain-specific test collections that requires minimal expertise in the domain itself.
Using 160 micro-topics in our test collection, we have shown experimentally that the performance of VSM-based retrieval models with research mathematics can be improved by boosting the importance of technical terminology. Furthermore, our experimental work suggests that our test collection can be used to identify statistically significant differences between MIR systems. It is our intention to make our collection available to the IR community.
As part of on-going and future work, we will be incorporating additional retrieval models, such as the Okapi BM25, in our evaluation framework. In addition, we are looking into investigating the statistical properties of our test collection along the lines of Harman (2011) and Soboroff et al. (2001).