Integrating Query Performance Prediction in Term Scoring for Diachronic Thesaurus

A diachronic thesaurus is a lexical resource that aims to map between modern terms and their semantically related terms in earlier periods. In this paper, we investigate the task of collecting a list of relevant modern target terms for a domain-speciﬁc diachronic thesaurus. We propose a supervised learning scheme, which integrates features from two closely related ﬁelds: Terminology Extraction and Query Performance Prediction (QPP). Our method further expands modern candidate terms with ancient related terms, before assessing their corpus relevancy with QPP measures. We evaluate the empirical beneﬁt of our method for a thesaurus for a diachronic Jewish corpus.


Introduction
In recent years, there has been growing interest in diachronic lexical resources, which comprise terms from different language periods. (Borin and Forsberg, 2011;Riedl et al., 2014). These resources are mainly used for studying language change and supporting searches in historical domains, bridging the lexical gap between modern and ancient language.
In particular, we are interested in this paper in a certain type of diachronic thesaurus. It contains entries for modern terms, denoted as target terms. Each entry includes a list of ancient related terms. Beyond being a historical linguistic resource, such thesaurus is useful for supporting searches in a diachronic corpus, composed of both modern and ancient documents. For example, in our historical Jewish corpus, the modern Hebrew term for terminal patient 1 has only few verbatim occurrences, in modern documents, but this topic has been widely discussed in ancient periods. A domain searcher needs the diachronic thesaurus to enrich the search with ancient synonyms or related terms, such as dying and living for the moment.
Prior work on diachronic thesauri addressed the problem of collecting relevant related terms for given thesaurus entries. In this paper we focus on the complementary preceding task of collecting a relevant list of modern target terms for a diachronic thesaurus in a certain domain. As a starting point, we assume that a list of meaningful terms in the modern language is given, such as titles of Wikipedia articles. Then, our task is to automatically decide which of these candidate terms are likely to be relevant for the corpus domain and should be included in the thesaurus. In other words, we need to decide which of the candidate modern terms corresponds to a concept that has been discussed significantly in the diachronic domain corpus.
Our task is closely related to term scoring in the known Terminology Extraction (TE) task in NLP. The goal of corpus-based TE is to automatically extract prominent terms from a given corpus and score them for domain relevancy. In our setting, since all the target terms are modern, we avoid extracting them from the diachronic corpus of modern and ancient language. Instead, we use a given candidate list and apply only the term scoring phase. As a starting point, we adopt a rich set of state-of-the-art TE scoring measures and integrate them as features in a common supervised classification approach (Foo and Merkel, 2010;Zhang et al., 2010;Loukachevitch, 2012).
Given our Information Retrieval (IR) motivation, we notice a closely related task to TE, namely Query Performance Prediction (QPP). QPP methods are designed to estimate the retrieval quality of search queries, by assessing their relevance to the text collection. Therefore, QPP scoring measures seem to be potentially suitable also for our terminology scoring task, by considering the candidate term as a search query. Some of the QPP measures are indeed similar in nature to the TE methods, analyzing the distribution of the query terms within the collection. However, some of the QPP methods have different IR-biased characteristics and may provide a marginal contribution. Therefore, we adopted them as additional features for our classifier and indeed observed a performance increase.
Most of the QPP methods prioritize query terms with high frequency in the corpus. However, in a diachronic corpus, such criterion may sometimes be problematic. A modern target term might appear only in few modern documents, while being referred to, via ancient terminology, also in ancient documents. Therefore, we would like our prediction measure to be aware of these ancient documents as well. Following a particular QPP measure (Zhou and Croft, 2007), we address this problem through Query Expansion (QE). Accordingly, our method first expands the query containing the modern candidate term, then calculates the QPP scores of the expanded query and then utilizes them as scoring features. Combining the baseline features with our expansion-based QPP features yields additional improvement in the classification results.

Term Scoring Measures
This section reviews common measures developed for Terminology Extraction (Section 2.1) and for Query Performance Prediction (Section 2.2). Table 1 lists those measures that were considered as features in our system, as described in Section 3.

Terminology Extraction
Terminology Extraction (TE) methods aim to identify terms that are frequently used in a specific domain. Typically, linguistic processors (e.g. POS tagger, phrase chunker) are used to filter out stop words and restrict candidate terms to nouns or noun phrases. Then, statistical measures are used to rank the candidate terms. There are two main terminological properties that the statistical measures identify: unithood and termhood. Measures that express unithood indicate the collocation strength of units that comprise a single term. Measures that express termhood indicate the statistical prominence of the term in the target do-main corpus. For our task, we focus on the second property, since the candidates are taken from a key-list of terms whose coherence in the language is already known. Measures expressing termhood are based either on frequency in the target corpus (1,2,3,4,9,11,12,13) 2 , or on comparison with frequency in a reference background corpus (8,14,16). Recently, approaches which combine both unithood and termhood were investigated as well (7,8,15,16).

Query Performance Prediction
Query Performance Prediction (QPP) aims to estimate the quality of answers that a search system would return in response to a particular query. Statistical QPP methods are categorized into two types: pre-retrieval methods, analyzing the distribution of the query term within the document collection; and post-retrieval methods, additionally analyzing the search results. Some of the preretrieval methods are similar to TE methods based on the same term frequency statistics.
Pre-retrieval methods measure various properties of the query: specificity (17,18,24,25), similarity to the corpus (19), coherence of the documents containing the query terms (26), variance of the query terms' weights over the documents containing it (20); and relatedness, as good performance is expected when the query terms co-occur frequently in the collection (21).
Post-retrieval methods are usually more complex, where the top search results are retrieved and analyzed. They are categorized into three main paradigms: clarity-based methods (28), robustness-based methods (22) and score distribution based methods (23, 29).
We pay special attention to two post-retrieval QPP methods; Query Feedback (22) and Clarity (23). The Clarity method measures the coherence of the query's search results with respect to the corpus. It is defined as the KL divergence between a language model induced from the result list and that induced from the corpus. The Query Feedback method measures the robustness of the query's results to query perturbations. It models retrieval as a communication channel. The input is the query, the channel is the search system, and the set of results is the noisy output of the channel. A new query is generated from the list of search  (Liu et al., 2005) 13 Term Variance Quality (Liu et al., 2005) 6 TF-Disjoint Corpora Frequency (Lopes et al., 2012) 14 Weirdness (Ahmad et al., 1999) 7 C-value (Frantzi andAnaniadou, 1999) 15 NC-value (Frantzi and Ananiadou, 1999) 8 Glossex (Kozakov et al., 2004) 16 TermExtractor (Sclano and Velardi, 2007) Query Performance Prediction measures 17 Average IDF  24 Average ICTF (Inverse collection term frequency) (Plachouras et al., 2004) 18 Query Scope  25 Simplified Clarity Score  19 Similarity Collection Query (Zhao et al., 2008) 26 Query Coherence (He et al., 2008) 20 Average Variance (Zhao et al., 2008) 27 Average Entropy (Cristina, 2013) 21 Term Relatedness (Hauff et al., 2008) 28 Clarity (Cronen-Townsend et al., 2002) 22 Query Feedback (Zhou and Croft, 2007) 29 Normalized Query Commitment (Shtok et al., 2009) 23 Weighted Information Gain (Zhou and Croft, 2007) Table 1: Prior art measures considered in our work results, taking the terms with maximal contribution to the Clarity score, and then a second list of results is retrieved for that second query. The overlap between the two lists is the robustness score. Our suggested method was inspired by the Query Feedback measure, as detailed in the next section.

Integrated Term Scoring
We adopt the supervised framework for TE (Foo and Merkel, 2010;Zhang et al., 2010;Loukachevitch, 2012), considering each candidate target term as a learning instance. For each candidate, we calculate a set of features over which learning and classification are performed. The classification predicts which candidates are suitable as target terms for the diachronic thesaurus. Our baseline system (TE) includes state-of-the-art TE measures as features, listed in the upper part of Table 1.
Next, we introduce two system variants that integrate QPP measures as additional features. The first system, TE-QPP T erm , applies the QPP measures to the candidate term as the query. All QPP measures, listed in the lower part of Table 1, are utilized except for the Query Feedback measure (22) (see below). To verify which QPP features are actually beneficial for terminology scoring, we measure the marginal contribution of each feature via ablation tests in 10-fold cross validation over the training data (see Section 4.1). Features which did not yield marginal contribution were not included 3 . 3 Removed features from TE-QPPT erm: 17, 19, 22, 23, The two systems, described so far, rely on corpus occurrences of the original candidate term, prioritizing relatively frequent terms. In a diachronic corpus, however, a candidate term might be rare in its original modern form, yet frequently referred to by archaic forms. Therefore, we adopt a query expansion strategy based on Pseudo Relevance Feedback, which expands a query based on analyzing the top retrieved documents. In our setting, this approach takes advantage of a typical property of modern documents in a diachronic corpus, namely their temporally-mixed language. Often, modern documents in a diachronic domain include ancient terms that were either preserved in modern language or appear as citations. Therefore, an expanded query of a modern term, which retrieves only modern documents, is likely to pick some of these ancient terms as well. Thus, the expanded query would likely retrieve both modern and ancient documents and would allow QPP measures to evaluate the query relevance across periods.
Therefore, our second integrated system, TE-QPP QE , utilizes the Pseudo Relevance Feedback Query Expansion approach to expand our modern candidate with topically-related terms. First, similarly to the Query Feedback measure (measure (22) in the lower part of Table 1), we expand the candidate by adding terms with maximal contribution (top 5, in our experiments) to the Clarity score (Section 2.2). Then, we calculate all QPP measures for the expanded query. Since the expan- 24, 25. sions that we extract from the top retrieved documents typically include ancient terms as well, the new scores may better express the relevancy of the candidate's topic across the diachronic corpus. We also performed feature selection, as done for the first system 4 .

Evaluation Setting
We applied our method to the diachronic corpus is the Responsa project Hebrew corpus 5 . The Responsa corpus includes rabbinic case-law rulings which represent the historical-sociological milieu of real-life situations, collected over more than a thousand years, from the 11 th century until today. The corpus consists of 81,993 documents, and was used for previous NLP and IR research (Choueka et al., 1971;Choueka et al., 1987;HaCohen-Kerner et al., 2008;Liebeskind et al., 2012;Zohar et al., 2013;.
The candidate target terms for our classification task were taken from the publicly available keylist of Hebrew Wikipedia entries 6 . Since many of these tens of thousands entries, such as person names and place names, were not suitable as target terms, we first filtered them by Hebrew Named Entity Recognition 7 and manually. Then, a list of approximately 5000 candidate target terms was manually annotated by two domain experts. The experts decided which of the candidates corresponds to a concept that has been discussed significantly in our diachronic domain corpus. Only candidates that the annotators agreed on their annotation were retained, and then balanced for equal number of positive and negative examples. Consequently, the balanced training and test sets contain 500 and 200 candidates, respectively.
For classification, Weka's 8 Support Vector Machine supervised classifier with polynomial kernel was used. We train the classifier with our training set and measure the accuracy on the test set.

Results
Table 2 compares the classification performance of our baseline (TE) and integrated systems, (TE-QPP T erm ) and (TE-QPP QE ), proposed in Section 3.

Feature Set
Accuracy (%) TE 61.5 TE-QPP T erm 65 TE-QPP QE 66.5 In general, additional QPP features increase the classification accuracy. Even though the improvement of the term-based QPP over the baseline is not statistically significant according to the Mc-Nemar's test (McNemar, 1947), on our diachronic corpus it seems to help. Yet, when the QPP score is measured over the expanded candidate, and ancient documents are utilized, the performance increase is more notable (5 points) and the improvement over the baseline is statistically significant according to the McNemar's test with p<0.05.
We analyzed the false negative classifications of the baseline that were classified correctly by the QE-based configuration. We found that their expanded forms contain ancient terms that help the system making the right decision. For example, the Hebrew target term for slippers was expanded by the ancient expression corresponding to made of leather. This is a useful expansion since in the ancient documents slippers are discussed in the context of fasts, as in two of the Jewish fasts wearing leather shoes is forbidden and people wear cloth-made slippers.

Conclusions and Future Work
We introduced a method that combines features from two closely related tasks, terminology extraction and query performance prediction, to solve the task of target terms selection for a diachronic thesaurus. In our diachronic setting, we showed that enriching TE measures with QPP measures, particularly when calculated on expanded candidates, significantly improves performance. Our results suggest that it may be worth investigating this integrated approach also for other terminology extraction and QPP settings.
We plan to further explore the suggested method by utilizing additional query expansion algorithms. In particular, to avoid expanding queries for which expansion degrade retrieval performance, we plan to investigate the selective query expansion approach (Cronen-Townsend et al., 2004).