Top-Rank-Focused Adaptive Vote Collection for the Evaluation of Domain-Specific Semantic Models

The growth of domain-specific applications of semantic models, boosted by the recent achievements of unsupervised embedding learning algorithms, demands domain-specific evaluation datasets. In many cases, content-based recommenders being a prime example, these models are required to rank words or texts according to their semantic relatedness to a given concept, with particular focus on top ranks. In this work, we give a threefold contribution to address these requirements: (i) we define a protocol for the construction, based on adaptive pairwise comparisons, of a relatedness-based evaluation dataset tailored to the available resources and optimized to be particularly accurate in top-rank evaluation; (ii) we define appropriate metrics, extensions of well-known ranking correlation coefficients, to evaluate a semantic model via the aforementioned dataset by taking into account the greater significance of top ranks. Finally, (iii) we define a stochastic transitivity model to simulate semantic-driven pairwise comparisons, which confirms the effectiveness of the proposed dataset construction protocol.


Introduction
In recent years, we have been witnessing a growth of Natural Language Processing (NLP) applications in a wide range of specific domains, such as recruiting (INDA; Qin et al., 2018), law (Sugathadasa et al., 2017), oil and gas (Nooralahzadeh et al., 2018), social media analysis (ALRashdi and O'Keefe, 2019), online education (Dessì et al., 2019), and biomedicine (Patel et al., 2020). Embedding-based models have been playing a crucial role in this specialization, as they allow the application of the same learning algorithm to a variety of different corpora of unlabeled texts, obtaining domain-specific models (Bengio et al., 2003; Devlin et al., 2018; Mikolov et al., 2013a,b; Pennington et al., 2014).
The evaluation and validation of a domain-specialized model requires manually-annotated domain-specific datasets (Bakarov, 2018; Lastra-Díaz et al., 2019). However, the construction of such datasets is a very resource-consuming process, and particular care is needed to ensure their ability to evaluate the desired features (Bakarov, 2018; Wang et al., 2019). In particular, it is fundamental to carefully consider the so-called downstream task (i.e., the final purpose of the model), because the appropriate evaluation metric depends on this task (Bakarov, 2018; Blanco et al., 2013; Halpin et al., 2010; Rogers et al., 2018; Wang et al., 2019).
In view of the widespread adoption of these applications, we propose a methodology to construct appropriate domain-specific datasets and metrics to assess the accuracy of relatedness and similarity estimations.
In particular, due to its suitability for non-expert human annotation, we mainly focus on semantic relatedness; however, the proposed protocol can be easily extended to semantic similarity.
A standard approach to evaluate a relatedness-based model is the comparison of the semantic ranking it produces with the corresponding ranking determined from human annotations. However, the relevance of rank mismatches may depend on the involved positions; in particular, top ranks are considered more important in many contexts, two prominent examples being content-based recommenders (De Gemmis et al., 2008; Lops et al., 2011; Mladenic, 1999) and semantic matching (Giunchiglia et al., 2004; Li and Xu, 2014; Wan et al., 2016). The greater significance of top ranks compared with low ranks is actually a rather common phenomenon, as evidenced by the many attempts to overweight the former in the context of ranking correlation (Blest, 2000; Pinto da Costa and Soares, 2005; Dancelli et al., 2013; Iman and Conover, 1987; Maturi and Abdelfattah, 2008; Shieh, 1998; Vigna, 2015; Webber et al., 2010).
Our contribution is framed within the requirement to create domain-specific datasets to evaluate semantic relatedness measures with particular focus on top ranks, and it is threefold. (i) In Section 2, we define a protocol for the construction, based on adaptive pairwise comparisons, of a relatedness-based evaluation dataset tailored to the available resources and optimized to be particularly accurate in top-rank evaluation. (ii) In Section 3, we define appropriate metrics to evaluate a semantic model via the aforementioned dataset by taking into account the greater significance of top ranks; the proposed metrics are extensions of well-known ranking correlation measures and can be used to compare rankings, independently of their origin, whenever top ranks are particularly important. Finally, (iii) in Section 4.1, we define a stochastic model to simulate semantic-driven pairwise comparisons, whose predictions (described in Section 4.2) confirm the effectiveness of the proposed dataset construction protocol; in more detail, we adapt a stochastic transitivity model, originally defined in the context of comparative judgment, in order to make it suitable for either similarity-driven or relatedness-driven comparisons.

Dataset Construction
In this section, we describe and justify a methodology to construct a dataset for the evaluation of a domain-specific relatedness-based model. Relatedness-based evaluation, known as intrinsic evaluation in the context of embedding-based models, requires the construction of a dataset of human annotations, which may be collected via two different approaches. The former relies on a small group of linguistic experts to create a gold-standard dataset, which is reliable but very expensive and, due to the subjectivity of relatedness and to the limited number of annotations, highly susceptible to bias and lack of statistical significance (Blanco et al., 2013; Faruqui et al., 2016). The latter relies on a large group of non-experts, typically recruited through a crowdsourcing service (e.g., Amazon MTurk, ProlificAcademic, SocialSci, CrowdFlower, ClickWorker, CrowdSource); it is typically more affordable, and it has been shown to be repeatable and reliable (Blanco et al., 2013).
In the next sections we describe and justify a protocol to construct a dataset based on semantic relatedness between pairs of tokens 1 collected via a crowdsourcing approach. To simplify the reading of the paper, Figure 1 shows the main steps for the practical construction of a dataset within the proposed approach, while Table 1 reports a summary of the most frequently used symbols.

Token Choice
The first step in the dataset construction is the choice of the tokens among which we want to estimate the semantic relatedness. These tokens must be carefully chosen to represent the semantic areas typically involved in the downstream tasks (Bakarov, 2018; Schnabel et al., 2015). Moreover, it is well known that models based on high-dimensional embeddings tend to incorrectly identify one of a few common tokens, called hubs, as a semantic nearest neighbor of almost any concept (Dinu et al., 2014; Feldbauer et al., 2018; Francois et al., 2007; Radovanović et al., 2010a,b). In order to detect this undesirable feature, which goes under the name of the hubness problem, an evaluation dataset must contain a relevant amount of rare 2 tokens (Bakarov, 2018; Blanco et al., 2013). Henceforth, we consider as a concrete example a content-based recommender system in the recruitment domain (INDA); in this case, the designated tokens can be chosen among hard/soft skills, job titles, and other tokens found in resumes and job descriptions, including a relevant fraction of rare tokens.
Another potential issue of using relatedness to evaluate semantic models is associated with lexical ambiguity, i.e., with the lack of one-to-one correspondence between tokens and meanings (Bakarov, 2018; Faruqui et al., 2016; Wang et al., 2019). To mitigate this problem, we suggest identifying a number of relevant semantic areas within the domain of interest and subdividing the tokens accordingly. For instance, Sales & Marketing, Computer-related, Workforce, and Work & Welfare are examples of semantic areas within the recruiting domain.

Token Pairing
The random sampling of pairs in the whole vocabulary is known to produce a large amount of unrelated pairs, in contrast with the desired focus on the most related pairs. A standard approach to overcome this problem is pair selection based on either known semantic relations or the frequency of tokens' co-occurrence within a corpus of texts. While the former information may be a priori unknown within the domain of interest, the latter may produce a bias in favor of distributional methods that compute relatedness based on similar knowledge sources.
We therefore suggest separate token pairing within each of the semantic areas identified as described in Section 2.1: in this case, the relatedness distribution is substantially shifted towards larger values compared with random sampling in the whole vocabulary. This shift is shown in Figure 3, based on a word embedding created with the word2vec algorithm (Mikolov et al., 2013a,b) trained on a corpus of resumes, where the relatedness distribution of pairs of distinct tokens selected within the same semantic area 3 (red plusses) is compared with that of pairs randomly generated in the whole vocabulary (purple diamonds). The generation of all pairs of distinct tokens produces N_tok(N_tok − 1)/2 pairs per semantic area, N_tok being the number of tokens in the area. Although the number of pairs can be reduced via random sampling, an accurate evaluation requires a large number of pairs.
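The within-area pairing described above can be sketched as follows (a minimal illustration; the function `generate_items`, the area names, and the tokens are hypothetical, and real vocabularies would be far larger):

```python
import random
from itertools import combinations

def generate_items(semantic_areas, max_pairs_per_area=None, seed=0):
    """Pair tokens only within each semantic area.

    semantic_areas: dict mapping an area name to its list of tokens.
    Each area contributes N_tok * (N_tok - 1) / 2 pairs of distinct
    tokens, optionally subsampled to limit the total number of items.
    """
    rng = random.Random(seed)
    items = []
    for area, tokens in semantic_areas.items():
        pairs = list(combinations(tokens, 2))  # all pairs of distinct tokens
        if max_pairs_per_area is not None and len(pairs) > max_pairs_per_area:
            pairs = rng.sample(pairs, max_pairs_per_area)
        items.extend(pairs)
    return items

areas = {"Computer-related": ["python", "sql", "linux"],
         "Sales & Marketing": ["seo", "crm"]}
items = generate_items(areas)
# 3 tokens give 3 pairs, 2 tokens give 1 pair: 4 items in total
```

Tokens from different areas are never paired, which is what shifts the relatedness distribution towards larger values.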

Vote Collection
Once we have defined the pairs of tokens, which will be referred to as items hereafter, we want to rank them, with particular emphasis on top ranks, according to the opinions of a large number N_voters of non-experts. Due to the large number of items involved, the complete ranking of all items would be an unfeasible task for a human, and it is convenient to reformulate it in terms of pairwise comparisons (Fürnkranz and Hüllermeier, 2010; Heckel et al., 2018, 2019; Jamieson and Nowak, 2011; Negahban et al., 2017; Park et al., 2015; Wauthier et al., 2013). Moreover, the complete exploration of the N_items(N_items − 1)/2 pairs of distinct items would be extremely expensive in terms of votes; fortunately, it has been shown to be unnecessary in many studies (Jamieson and Nowak, 2011; Negahban et al., 2017; Park et al., 2015; Wauthier et al., 2013).
Louviere and Woodworth (1991) (see also Kiritchenko and Mohammad (2017)) proposed a faster alternative to pairwise comparisons, known as best-worst scaling. In this case, n-tuples (typically, n = 4), rather than pairs, are presented to the voter, who is required to identify the best and the worst items in each tuple, according to the relatedness of the corresponding tokens. The drawbacks of this approach are a reduction, for n > 3, in the control over which pairs are actually checked, and an increase in the complexity of each vote, which is particularly unwanted in crowdsourced vote collections. For this reason, we rely on standard pairwise comparisons: we generate N_comp pairs of items (as described in Sections 2.4 and 2.5), each one to be presented to one voter, who is requested to identify the item formed by the most similar tokens.

Uniform Item Selection
In our setting, each item i is presented to the voters a total number of times M_i, and we define a score

x_i = n_i / M_i,    (1)

where n_i represents the total number of times item i was the winner 4 in the vote collection; note that x_i corresponds to an empirical approximation of the average probability 5 (known as the Borda score in the context of social choice theory) that item i beats a randomly chosen item j ≠ i, where the accuracy of the approximation increases as M_i increases (Borda, 1784; Heckel et al., 2019).
In the absence of a priori knowledge of the expected scores, a reasonable approach to the data collection consists of presenting each item to the voters the same number of times. In this scenario, we randomly generate N_comp pairs of items, with the constraint that M_i = 2N_comp/N_items for all i. The only way to increase M_i, which is a proxy of the accuracy of the score x_i defined in Equation 1, is therefore to increase the total number of comparisons N_comp. So-called adaptive approaches have been proposed to increase the efficiency of pairwise comparisons by identifying, before each comparison, the optimal pair of items to be compared based on the votes already collected, on the task to be solved (typically, finding the global ranking or a ranking-induced partition), and on assumptions on the vote distribution.
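The uniform scheme can be sketched as follows (an illustrative implementation, assuming each pair receives a single vote; the helper names `uniform_pairs` and `borda_scores` are hypothetical):

```python
import random

def uniform_pairs(n_items, m, seed=0):
    """Draw random pairs so that every item appears exactly m times.

    Requires m * n_items to be even; self-pairings are rejected by
    reshuffling, which terminates quickly for realistic sizes.
    """
    rng = random.Random(seed)
    pool = [i for i in range(n_items) for _ in range(m)]
    while True:
        rng.shuffle(pool)
        pairs = [(pool[k], pool[k + 1]) for k in range(0, len(pool), 2)]
        if all(a != b for a, b in pairs):
            return pairs

def borda_scores(n_items, pairs, winners):
    """Empirical Borda scores x_i = n_i / M_i from the vote outcomes."""
    wins = [0] * n_items
    appearances = [0] * n_items
    for (a, b), w in zip(pairs, winners):
        appearances[a] += 1
        appearances[b] += 1
        wins[w] += 1
    return [wins[i] / appearances[i] for i in range(n_items)]

pairs = uniform_pairs(6, 4)  # N_comp = 6 * 4 / 2 = 12 comparisons
# toy outcome: the first item of each pair is declared the winner
scores = borda_scores(6, pairs, [a for a, b in pairs])
```

Each item appears M_i = 2N_comp/N_items times by construction, and the scores lie in [0, 1].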

Adaptive Item Selection
The application of an adaptive approach in our context requires two additional ingredients which, to the best of our knowledge, are still missing in the literature: (i) in order to avoid overfitting to the opinions of the fastest voters and to allow simultaneous voting, the choice of the pairs to be presented must occur in a few events, as each of these events causes a discontinuity in the vote collection (namely, this choice requires suspending the vote collection when the numbers of comparisons reach the desired distribution among the voters); (ii) the goal is a selective increase in the precision (proxied by M_i) of top ranks (with no need of a priori knowledge of the semantic relatedness of the tokens), rather than a general improvement in the global ranking. In Section 3 we define an appropriate metric to quantify top-rank accuracy.

4 Ties can be accounted for by defining n_i = n_i^w + n_i^t/2, where n_i^w (n_i^t) represents the number of wins (ties) for item i.
The key idea is to subdivide the voting procedure into n_b subsequent ballots in which pairwise comparisons, based on a list of pairs determined before the beginning of the ballot, are presented to the voters. During the first ballot, the pairs are randomly drawn from all items, with the constraint that each item appears M times 6, while in each subsequent ballot k, the pairs are drawn, with the analogous constraint, from the N_items^(k) top-rank items selected according to the results of the previous ballots. In more detail, we define

N_items^(k) = α^(k−1) N_items,    (2)

where α represents the fraction of items selected at each ballot. Since each item contained in ballot k appears M times within the pairs of such ballot, the total number of comparisons can be written as

N_comp = Σ_{k=1}^{n_b} (M/2) N_items^(k) = (M N_items/2) (1 − α^(n_b)) / (1 − α).    (3)

Thus, each item i which survives up to the last ballot is presented to the voters a number of times

M_top = n_b M = M_unif n_b (1 − α) / (1 − α^(n_b)) ≈ M_unif n_b (1 − α),    (4)

where M_unif = 2N_comp/N_items is the number of comparisons per item in the case of uniform item selection with the same total number of comparisons; the last approximation holds whenever α^(n_b) ≪ 1, i.e., when the fraction of items which survive up to the last ballot is small. According to Equation 4, the score precision for top-rank items can be increased by decreasing the fraction α of selected items or by increasing the number n_b of ballots; in Section 2.7 we discuss bounds on these values.
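The resulting ballot schedule can be sketched as follows (a minimal illustration assuming the number of surviving items is rounded to the nearest integer with a floor of two, and that each ballot contributes M N^(k)/2 comparisons; the function name is hypothetical, and counting conventions, e.g., how many votes each pair receives, may differ in a concrete deployment):

```python
def ballot_schedule(n_items, m, alpha, n_ballots):
    """Items per ballot and total comparisons for the adaptive protocol.

    Assumes N^(k) = alpha^(k-1) * N_items (rounded, at least 2 items)
    and m appearances per surviving item per ballot, i.e. m * N^(k) / 2
    pairs per ballot.
    """
    sizes = [max(2, round(alpha ** (k - 1) * n_items))
             for k in range(1, n_ballots + 1)]
    comparisons = sum(m * n_k // 2 for n_k in sizes)
    m_top = n_ballots * m  # appearances of an item surviving every ballot
    return sizes, comparisons, m_top

sizes, n_comp, m_top = ballot_schedule(n_items=100, m=10, alpha=0.5, n_ballots=4)
# sizes -> [100, 50, 25, 12]; a surviving item is seen m_top = 40 times
```

Halving the surviving items at each ballot concentrates the vote budget on the candidates for the top ranks.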

Score Calculation
At each ballot k and for each item i contained in k, we can evaluate a score x_i^(k) analogous to the one defined in Equation 1. However, since low-score items are discarded at the end of each ballot, the typical score of the surviving items decreases from one ballot to the next, and this discrepancy must be taken into account in order to average scores from different ballots. We therefore define a rescaled score y_i^(k) = f_resc^(k)(x_i^(k)), where f_resc^(k) is a linear rescaling function mapping the scores of ballot k onto the scale of the previous ballots, and ȳ_i^(k) is defined as the average of all rescaled scores up to ballot k, i.e.,

ȳ_i^(k) = (1/k) Σ_{k'=1}^{k} y_i^(k').    (5)

We enforce the f_resc^(k)(1) = 1 constraint in the linear interpolation, obtaining 7, for k > 1,

f_resc^(k)(x) = 1 − β^(k) (1 − x),    (6)

where the slope β^(k) is determined by linearly interpolating the relation between x_i^(k) and ȳ_i^(k−1) over the items contained in ballot k. Figure 2 shows an example of the interpolation on data simulated via the stochastic model described in Section 4.1. Note that Equation 5 provides a sequence of approximations of the Borda scores with accuracy increasing with k; top ranks are expected to survive up to the last ballot, and therefore to be highly accurate.

Choice of Parameter Values
We provide here heuristics to identify ranges of values for the parameters of the adaptive approach.

Number n_b of Ballots. We need n_b ≥ 2 for the adaptive approach to be meaningful, while, to limit the discontinuities in the vote collection, a reasonable upper bound is n_b ≲ 10.

7 Since the winning chances of a given item j decrease at each ballot, on average, x_j^(k) ≤ ȳ_j^(k−1).
Fraction α of Selected Items. As the purpose of the adaptive approach is to focus votes on top-rank items, a reasonable request is to have no more than 10% of the items surviving up to the last ballot, which gives an upper bound α ≲ 0.1^(1/(n_b−1)). On the other hand, at least two items must be present in the last ballot; according to Equation 2, this implies a lower bound α ≳ (2/N_items)^(1/(n_b−1)).
Number N_comp of Comparisons. To achieve the desired precision (namely, the statistical significance of the averages) on top-rank scores, a reasonable request is M_top ≳ 100; using Equation 4 in the α^(n_b) ≪ 1 limit, this implies N_comp ≳ 50 N_items/[n_b(1 − α)]. The choice of M and α within the above bounds involves a trade-off: (i) according to Equation 4, for a fixed N_comp, decreasing α increases M_top and thus the precision of the top-rank scores; (ii) for given x_i fluctuations, by decreasing α we increase the probability of top ranks' premature loss due to the stricter selection. A rigorous derivation of the optimal values of M and α as a trade-off between these phenomena is beyond the scope of this section, as the bounds discussed above provide heuristic ranges.
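The heuristic bounds on α can be computed directly (a minimal sketch; the function name is hypothetical):

```python
def alpha_bounds(n_items, n_ballots):
    """Heuristic range for the survival fraction alpha.

    Upper bound: at most ~10% of the items reach the last ballot.
    Lower bound: at least two items must reach the last ballot.
    """
    upper = 0.1 ** (1.0 / (n_ballots - 1))
    lower = (2.0 / n_items) ** (1.0 / (n_ballots - 1))
    return lower, upper

lo, hi = alpha_bounds(n_items=990, n_ballots=7)
# roughly 0.36 <= alpha <= 0.68 for 990 items and 7 ballots
```

For instance, with the 990 items and n_b = 7 used in Section 4.2, the admissible range contains the value α = 0.5 adopted there.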

Evaluation Metrics
Although the approximations of the Borda scores described in Sections 2.4 and 2.5 can be thought of as estimates of the semantic relatedness, we rely on rankings rather than scores to avoid inconsistency issues that frequently emerge in score comparisons (Ammar and Shah, 2011; Negahban et al., 2017). Kendall (1948) proposed the quite general form for a ranking correlation coefficient

Γ = Σ_{i,j} a_ij b_ij / [Σ_{i,j} a_ij² Σ_{i,j} b_ij²]^(1/2),    (7)

where a_ij (resp., b_ij) is a matrix that depends on the first (second) ranking to be compared, with indices i, j running over all items. This definition contains Spearman's ρ (Spearman, 1961) as a particular case with a_ij = a_j − a_i and b_ij = b_j − b_i, while Kendall's τ (Kendall, 1938; Kruskal, 1958) is obtained with a_ij = sign(a_j − a_i) and b_ij = sign(b_j − b_i), where {a_i} and {b_i} are the rankings to be compared. In order to take into account the larger importance of top ranks in our context, we define weighted versions of ρ and τ, with larger weights assigned to top positions. Namely, we define

Γ_w = Σ_{i,j} w_ij a_ij b_ij / [Σ_{i,j} w_ij a_ij² Σ_{i,j} w_ij b_ij²]^(1/2),    (8)

with w_ij = (w_i^a w_j^a + w_i^b w_j^b)/2, where w_i is the normalized weight associated with the i-th position in the rankings and ρ_w, τ_w correspond to the same choices of a_ij, b_ij as above. These coefficients can be rewritten respectively as

ρ_w = Σ_{i,j} w_ij (a_j − a_i)(b_j − b_i) / [Σ_{i,j} w_ij (a_j − a_i)² Σ_{i,j} w_ij (b_j − b_i)²]^(1/2),
τ_w = Σ_{i,j} w_ij sign(a_j − a_i) sign(b_j − b_i) / N_w,    (9)

where N_w = Σ_{i,j} w_ij is a normalization factor, which corresponds to 1 − Σ_i w_i² in the absence of ties; these metrics have been emerging, albeit with some notation differences, as extensions of the ρ and τ coefficients to take into account the larger importance of top ranks (Pinto da Costa and Soares, 2005; Dancelli et al., 2013; Vigna, 2015). Different weighting schemes have been proposed in the literature (Dancelli et al., 2013; Vigna, 2015); here we adopt the additive scheme with w_i^a = f(a_i) and w_i^b = f(b_i), where f(n) is a monotonically decreasing function, in view of its ability to discriminate different rankings even when they only differ by the exchange of a top rank and a low rank (Dancelli et al., 2013).
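Kendall's general form, and how it specializes to Spearman's ρ and Kendall's τ, can be illustrated with a short sketch (a toy O(N²) implementation; the function and lambda names are illustrative):

```python
import math

def gamma_coefficient(a, b, score):
    """Kendall's general correlation coefficient.

    a_ij = score(a, i, j) and b_ij = score(b, i, j); the result is
    sum(a_ij * b_ij) normalized by the product of the two norms.
    """
    n = len(a)
    num = sa = sb = 0.0
    for i in range(n):
        for j in range(n):
            aij, bij = score(a, i, j), score(b, i, j)
            num += aij * bij
            sa += aij ** 2
            sb += bij ** 2
    return num / math.sqrt(sa * sb)

diff = lambda r, i, j: r[j] - r[i]                   # recovers Spearman's rho
sgn = lambda r, i, j: (r[j] > r[i]) - (r[j] < r[i])  # recovers Kendall's tau

a, b = [1, 2, 3, 4], [1, 3, 2, 4]
rho = gamma_coefficient(a, b, diff)  # 0.8, matching 1 - 6*sum(d^2)/(n(n^2-1))
tau = gamma_coefficient(a, b, sgn)   # 2/3: one discordant pair out of six
```

The same skeleton accommodates the weighted variants by multiplying each term by a pair weight.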
A common choice is f(n) = 1/n (Dancelli et al., 2013; Vigna, 2015); however, in the large-N_items limit, it causes the divergence of the denominator in Equation 9 and thus makes every w_i negligible. This phenomenon is responsible for the decreased sensitivity to top ranks, observed by Dancelli et al. (2013), in the case of long rankings. For this reason, we prefer to use f(n; n_0) = 1/(n + n_0)², where the offset n_0 has been introduced to control the weight fraction associated with the first rank in the large-N_items limit, i.e., R(n_0) = f(1; n_0)/Σ_{n=1}^∞ f(n; n_0), which can be expressed as R(n_0) = 1/[(n_0 + 1)² ψ^(1)(n_0 + 1)], where ψ^(1)(x) is the first derivative of the digamma function. With this choice, both ρ_w and τ_w defined in Equation 8 represent a family of correlation coefficients depending on the value of n_0, whose choice depends on the particular task (namely, on the relative importance of the first rank). The value n_0 = 0 causes an extremely high sensitivity to the first rank (R(0) ≈ 0.61), which may be excessive; hereafter, we therefore focus on the value n_0 = 2, which appears to be a reasonable trade-off (R(2) ≈ 0.28) that allows focusing on the first rank while avoiding neglecting other ranks.
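The weight fraction R(n_0) can be evaluated numerically without special functions, since the trigamma value is well approximated by a partial sum plus an integral tail correction (a minimal sketch; the function name is hypothetical):

```python
def first_rank_weight_fraction(n0, n_terms=100_000):
    """R(n0) = f(1; n0) / sum_n f(n; n0) with f(n; n0) = 1/(n + n0)^2.

    The infinite sum is truncated after n_terms terms; the tail is
    approximated by its integral, 1/(n_terms + n0).
    """
    total = sum(1.0 / (n + n0) ** 2 for n in range(1, n_terms + 1))
    total += 1.0 / (n_terms + n0)  # integral tail correction
    return (1.0 / (1 + n0) ** 2) / total

# R(0) ~ 6/pi^2 ~ 0.61: the first rank dominates excessively
# R(2) ~ 0.28: still emphasizes the first rank without neglecting the others
```

The two values quoted in the text are reproduced to within the quoted precision.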
The metrics ρ_w and τ_w are suitable to compare rankings whenever top ranks are particularly important; in particular, they can be used to evaluate a semantic model using a dataset produced as described in this paper.

Evaluation of the Data-Collection Framework
The collection of human annotations to construct a domain-specific dataset is a resource-consuming process, even within the proposed optimized data collection approach, whose person-hours cost can be estimated as

C ≈ N_comp t̄_comp,    (10)

where t̄_comp is the average time needed for a single comparison. For this reason, in Section 4.1, we define a stochastic model for semantic pairwise comparisons, which can be used to simulate the voting before the collection of human annotations, e.g., for checking or tuning the parameters of the data collection approach. This stochastic model will be used in Section 4.2 to compare the effectiveness of the adaptive and the uniform approaches, using the metrics defined in Section 3.

Semantic Pairwise Comparisons
We want to model N_voters voters who are presented with N_comp pairwise comparisons and asked to identify, in each comparison, the item containing the most semantically related tokens. The model will be used to reconstruct an approximate ranking of the items. For the sake of mathematical simplicity, we first focus on similarity-driven comparisons, where the similarity z takes values in the symmetric interval [−1, 1], with z = 1, 0, and −1 corresponding respectively to synonyms, unrelated tokens, and antonyms. The model will eventually be adapted to semantic relatedness by using the fact that, since antonyms correspond to semantically related tokens (Cai et al., 2010; Harispe et al., 2015), the absolute value |z| is a reasonable proxy for semantic relatedness.

Figure 3: We represent the three underlying similarity distributions described in Section 4.1.1 and the two relatedness distributions described in Section 2.2; relatedness is quantified by |cos θ|, where θ is the angle between the corresponding vectors in the embedding.

Similarity-Driven Comparisons.
A convenient way to model similarity-driven pairwise comparisons assumes the existence of an underlying (unknown) similarity distribution {z_i}, which determines the theoretical ranks of the items, which can in turn be compared with the ranks estimated via the model. We consider here three examples: (i) an exponential distribution z_i = 2 exp(−i/N_items) − 1, (ii) a power law z_i = 2/(1 + i/N_items) − 1, and (iii) the distribution of the cosine similarity 8 between pairs of tokens in the word embedding described in Section 2.2; these distributions are represented in Figure 3.
A fundamental aspect to be considered in modelling similarity-driven pairwise comparisons is the task's subjectivity, as many potential linguistic, psychological, and social factors could introduce biases (Bakarov, 2018; Faruqui et al., 2016; Gladkova and Drozd, 2016). A possible approach to account for this problem is via a stochastic transitivity model, first introduced in the context of comparative judgment of physical stimuli by Thurstone (1927) (see also Cattelan, 2012; Ennis, 2016); this model describes the opinion o_i^(v) of voter v on item i as o_i^(v) = z_i + σ_v η_i^(v), where η_i^(v) is a Gaussian-distributed random variable with zero mean and unit variance, while σ_v represents the nonconformity amplitude, i.e., the discrepancy between the voter's opinion and the underlying similarity. 9 Here we define a modified version of the Thurstonian model, with a stochastic amplitude σ_v depending on the underlying similarity:

o_i^(v) = F(z_i + σ_v(z_i) η_i^(v)),    (11)

where F(x) = max(−1, min(1, x)) has been introduced to enforce the constraint −1 ≤ o_i^(v) ≤ 1, analogous to the one discussed above for z. In the absence of a z_i dependence in the nonconformity amplitude, the probability P_out(z_i) of having z_i + σ_v η_i outside the interval [−1, 1] would tend to 1/2 as z_i approaches one of the boundaries of the interval, causing, due to the F constraint, the collapse of a relevant fraction of the opinions o_i to either −1 or 1. This degeneracy can be avoided with σ_v(z_i) proportional to 1 − z_i and 1 + z_i as z_i approaches 1 and −1, respectively. Here we consider the simplest form with these features, i.e., σ_v(z_i) = σ*_v (1 − z_i²), which makes particular sense in our context, where each item i represents a pair of tokens: the closer the similarity is to z_i = 1 (z_i = −1), the stronger the relation (opposition) between the tokens in the corresponding pair, and the stronger the expected agreement among the voters' opinions on their similarity.

8 As discussed by Faruqui et al. (2016), cosine similarity is typically considered a proxy of semantic similarity (Auguste et al., 2017; Banjade et al., 2015).
In order to increase the accuracy of the model, we introduce another source of randomness that represents the distraction level of the voter, i.e., the tendency, observed, e.g., by Bakarov (2018) and Bruni et al. (2014), to unintentionally vote for the item perceived as lower rank. This tendency is accounted for by assuming that, with probability ε_v (the probability of oversight for voter v), the result of a pairwise comparison presented to voter v is the item with the lowest, rather than the highest, opinion o_i^(v).

The proposed model depends on the underlying similarity distribution 10, on the number of voters, on the random variables η_i^(v), and on the voter-distinctive parameters σ*_v and ε_v, whose distribution could be experimentally determined by analyzing human voting. In the absence of such an analysis, it seems reasonable to draw σ*_v and ε_v uniformly from ranges covering one order of magnitude to encompass human variability; heuristic upper bounds are ε_v ≲ 0.05 and σ*_v ≲ 0.2, as oversights are supposed to be rare, and the probability P_out(0) that two completely unrelated tokens (z_i = 0) are deliberately considered maximally related (o_i^(v) = ±1) should be extremely low: the aforementioned bound corresponds indeed to P_out(0) ≲ 0.001%.

9 Contrary to the original formulation, no covariance terms are present here, as the voters are supposed to be non-interacting. Moreover, in the original formulation, voters and items are respectively referred to as judges and stimuli.

10 However, as shown in Table 2, the dependence is mild.

Relatedness-Driven Comparisons.
As discussed in Section 4.1, we consider the absolute value of the similarity as a proxy of relatedness. The model defined in Section 4.1.1 is thus extended to relatedness-driven comparisons by (i) including an absolute value in Equation 11, so that o_i^(v) = |F(z_i + σ_v(z_i) η_i^(v))|, and (ii) defining the theoretical rank of item i according to |z_i|.
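A single relatedness-driven vote under this model can be sketched as follows (an illustration assuming the amplitude σ_v(z) = σ*_v(1 − z²), which has the limiting behavior required in Section 4.1.1, and an oversight that swaps the outcome with probability ε_v; all names are hypothetical):

```python
import random

def simulate_vote(z_a, z_b, sigma_star, eps, rng):
    """One relatedness-driven pairwise vote under the stochastic model.

    Each opinion is z + sigma_star * (1 - z**2) * eta, clipped to
    [-1, 1] by F, with the absolute value taken as the relatedness
    proxy; with probability eps the voter overlooks and picks the
    item perceived as lower rank. Returns 0 if the first item wins.
    """
    def opinion(z):
        eta = rng.gauss(0.0, 1.0)
        o = z + sigma_star * (1.0 - z * z) * eta
        return abs(max(-1.0, min(1.0, o)))  # |F(...)| for relatedness

    winner = 0 if opinion(z_a) >= opinion(z_b) else 1
    if rng.random() < eps:  # oversight: vote for the perceived loser
        winner = 1 - winner
    return winner

rng = random.Random(42)
votes = [simulate_vote(0.9, 0.1, sigma_star=0.1, eps=0.02, rng=rng)
         for _ in range(1000)]
# the clearly more related pair (z = 0.9) wins the large majority
```

Repeating such votes over the pairs generated by the uniform or adaptive schemes yields the simulated collections analyzed in Section 4.2.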

Results
We estimated the accuracy of a data collection approach by comparing, via the metrics defined in Section 3, the ranking that it produces with the underlying theoretical ranks. We considered the semantic area described in Section 2.2, containing 990 items, and we simulated a relatedness-driven data collection based on (i) the adaptive approach described in Section 2.5, with N_comp = 39000, M = 20, α = 0.5, n_b = 7, and (ii) the uniform approach described in Section 2.4, with the same total number of comparisons. The voting was simulated with the stochastic model described in Section 4.1, with N_voters = 100 and based on all three discussed distributions for the underlying similarity; for each voter v, the nonconformity level σ*_v and the probability of oversight ε_v were drawn as described in Section 4.1.1. Each simulation was repeated 50 times and was run on a local machine equipped with an Intel Core i7-7700HQ (2.80GHz x8), with average runtimes of 30.9 s and 33.2 s respectively for the adaptive and the uniform approaches.

The results of the simulations are presented in Table 2, which contains, as measures of the accuracy of the proposed approaches, the ρ_w and τ_w coefficients defined in Equation 8 and discussed in Section 3; in order to check the overall rank accuracy, we also report the standard Spearman's ρ and Kendall's τ coefficients. For each coefficient, we report the average value and the unbiased estimator of the standard deviation over the 50 simulations. The adaptive approach, compared with the uniform approach, determines a relevant increase in both ρ_w and τ_w for all of the underlying similarity distributions considered, with no relevant changes in the overall rank precision measured by ρ and τ. Moreover, the results suggest that the proposed stochastic model is robust to changes in the underlying similarity distribution.
Figure 4 displays the scores x_i^(k) calculated in the first 5 ballots and the final approximation ȳ_i, obtained in a simulation based on the adaptive approach with exponential underlying similarity and the parameters described above; the figure clearly shows that, as desired, the precision of ȳ_i is substantially higher for top ranks.

Conclusion & Future Work
In this paper, we provided a protocol for the construction, based on adaptive pairwise comparisons and tailored to the available resources, of a dataset which can be used to test or validate any relatedness-based domain-specific semantic model and which is optimized to be particularly accurate in top-rank evaluation. Moreover, we defined the metrics ρ_w and τ_w, extensions of well-known ranking correlation coefficients, to evaluate a semantic model via the aforementioned dataset by taking into account the greater significance of top ranks. Finally, we defined a stochastic transitivity model to simulate semantic-driven pairwise comparisons, which allows tuning the parameters of the data collection approach and which confirmed a significant increase in the performance metrics ρ_w and τ_w of the proposed adaptive approach compared with the uniform approach (see Table 2).
As future work, we plan to collect human annotations (i) to test the proposed data collection approach on real data and (ii) to assess the validity and estimate the parameters of the proposed stochastic transitivity model. Additional future investigations may include a deeper analysis of the mathematical and statistical properties of the weighted coefficients ρ_w and τ_w, as well as a rigorous derivation of the optimal values for the parameters of the data collection approach.