Efficient Online Scalar Annotation with Bounded Support

We describe a novel method for efficiently eliciting scalar annotations, both for dataset construction and for estimating system quality from human judgments. We contrast direct assessment (annotators assign scores to items directly), online pairwise ranking aggregation (scores are derived from annotators' comparisons of items), and a hybrid approach proposed here (EASL: Efficient Annotation of Scalar Labels). Our proposal leads to increased correlation with ground truth at far greater annotator efficiency, suggesting this strategy as an improved mechanism for dataset creation and manual system evaluation.


Introduction
We are concerned here with the construction of datasets and the evaluation of systems within natural language processing (NLP). Specifically, we consider settings in which humans provide responses that are used to derive graded values for natural language contexts, or to order systems according to their perceived performance on some task.
Many NLP datasets involve eliciting some graded response from annotators. The most popular annotation scheme is the n-ary ordinal approach, as illustrated in Figure 1(a). For example, text may be labeled for sentiment as positive, neutral, or negative (Wiebe et al., 1999; Pang et al., 2002; Turney, 2002, inter alia); or under political spectrum analysis as liberal, neutral, or conservative (O'Connor et al., 2010; Bamman and Smith, 2015). A response may correspond to a likelihood judgment, e.g., how likely a predicate is factive (Lee et al., 2015), or that some natural language inference may hold (Zhang et al., 2017). Responses may correspond to a notion of semantic similarity, e.g., whether one word can be substituted for another in context (Pavlick et al., 2015), or whether an entire sentence is more or less similar than another (Marelli et al., 2014), and so on.

Figure 1: Elicitation strategies for graded response include direct assessment via ordinal or scalar judgments, and pairwise comparisons aggregated under an assumption of latent distributions such as Gaussians, or, novel here, beta distributions, providing bounded support. The example concerns subjective assessments of the lexical frequency of dog. In pairwise comparison, we assess it via comparisons such as: "burrito" is less frequent (≺) than "dog".
Less common in NLP are system comparisons based on direct human ratings, but an exception is the annual shared task evaluation of the Conference on Machine Translation (WMT). There, MT practitioners submit system outputs for a shared set of source sentences, which are then judged relative to other system outputs. Various aggregation strategies have been employed over the years to take these relative comparisons and derive competitive rankings among shared task entrants (Callison-Burch et al., 2012; Bojar et al., 2013, 2014, 2015, 2016, 2017).
Inspired by prior work in MT system evaluation, we propose a procedure for eliciting graded responses that we demonstrate to be more efficient than prior work. While remaining applicable to system evaluation, our experimental results suggest our approach as a more general framework for a variety of future data creation tasks, allowing for higher quality data in less time and cost.
We consider three different approaches to scalar annotation: direct assessment (DA), online pairwise ranking aggregation (RA), and a hybrid method which we call EASL (Efficient Annotation of Scalar Labels). 1 DA scalar annotation, shown in Figure 1(b), directly elicits absolute judgments on some scale (e.g., 0 to 100), independently per item (§2). As an RA approach (§3), we start with conventional unbounded models, where each instance is parameterized as a Gaussian distribution, as shown in Figure 1(c). Since boundedness is essential for the scalar annotation we aim to model, we propose a bounded variant which parameterizes each instance by a beta distribution, as illustrated in Figure 1(d). Finally, we propose EASL (§4), which combines the benefits of DA and RA.
We illustrate the improvements enabled by our proposal on three example tasks (§5): lexical frequency inference, political spectrum inference, and machine translation system ranking. 2 For example, we find that in the commonly employed condition of 3-way redundant annotation, our approach gives similar quality on multiple tasks with just 2-way redundancy: this translates to a potential 50% increase in dataset size for the same cost.

Direct Assessment
Direct assessment or direct annotation (DA) is a straightforward method for collecting graded responses from annotators. The most popular scheme is n-ary ordinal labeling, as illustrated in Figure 1(a), where annotators are shown one instance (i.e., sample point) and asked to assign it one of n ordered classes.
According to the levels of measurement in psychometrics (Stevens, 1946, inter alia), which classify scales based on certain properties (e.g., identity, order, quantity), ordinal data do not express degree of difference. Namely, there is no guarantee that the distance between labels is equal, and instances within the same class are not discriminated. For example, in a typical five-level Likert scale (Likert, 1932) of likelihood (very unlikely, unlikely, unsure, likely, very likely), we cannot conclude that very likely instances are exactly twice as likely as those marked likely, nor can we assume that two instances with the same label have exactly the same likelihood.
The issue of distance between ordinals is perhaps obviated by using scalar annotations (i.e., a ratio scale in Stevens's terminology), which directly correspond to continuous quantities (Figure 1(b)). In scalar DA, 3 each instance in the collection (S_i ∈ S_1^N) is annotated with a value (e.g., on the range 0 to 100), often by several annotators. The notion of quantitative difference is enabled by the property of absolute zero: the scale is bounded. For example, distance, length, mass, size, etc. are represented on this scale. In the annual shared task evaluation of WMT, DA has been used for scoring the adequacy and fluency of machine translation system outputs in human evaluation (Graham et al., 2013, 2014; Bojar et al., 2016, 2017), and it has separately been used in creating datasets such as for factuality (Lee et al., 2015).
Why perhaps obviated? Because of two concerns: (1) annotators may not have a pre-existing, well-calibrated scale for performing DA on a particular collection for a particular task; 4 and (2) it is known that people may be biased in their scalar estimates (Tversky and Kahneman, 1974). Concern (1) motivates us to consider RA, on the intuition that annotators may give more calibrated responses when judging in the context of other elements. Regarding (2), our goal is not to correct for human bias, but simply to converge more efficiently to the same consensus judgments already being pursued by the community in their annotation protocols, biased or otherwise. 5

Online Pairwise Ranking Aggregation

Unbounded Model
Pairwise ranking aggregation (Thurstone, 1927) is a method for obtaining a total ordering over instances, assuming that the scalar value of each sample point follows a Gaussian distribution, N(µ_i, σ²). The parameters {µ_i} are interpreted as the mean scalar annotations. 6 Given the parameters, the probability that S_i is preferred (≻) over S_j is defined as

P(S_i ≻ S_j) = Φ((µ_i − µ_j) / (√2 σ))

where Φ(·) is the cumulative distribution function of the standard normal distribution. The objective of pairwise ranking aggregation (including all the following models) is formulated as maximum log-likelihood estimation over the observed comparisons:

max_µ Σ_{(i ≻ j) observed} log P(S_i ≻ S_j)

TrueSkill™ (Herbrich et al., 2006) extends the Thurstone model by applying a Bayesian online and active learning framework, allowing for ties. TrueSkill has been used in the Xbox Live online gaming community, 7 and has been applied to various NLP tasks, such as question difficulty estimation (Liu et al., 2013), ranking speech quality (Baumann, 2017), and ranking machine translation and grammatical error correction systems with human evaluation (Bojar et al., 2014, 2015; Sakaguchi et al., 2014, 2016).

As in the Thurstone model, TrueSkill assumes that the scalar value of each instance S_i (i.e., the skill level of each player, in TrueSkill's context) follows a Gaussian distribution N(µ_i, σ_i²), where σ_i additionally parameterizes the uncertainty of the scalar value for that instance. Importantly, TrueSkill uses a Bayesian online learning scheme: the parameters are iteratively updated after each observed pairwise comparison (i.e., game result: win (≻), tie (≡), or loss (≺)), in proportion to how surprising the outcome is. Let t_ij = µ_i − µ_j be the difference in scalar responses (skill levels) when we observe that i wins over j, and let ε ≥ 0 be a parameter specifying the tie rate. The update functions are formulated as follows:

6 Thurstone and another popular ranking method by Elo (1978) use a fixed σ for all instances.
7 www.xbox.com/live/

µ_i ← µ_i + (σ_i²/c) · v(t_ij/c, ε/c)   (3)
µ_j ← µ_j − (σ_j²/c) · v(t_ij/c, ε/c)   (4)

where c² = 2γ² + σ_i² + σ_j², and v is a multiplicative factor that determines the amount of change (the surprisal of the outcome) in µ. In the accumulation of the variances (c²), another free parameter called the "skill chain", γ, indicates the width (difference) of skill levels at which the stronger of two given players has 0.8 (80%) probability of winning. The multiplicative factor depends on the observation (win or tie):

v_{i≻j}(t, ε) = ϕ(t − ε) / Φ(t − ε)
v_{i≡j}(t, ε) = (ϕ(−ε − t) − ϕ(ε − t)) / (Φ(ε − t) − Φ(−ε − t))

where ϕ(·) is the probability density function of the standard normal distribution. As shown in Figure 2 (a) and (b), v_{i≻j} increases exponentially as t becomes smaller (i.e., the observation is unexpected), whereas v_{i≡j} approaches zero as |t| approaches zero. In short, v becomes larger as the outcome becomes more surprising. To update the variances (σ²), another set of update functions is used:

σ_i² ← σ_i² · (1 − (σ_i²/c²) · w)
σ_j² ← σ_j² · (1 − (σ_j²/c²) · w)

where w serves as a multiplicative factor determining the amount of change in σ²:

w_{i≻j}(t, ε) = v_{i≻j} · (v_{i≻j} + t − ε)
w_{i≡j}(t, ε) = v_{i≡j}² + ((ε − t) · ϕ(ε − t) + (ε + t) · ϕ(ε + t)) / (Φ(ε − t) − Φ(−ε − t))
As shown in Figure 2 (c) and (d), the value of w is between 0 and 1. The underlying idea of the variance updates is that they always decrease the variances σ², which means the uncertainty about the instances (S_i, S_j) always decreases as we observe more pairwise comparisons. In other words, TrueSkill becomes more confident in the current estimates of µ_i and µ_j. Further details are provided by Herbrich et al. (2006). 8

Another important property of TrueSkill is "match quality" (chance to draw). Match quality helps select competitive players, making games more interesting. More broadly, match quality enables us to choose similar instances to compare, maximizing the information gained from pairwise comparisons, as in the active learning literature (Settles et al., 2008). The match quality between two instances (players) is computed as follows:

q(S_i, S_j) = √(2γ²/c²) · exp(−(µ_i − µ_j)² / (2c²))   (11)

Intuitively, match quality is based on the difference µ_i − µ_j: as the difference becomes smaller, match quality rises, and vice versa.

As mentioned, TrueSkill has been used in NLP tasks to infer continuous values for instances. However, it is important to note that the support of a Gaussian distribution is unbounded, namely R = (−∞, ∞). This does not satisfy the property of absolute zero of scalar annotation in the levels of measurement (§2), and it becomes problematic when annotating a scalar (continuous) value for extremes such as extremely positive or negative sentiment. We address this issue by proposing a novel variant of TrueSkill in the next section.
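The win-case update and the match quality above can be sketched in a few lines of Python. This is a minimal sketch, not the full TrueSkill implementation: the tie-case factors are omitted, and the γ and ε values are illustrative defaults rather than values from the paper.

```python
from math import erf, exp, pi, sqrt

def norm_pdf(x):
    # Probability density of the standard normal distribution.
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def norm_cdf(x):
    # Cumulative distribution of the standard normal distribution.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def v_win(t, eps):
    # Multiplicative factor for a win: grows as the outcome gets more surprising.
    return norm_pdf(t - eps) / norm_cdf(t - eps)

def w_win(t, eps):
    # Variance-shrink factor for a win; always between 0 and 1.
    v = v_win(t, eps)
    return v * (v + t - eps)

def trueskill_update(mu_i, s2_i, mu_j, s2_j, gamma=0.1, eps=0.1):
    # One online update after observing that S_i beats S_j.
    c2 = 2.0 * gamma**2 + s2_i + s2_j
    c = sqrt(c2)
    t = (mu_i - mu_j) / c
    v = v_win(t, eps / c)
    w = w_win(t, eps / c)
    mu_i += (s2_i / c) * v          # winner's mean moves up
    mu_j -= (s2_j / c) * v          # loser's mean moves down
    s2_i *= 1.0 - (s2_i / c2) * w   # uncertainty only ever shrinks
    s2_j *= 1.0 - (s2_j / c2) * w
    return mu_i, s2_i, mu_j, s2_j

def match_quality(mu_i, s2_i, mu_j, s2_j, gamma=0.1):
    # Chance-to-draw style match quality: highest for close, uncertain pairs.
    c2 = 2.0 * gamma**2 + s2_i + s2_j
    return sqrt(2.0 * gamma**2 / c2) * exp(-((mu_i - mu_j) ** 2) / (2.0 * c2))
```

Note how an upset between two equally rated players moves both means apart, shrinks both variances, and how match quality decays with the gap between means.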

Bounded Variant
TrueSkill can induce a continuous spectrum of instances (such as skill level of game players) by assuming that each instance is represented as a Gaussian distribution. However, the Gaussian distribution has unbounded support, namely R = (−∞, ∞), which does not satisfy the property of absolute bounds for appropriate scalar annotation (i.e., ratio scale in the level of measurement).
Thus, we propose a variant of TrueSkill, changing the latent distribution from a Gaussian to a beta and using a heuristic algorithm based on TrueSkill for inference. The beta distribution has natural [0, 1] upper and lower bounds and a simple parameterization:

S_i ∼ B(α_i, β_i)

We choose the scalar response to be the mode M[S_i] of the distribution and the uncertainty to be its variance: 9

M[S_i] = (α_i − 1) / (α_i + β_i − 2)   (12)
Var[S_i] = α_i β_i / ((α_i + β_i)² (α_i + β_i + 1))   (13)

Regarding the probability of a pairwise comparison between instances, we follow Bradley and Terry (1952) and Rao and Kupper (1967) in describing the chance of a win, tie, or loss:

p(S_i ≻ S_j) = π_i / (π_i + θ π_j)
p(S_i ≺ S_j) = π_j / (θ π_i + π_j)
p(S_i ≡ S_j) = (θ² − 1) π_i π_j / ((π_i + θ π_j)(θ π_i + π_j))

where ε ≥ 0 is a parameter specifying the tie rate, θ = exp(ε), and π is an exponential score function of S: π_i = exp(M_i).

As in TrueSkill, we iteratively update the parameters of the instances B(α, β) after each observation, according to how surprising it is. Analogously to Eqns. (3) and (4), we choose the update functions as follows. 10 First, in case an annotator judges that S_i is preferred to S_j (S_i ≻ S_j):

α_i ← α_i + (1 − p(S_i ≻ S_j))
β_j ← β_j + (1 − p(S_i ≻ S_j))

In case of a tie with |D| > ε and M_i > M_j, where D = M_i − M_j:

β_i ← β_i + (1 − p(S_i ≡ S_j))
α_j ← α_j + (1 − p(S_i ≡ S_j))

And in case of a tie with |D| ≤ ε, for both S_i and S_j:

α ← α + (1 − p(S_i ≡ S_j))
β ← β + (1 − p(S_i ≡ S_j))

9 We might instead have used the mean (E[S_i] = α_i / (α_i + β_i)) of the distribution; in a beta with α, β > 1 the mean is always closer to 0.5 than the mode, whereas mean and mode always coincide in a Gaussian. The mode was selected owing to better performance in development.
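The mode, variance, and Rao-Kupper comparison probabilities above translate directly into code. A minimal sketch (function names are ours; the ε default is illustrative):

```python
from math import exp

def beta_mode(a, b):
    # Mode of Beta(a, b), defined for a, b > 1; used as the scalar response.
    return (a - 1.0) / (a + b - 2.0)

def beta_var(a, b):
    # Variance of Beta(a, b); used as the uncertainty.
    return a * b / ((a + b) ** 2 * (a + b + 1.0))

def rao_kupper_probs(m_i, m_j, eps=0.1):
    # Win / tie / loss probabilities from the modes, following the
    # Bradley-Terry / Rao-Kupper model with tie parameter eps >= 0.
    theta = exp(eps)
    pi_i, pi_j = exp(m_i), exp(m_j)
    p_win = pi_i / (pi_i + theta * pi_j)
    p_loss = pi_j / (theta * pi_i + pi_j)
    p_tie = 1.0 - p_win - p_loss   # equals (theta^2 - 1) pi_i pi_j / denominator
    return p_win, p_tie, p_loss
```

The three probabilities sum to one by construction, and the tie mass vanishes as ε approaches zero.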
It is important to note that α and β never decrease (because 1 − p ≥ 0, as shown in Figure 3), which preserves the property that variance (uncertainty) always decreases as we observe more judgments, as in TrueSkill (§3.1). In addition, we do not need separate update functions for µ and σ², since the mode and variance of the beta distribution both depend on the two shared parameters α and β (Eqns. 12 and 13).
Regarding match quality, we use the same formulation as TrueSkill (Eqn. 11), except that the bounded model uses M instead of µ:

q(S_i, S_j) = √(2γ²/c²) · exp(−(M_i − M_j)² / (2c²)), with c² = 2γ² + Var[S_i] + Var[S_j].

Efficient Annotation of Scalar Labels
In the previous section, we proposed a bounded online ranking aggregation model for scalar annotation. However, the amount of update from a pairwise judgment depends only on the distance between instances, not on the distance from the bounds (i.e., 0 and 1).

Figure 4: Illustrative example of the EASL protocol. Each instance is represented as a beta distribution. Instances are chosen for annotation according to variance and match quality, and the parameters are updated iteratively.

To integrate this property into the online ranking aggregation model,
we propose EASL (Efficient Annotation of Scalar Labels), which combines benefits from both direct assessment (DA) and the bounded online ranking aggregation model (RA). 11 Similarly to RA, EASL parameterizes each instance by a beta distribution (Eqns. 12 and 13), and the parameters are inferred with a computationally efficient, easy-to-implement heuristic. The difference from RA is the type of annotation: whereas RA asks for a discrete pairwise judgment (≻, ≺, ≡) between S_i and S_j, here we directly ask for scalar values (denoted s_i and s_j), as in DA. Thus, given an annotated score s_i normalized to [0, 1], we change the update functions as follows:

α_i ← α_i + s_i
β_i ← β_i + (1 − s_i)

This procedure may look similar to DA, where s_i is simply accumulated and averaged at the end. However, there are two differences. First, as illustrated in Figure 4, EASL parameterizes each instance as a probability distribution, while DA does not. Second, DA elicits annotations independently per element, whereas EASL elicits annotations on elements in the context of other elements selected jointly according to match quality.
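A minimal sketch of this update, starting from a uniform Beta(1, 1) prior (the function names are ours):

```python
def easl_update(a, b, s):
    # One EASL update for a scalar annotation s in [0, 1]: the score is
    # split between the pseudo-counts, so a and b never decrease and the
    # variance of Beta(a, b) shrinks with every annotation.
    return a + s, b + (1.0 - s)

def mode(a, b):
    # Mode of Beta(a, b) for a, b > 1, used as the scalar response.
    return (a - 1.0) / (a + b - 2.0)

# Starting from the uniform prior Beta(1, 1), repeated annotations of 0.8
# drive the mode toward 0.8 while the pseudo-count mass (confidence) grows.
a, b = 1.0, 1.0
for _ in range(10):
    a, b = easl_update(a, b, 0.8)
```

After these ten updates the parameters are (9.0, 3.0), whose mode is exactly 0.8: each annotation contributes one unit of pseudo-count mass, so agreement among annotators concentrates the distribution.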
Further, DA generally uses a batch-style annotation scheme, where the number of annotations per instance is independent of the latent scalar values. EASL, on the other hand, uses online learning, which feeds into the calculation of match quality. This allows us to choose which instances to annotate at each iteration.

Experiments
To compare different annotation methods, we conduct three experiments: (1) lexical frequency inference, (2) political spectrum inference, and (3) human evaluation for machine translation systems.
In all experiments, data collection is conducted through Amazon Mechanical Turk (AMT). We recruit annotators who meet the following minimum requirements: 12 living in the US, an overall approval rate above 98%, and more than 500 approved tasks.
The experimental setting for DA is straightforward. We ask annotators to annotate a scalar value for each instance, one item at a time. We collect ten annotations for each instance to see the relation between the number of annotations and accuracy (i.e., correlation).
To set up the online updates in RA and EASL, we use a partial ranking framework with scalars, where annotators are asked to rank and score n instances at a time, as illustrated in Figure 5. In all three experiments, we fix n = 5. Each partial ranking yields n(n−1)/2 pairwise comparisons for RA and n scalar values for EASL. 13 It is important to note that we can simultaneously retrieve pairwise comparisons (for RA) and scalar values (for EASL) from a single partial ranking.

12 In all experiments, we set the reward for a single instance at $0.01 (i.e., $0.05 per HIT in RA and EASL). This is $8/hour, assuming that annotating one instance takes five seconds. Prior to annotation, we ran a pilot to make sure that participants understood the task and that the instructions were clear.
13 The partial ranking can be regarded as mini-batching.
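One way to expand a single partial-ranking HIT into the pairwise judgments consumed by RA is sketched below (the function name and the +1/0/−1 sign convention are ours, not from the paper):

```python
from itertools import combinations

def partial_ranking_pairs(batch_scores):
    # Expand one partial-ranking HIT (a dict: instance id -> scalar score
    # for the n items shown together) into the n*(n-1)/2 pairwise outcomes
    # consumed by RA: +1 if i beat j, -1 if j beat i, 0 for a tie.
    pairs = []
    for i, j in combinations(batch_scores, 2):
        d = batch_scores[i] - batch_scores[j]
        pairs.append((i, j, (d > 0) - (d < 0)))
    return pairs
```

With n = 5 items per HIT, each HIT yields ten pairwise outcomes for RA while the five raw scores feed EASL directly.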
Algorithm 1: Online pairwise ranking aggregation with bounded support. In each iteration, n instances are selected by variance and match quality: we first select the top k (= N/n) instances according to variance, and for each selected instance we choose the other n − 1 instances to be compared based on match quality.

This approach has been used in the NLP community for tasks such as assessing machine translation quality (Bojar et al., 2014, 2015, 2016; Sakaguchi et al., 2014) to collect pairwise judgments efficiently. The detailed procedure of iterative parameter updates in RA and EASL is described in Algorithm 1. As mentioned in Section 4, the main difference between RA and EASL is the update functions (line 7).
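The selection step of Algorithm 1 might be sketched as follows. This is a simplification under stated assumptions: partners are chosen deterministically as the n − 1 highest match-quality candidates, each instance's beta is summarized as a (mode, variance) pair, and the γ value is illustrative.

```python
from math import exp, sqrt

def select_batches(items, n=5, gamma=0.1):
    # One iteration of instance selection, sketching Algorithm 1's
    # selection step. `items` maps instance id -> (mode, variance) of its
    # beta distribution. Pick the k = N/n highest-variance instances as
    # anchors, then give each anchor the n-1 partners with the highest
    # match quality.
    def quality(a, b):
        (m1, v1), (m2, v2) = items[a], items[b]
        c2 = 2.0 * gamma**2 + v1 + v2
        return sqrt(2.0 * gamma**2 / c2) * exp(-((m1 - m2) ** 2) / (2.0 * c2))

    k = len(items) // n
    anchors = sorted(items, key=lambda i: items[i][1], reverse=True)[:k]
    batches = []
    for a in anchors:
        others = [i for i in items if i != a]
        partners = sorted(others, key=lambda o: quality(a, o), reverse=True)[:n - 1]
        batches.append([a] + partners)
    return batches
```

Anchoring on high-variance instances targets annotation where the model is least certain, while match quality keeps each batch competitive.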
Model hyper-parameters in RA and EASL are set as follows: each instance is initialized as α_i^init = 1.0, β_i^init = 1.0. The skill chain parameter γ and the tie-rate parameter ε are both set to 0.1. 14

Lexical Frequency Inference
In the first experiment, we compare the three scalar annotation approaches on lexical frequency inference, in which we ask annotators to judge the frequency (from very rare to very frequent) of verbs randomly selected from the Corpus of Contemporary American English (COCA). 15 We include this task for evaluation owing to its non-subjective ground truth (relative corpus frequency), which can be used as an oracle response we would like to maximally correlate with. 16

We randomly select 150 verbs from COCA; the log frequency (log10) is regarded as the oracle. In DA, each instance is annotated by 10 different annotators. 17 In RA and EASL, annotators are asked to rank/score five verbs per HIT (n = 5). Each iteration contains 20 HITs and we run 10 iterations, so the total number of annotations is the same for DA, RA, and EASL. 18

Figure 6 presents Spearman's and Pearson's correlations, indicating how accurately each annotation method obtains scalar values for each instance. Overall, for all three methods, the correlations increase as more annotations are made. The results also show that the RA and EASL approaches achieve high correlation more efficiently than DA. The gain in efficiency from DA to EASL is about 50%: two iterations of EASL achieve a Spearman's ρ close to that of three annotators in DA.

Figure 7 presents the final scalar values produced by each method. The histograms show that, overall, the three methods successfully capture the latent distribution of scalar values in the data.

Figure 8 shows the dynamic change of match quality. In the beginning (iteration 0), all instances are equally competitive, because we have no information about them and initialize them with the same parameters. As iterations go on, instances along the diagonal have higher match quality, indicating that competitive matches are more likely to be selected for the next iteration. In other words, match quality helps choose informative pairs to compare at each iteration, reducing the number of less informative annotations (e.g., a pairwise comparison between the highest and lowest instances).

15 https://www.wordfrequency.info/
16 Lexical frequency inference is an established experiment in (computational) psycholinguistics; e.g., human behavioral measures have been compared with predictability and bias in various corpora (Balota et al., 1999; Fine et al., 2014).
17 The agreement rate in DA (10 annotators) is 0.37 in Spearman's ρ. Considering the difficulty of ranking 150 verbs, this rate is fair.
18 Technically, the number of annotations per instance varies in RA and EASL, because they choose instances by match quality at each iteration.

Figure 7: Histograms of scalar values on lexical frequency obtained by each annotation scheme (direct assessment (DA), online ranking aggregation (RA), and EASL), and the oracle. The scalar annotations are put into five bins to show the overall distribution. The scalar in the oracle is normalized as log10(frequency(S_i)) / max log10(frequency(S)).
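The oracle normalization described in the Figure 7 caption is a simple computation; a sketch (the function name is ours):

```python
from math import log10

def oracle_scores(freqs):
    # Normalize raw corpus frequencies to [0, 1] as in the Figure 7
    # oracle: log10(frequency) divided by the maximum log10 frequency.
    logs = {w: log10(f) for w, f in freqs.items()}
    top = max(logs.values())
    return {w: v / top for w, v in logs.items()}
```

The most frequent verb maps to 1.0 and rarer verbs to proportionally smaller values on the log scale.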

Political Spectrum Inference
In the second experiment, we compare the three scalar annotation methods on political spectrum inference. We use the Fine-Grained Political Statements dataset (Bamman and Smith, 2015), which consists of 766 propositions collected from political blog comments, paired with judgments about the political belief of each statement (or of a person who would say it) on five ordinals: very conservative (-2), slightly conservative (-1), neutral (0), slightly liberal (1), and very liberal (2). We normalize the ordinal scores to between 0 and 1. The dataset contains mean scores aggregated over 7 annotations per proposition. 19

We randomly choose 150 political propositions from the dataset (see the oracle histogram in Figure 10). 20 The experimental setting (the number of annotations per instance, the number of iterations, and the number of HITs per iteration) is the same as in the lexical frequency inference experiment (§5.1).

Figure 9 shows Spearman's and Pearson's correlations to the oracle for each method. Overall, all three methods achieve strong correlations above 0.9. We also find that RA and EASL reach high correlation more efficiently than DA, as in the lexical frequency inference experiment (§5.1). The gain in efficiency from DA to EASL is about 50%: 4-way redundant annotation in EASL achieves a Spearman's ρ close to that of 6-way redundancy in DA. Figure 10 presents the annotated scalar values for each method.

19 We stress that the oracle here derives from subjective annotations: it does not necessarily reflect the true latent scalar values for each instance. However, in this experiment we use it as a tentative oracle in order to compare the three scalar annotation methods objectively.
20 The agreement rate in DA (among 10 annotators) is 0.67 in Spearman's ρ. This is quite high, considering the difficulty of ranking 150 instances in order.
The histograms show that DA and EASL successfully fit the distribution of the oracle, whereas RA converges to a rather narrow range. This is because of the "lack of distance from bounds" in RA explained in §4. We note that renormalizing the RA distribution would not address the issue: for instance, if the dataset contained only liberal propositions, RA would still fail to capture the latent distribution, because it looks only at relative distances between instances, not at distances from the bounds. Table 1 shows examples of scalar annotations by each method. Again, we see that the RA approach yields a narrower range than the oracle, DA, and EASL.
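The ordinal-to-unit-interval normalization used for this dataset (§5.2) is a trivial affine map; a sketch, assuming the five labels are coded −2 through 2 as above (the function name is ours):

```python
def normalize_ordinal(score, lo=-2.0, hi=2.0):
    # Affine map from an ordinal score in [lo, hi] (here -2 = very
    # conservative ... 2 = very liberal) onto the unit interval [0, 1].
    return (score - lo) / (hi - lo)
```

Under this map, very conservative (−2) becomes 0.0, neutral (0) becomes 0.5, and very liberal (2) becomes 1.0, matching the [0, 1] support of the beta parameterization.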

Ranking Machine Translation Systems
In the third experiment, we apply the scalar annotation methods to evaluating machine translation systems. This differs from the two previous experiments in that the main purpose is to rank the MT systems (S_1^N) rather than to score the adequacy (q) of each MT output for a given source sentence (m). Namely, we want to rank S_i by observing q_{i,m}.
We use the WMT16 German-English translation dataset (Bojar et al., 2016), which consists of 2,999 test-set sentences and translations from 10 different systems with DA annotations. Each sentence has an adequacy score between 0 and 100, and the average adequacy score is computed for each system for ranking. In this setting, annotators are asked to judge the adequacy of system outputs with the reference given. The official scores (made by DA) and ranking in WMT16 are used as the oracle in this experiment.
In this experiment, we replicate DA and run EASL to compare efficiency. We omit RA here, because it does not capture distance from the bounds, as shown in the previous experiment (§5.2). In DA, 33,760 translation outputs (3,376 sentences per system on average) are randomly sampled without replacement, to ensure that DA reaches the same result as the oracle when the entire dataset is used.
In EASL, we assume that the adequacy (q) of an MT output by system S_i for a given source sentence m is drawn from a beta distribution: q_{i,m} ∼ B(α_i, β_i). 21 Annotators are asked to judge the adequacy of system outputs by scoring between 0 and 100. Similarly to the previous experiments (§5.1 and §5.2), we use the partial ranking strategy, showing n = 5 system outputs (for the same source sentence) at a time. The procedure of parameter updates is the same as in the previous experiments (Algorithm 1).

We compare the correlations (Spearman's ρ) of the system ranking with respect to the number of annotations per system; the result is shown in Figure 11. As in the previous two experiments, EASL achieves higher Spearman's correlation on ranking MT systems with fewer annotations than the baseline method (DA), which means EASL is able to collect annotations more efficiently. This result shows that EASL can be applied to efficient system evaluation in addition to dataset curation.

21 This is the same setting as WMT14, WMT15, and WMT16 (Bojar et al., 2014, 2015, 2016), although those evaluations used TrueSkill (Gaussian) instead of EASL to rank systems.

Figure 11: Spearman's correlation on ranking machine translation systems on WMT16 German-English data: direct assessment (DA) and EASL. The shading for each line indicates 95% confidence intervals from bootstrap resampling (100 runs).

Conclusions
We have presented an efficient online model to elicit scalar annotations for computational linguistics datasets and system evaluations. The model combines two approaches to scalar annotation: direct assessment and online pairwise ranking aggregation. We conducted three illustrative experiments: lexical frequency inference, political spectrum inference, and ranking machine translation systems. We have shown that our approach, EASL (Efficient Annotation of Scalar Labels), outperforms direct assessment in annotation efficiency and outperforms online ranking aggregation in accurately capturing the latent distribution of scalar values. The significant gains demonstrated suggest EASL as a promising approach for future dataset curation and system evaluation in the community.