Crowdsourcing and Aggregating Nested Markable Annotations

One of the key steps in language resource creation is the identification of the text segments to be annotated, or markables, which depending on the task may vary from nominal chunks for named entity resolution to (potentially nested) noun phrases in coreference resolution (or mentions) to larger text segments in text segmentation. Markable identification is typically carried out semi-automatically, by running a markable identifier and correcting its output by hand, a step that is increasingly carried out by annotators recruited through crowdsourcing, whose responses are then aggregated. In this paper, we present a method for identifying markables for coreference annotation that combines high-performance automatic markable detectors with checking with a Game-With-A-Purpose (GWAP) and aggregation using a Bayesian annotation model. The method was evaluated both on news data and on data from a variety of other genres, and results in an improvement in F1 on mention boundaries of over seven percentage points when compared with a state-of-the-art, domain-independent automatic mention detector, and almost three points over an in-domain mention detector. One of the key contributions of our proposal is its applicability to the case in which markables are nested, as is the case with coreference markables; but the GWAP and several of the proposed markable detectors are task- and language-independent and are thus applicable to a variety of other annotation scenarios.


Introduction
Developing Natural Language Processing (NLP) systems still requires large amounts of annotated text to train models, or as a gold standard to test the effectiveness of such models. The approach followed to create the most widely used data (Marcus et al., 1993; Palmer et al., 2005; Pradhan et al., 2012) is to separate the task of identifying the text segments to be annotated (the markables) from the annotation task proper. In our specific case, the markables of interest are the mentions used in coreference resolution, to be labelled as belonging to a coreference chain or as singletons; typical examples of mentions are pronouns, named entities, and other nominal phrases (Poesio et al., 2016).
The annotation of mentions for coreference has similarities with the identification of the chunks for named entity resolution (NER), but mentions can be, and often are, nested, as in the following example from the Phrase Detectives corpus (Chamberlain et al., 2016), where a mention of entity i is nested inside a mention of entity j.
[A wolf]i had been gorging on [an animal [he]i had killed]j
The methods proposed in this paper are also applicable when markables are nested.
Mention identification for annotation is typically done semi-automatically, using first an automatic mention detector (or extractor) (Uryupina and Zanoli, 2016) and then checking its output by hand. Automatic mention detectors developed for coreference systems are generally used in the first step. Mention detection was recognized early on as a key step for overall coreference quality (Stoyanov et al., 2009; Hacioglu et al., 2005; Zhekova and Kübler, 2010; Uryupina and Zanoli, 2016), so a number of good quality mention detectors were developed, such as the mention detector included in the Stanford CORE pipeline (Manning et al., 2014), used by many of the top-performing systems in the 2012 CONLL Shared Task (Pradhan et al., 2012). But this performance can be improved. The first contribution of this paper is a set of new fully-trainable, language-independent mention detectors that outperform the Stanford CORE mention detector in a variety of genres.
But even the best automatic mention detectors do not achieve the accuracy required for high-quality corpus annotation, even when run in-domain: the difference in performance between running coreference resolvers on gold mentions and running them on system mentions can be up to 20 percentage points, and the results are even poorer when running such systems out-of-domain, for domains like biomedicine (Kim et al., 2011), or for under-resourced languages (Soraluze et al., 2012). So a manual checking step is still required to obtain high-quality results. 2 Markable checking is increasingly done using crowdsourcing (Snow et al., 2008; Lawson et al., 2010; Bontcheva et al., 2017). But crowdsourcing using microtask platforms such as Amazon Mechanical Turk can be too expensive for large-scale annotation. For these cases, gamification tends to be a cheaper alternative (Poesio et al., 2013), also providing more accurate results and better contributor engagement (Lee et al., 2013).
The second contribution of this paper is an approach to mention detection for large-scale coreference annotation projects in which the output of mention detectors is corrected using a Game-With-A-Purpose (GWAP) (Von Ahn and Dabbish, 2008). A Game-With-A-Purpose is a game in which players label data as a side effect of playing. GWAPs have been successful in many annotation projects (Lafourcade et al., 2015). Examples of successful GWAPs include The ESP Game, in which players contribute image labels (Von Ahn and Dabbish, 2004), and FoldIt, in which players solve protein-structure prediction problems (Cooper et al., 2010). However, so far there have not been any truly successful GWAPs for NLP. It has proven difficult to go from simple gamification of a labelling task to developing a proper game: e.g., in one of the best-known GWAPs for NLP, Phrase Detectives (Poesio et al., 2013), the labelling remains the core of the game dynamics. Yet games such as Puzzle Racer have shown that engaging GWAPs producing annotations for text are possible and, furthermore, that the annotations thus collected are of a quality comparable to that obtainable using microtask crowdsourcing, and at a reduced cost (Jurgens and Navigli, 2014). However, such games have yet to achieve the player uptake or number of judgements comparable to GWAPs in other domains. Furthermore, it is not yet clear whether using GWAPs can result in better performance for tasks such as mention detection, for which good-performance systems exist.
2 One difference between the mention detectors used for coreference resolvers and those used to preprocess data for coreference annotation is relevant for subsequent discussion. The former usually aim for high recall and compromise on precision, placing more confidence/importance on the coreference resolution step (Kummerfeld et al., 2011) and being satisfied that incorrectly identified mentions will simply remain singletons which can be removed in post-processing (Lee et al., 2011). The latter tend to go for high F. This difference played a role in our experiments, as discussed later.
In this work, automatically extracted mentions are checked using a two-player GWAP, TileAttack. Our previous analysis of the performance of TileAttack using player satisfaction metrics derived from the Free 2 Play literature suggests that we are succeeding in developing an engaging game (Madge et al., 2017). In this paper, we demonstrate that using TileAttack to check the output of our mention detector results in substantial improvement to the quality of the output. The game supports any text segmentation task, whether markables are nested or non-nested, aligned or not aligned, and is therefore applicable at least in principle to a variety of text annotation tasks besides coreference, including e.g., Named Entity Resolution (NER).
Key to this result is the use of a novel aggregation method to combine the labels produced by the mention detector with the labels collected using the game. A number of aggregation methods applicable to text segmentation labelling have been proposed (Dawid and Skene, 1979; Hovy et al., 2013; Passonneau and Carpenter, 2014; Felt et al., 2014; Rodrigues et al., 2014; Nguyen et al., 2017; Paun et al., 2018), but they are not directly applicable when markables can be nested. The third contribution of this paper is a novel method to use aggregation with potentially nested markables. We show that using this method to aggregate mention detector labels and TileAttack labels results in improved markable boundary quality.
According to this definition, candidate mentions include all noun phrases (NPs) and all possessive pronouns. Non-referring NPs (like "It" in "It is sunny" or "a policeman" in "John is a policeman") and singletons are considered candidate mentions as well, possibly to be filtered during coreference annotation proper.
The maximal projection of the NP is marked; i.e., the full extent of the NP including pre-modifiers, post-modifiers and appositions. In the case of a coordinated NP such as "Alice and Bob", each conjunct and the coordinated NP are treated as candidate mentions:
[[Alice]i and [Bob]j]k went to the shops. [They]k had a coffee.

Two automated mention detectors
The first ingredient of our proposal is two strong mention detectors to serve both as baselines and as AI opponents for TileAttack. The first pipeline parses the input sentences using a dependency parser and then extracts mentions from the dependency parse; we call this the DEP pipeline. The second pipeline is a modified version of the neural named entity recognition system proposed by Lample et al. (2016); we call it the NN pipeline. Both pipelines are trained on the Penn Treebank (PTB).

DEP pipeline
Our DEP pipeline first parses input sentences using the Mate dependency parser (Bohnet and Nivre, 2012), then applies a rule-based mention extractor. Our extractor follows a three-step approach.
It first extracts mention heads using heuristic patterns based on part-of-speech tags and dependency relations. The patterns are automatically extracted from the gold annotation of the Phrase Detectives 1.0 corpus (Chamberlain et al., 2016). We extract all the pairs of part-of-speech tag and dependency relation found on the mentions' heads in the corpus, and use the most frequent patterns. The second step finds the maximum span related to a given mention head; for this we use the left-most and right-most direct or indirect children of the mention head as the start and end of the mention. The last step checks if any of the mentions created by step two overlap with each other. Overlapping mentions are replaced with the union of those mentions.
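The three steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' extractor: the function names, the `heads` array encoding of the parse (parent index per token, `None` for the root), the example tag/relation labels, and the reading of "overlap" as partial (crossing) overlap, so that properly nested mentions are kept, are all assumptions made here.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # inclusive token indices (start, end)


def subtree_span(head_idx: int, heads: List[Optional[int]]) -> Span:
    """Step 2: span from the left-most to the right-most (in)direct child."""
    lo = hi = head_idx
    for tok in range(len(heads)):
        node: Optional[int] = tok
        # Walk up the tree; tok is a descendant if we reach head_idx.
        while node is not None and node != head_idx:
            node = heads[node]
        if node == head_idx:
            lo, hi = min(lo, tok), max(hi, tok)
    return lo, hi


def crosses(a: Span, b: Span) -> bool:
    """True when two spans partially overlap and neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    return (s1 < s2 <= e1 < e2) or (s2 < s1 <= e2 < e1)


def merge_crossing(spans: List[Span]) -> List[Span]:
    """Step 3 (assumed reading): replace crossing spans with their union."""
    result = sorted(set(spans))
    changed = True
    while changed:
        changed = False
        for i in range(len(result)):
            for j in range(i + 1, len(result)):
                if crosses(result[i], result[j]):
                    a, b = result[i], result[j]
                    union = (min(a[0], b[0]), max(a[1], b[1]))
                    rest = [s for k, s in enumerate(result) if k not in (i, j)]
                    result = sorted(set(rest + [union]))
                    changed = True
                    break
            if changed:
                break
    return result


def extract_mentions(pos: List[str], deprel: List[str],
                     heads: List[Optional[int]],
                     head_patterns: Set[Tuple[str, str]]) -> List[Span]:
    """Step 1 picks heads by (POS, relation) pattern, then steps 2 and 3 follow."""
    found = [i for i in range(len(pos)) if (pos[i], deprel[i]) in head_patterns]
    return merge_crossing([subtree_span(i, heads) for i in found])
```

On a hand-built parse of "A wolf had been gorging on an animal he had killed", with hypothetical patterns such as `("NN", "SBJ")`, `("NN", "PMOD")` and `("PRP", "SBJ")`, this returns the three mentions of the earlier example, including the nested pronoun.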

NN Pipeline(s)
Our second pipeline is based on the neural named entity recognition (NER) system proposed by Lample et al. (2016). This takes a sentence as the input and outputs a sequence of IOB-style NER labels. The system uses a bidirectional LSTM to encode sentences and applies a conditional random field (CRF) layer over the output of the LSTM. But the CRF, effective when handling sequence labelling tasks such as NER, is not suitable for predicting mentions, as mentions can be nested. We address this as follows. For each token we create at most l candidate mentions. Let s and e be the start and end indices of the mention, and x_i the LSTM output on the i-th token. The mention is represented by [x_s, x_e]. We add a mention width feature embedding (φ) and apply self-attention over the tokens inside a mention ([x_s ... x_e]) to create a weighted mention representation w_se. After creating the mention representation [x_s, x_e, w_se, φ], we use a feed-forward neural network with a sigmoid activation function on the output layer to assign each candidate mention a mention score. During training we minimise the sigmoid cross-entropy loss. During prediction, mentions with a score above the threshold (t) are returned. The threshold can be adjusted to create models for different purposes. In particular, in this paper we experimented with two models: one optimized for high recall, the other for high F1. We use the same network parameters as Lample et al. (2016) except for the two parameters introduced by our system. We set the maximum mention width to 30, i.e. l = 30, and set t = 0.5 and t = 0.95 for our high-recall and high-F1 versions respectively.
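To make the candidate generation and thresholding concrete, the sketch below enumerates spans of at most l tokens, shows a softmax-weighted average as one simple form the attention-weighted representation w_se could take, and filters candidates by a sigmoid score against t. It is an illustration under assumptions: the trained LSTM and feed-forward network are replaced by a hypothetical `score_fn` that returns a raw score per span, and all function names are invented here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Span = Tuple[int, int]  # inclusive token indices (start, end)


def candidate_spans(n_tokens: int, max_width: int) -> List[Span]:
    """For each start token, create at most max_width candidate mentions."""
    return [(s, e)
            for s in range(n_tokens)
            for e in range(s, min(s + max_width, n_tokens))]


def attention_average(vectors: Sequence[Sequence[float]],
                      scores: Sequence[float]) -> List[float]:
    """Softmax-weighted average of the token vectors inside one span."""
    top = max(scores)
    weights = [math.exp(a - top) for a in scores]  # shifted for stability
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) / total
            for d in range(dim)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_mentions(n_tokens: int,
                     score_fn: Callable[[int, int], float],
                     max_width: int = 30,
                     threshold: float = 0.5) -> List[Span]:
    """Return the candidates whose sigmoid score exceeds the threshold t."""
    return [(s, e) for s, e in candidate_spans(n_tokens, max_width)
            if sigmoid(score_fn(s, e)) > threshold]
```

With the same scores, a lower threshold returns a superset of the mentions returned at a higher one, which is how a single trained model can yield both a high-recall and a high-F1 configuration.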

Results
We use as a baseline the Stanford deterministic mention detector (Manning et al., 2014), arguably the most widely used mention detector for coreference with the CONLL dataset (Pradhan et al., 2012). Table 1 compares our pipelines and Stanford's on a number of data sets. Both of our pipelines consistently outperform the Stanford pipeline by a large margin.

TileAttack
To check the mentions produced by the automatic mention detectors discussed above we developed TileAttack, a web-based, two-player blind, token sequence-labelling game. Its visual design is inspired by Scrabble, with a tile-like visualisation shown in Figure 1. In the game, players perform a text segmentation task which involves marking spans of tokens represented by tiles. Players are awarded points based on player agreement. The game is highly adaptable to different corpora, sequence labelling tasks and languages. It is not easy to come up with an original game design. Our approach was to adopt a game design as close as possible to an existing recipe, specifically the ESP Game (Von Ahn and Dabbish, 2008), but adapted to text annotation. Like The ESP Game, TileAttack has an "output-agreement" format, in which two players or agents are anonymously paired and must produce the same output for a given input (Von Ahn and Dabbish, 2008). This provided the opportunity to test which lessons learned from games similar to The ESP Game still apply to text annotation games.

Gameplay
In each round, the player is shown a single sentence to annotate. Players can select a span from the sentence by simply selecting the start and end tokens of the item they wish to mark. A preview of their selection is then shown immediately below. To confirm this annotation, they may either click the preview selection or click the Annotate button. The annotation is then shown in the player's colour. When the two players match on a selection, the tiles for the selection in agreement are shown with a glinting effect, in the colour of the player that first annotated the tiles and a border colour of the player that agreed. The players' scores are shown at the top of the screen.
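The agreement check described above can be pictured with a short sketch. It is only an illustration of the output-agreement idea: the move format (a chronological list of player and span pairs) and the function name are assumptions, and the actual scoring rules of the game are not modelled.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # inclusive token indices (start, end)


def agreed_spans(moves: List[Tuple[str, Span]]) -> Dict[Span, Tuple[str, str]]:
    """Map each span marked by both players to (first annotator, agreeing player).

    `moves` is assumed to be in chronological order. The first element of the
    pair would give the tile colour and the second the border colour.
    """
    first_by_span: Dict[Span, str] = {}
    agreed: Dict[Span, Tuple[str, str]] = {}
    for player, span in moves:
        if span not in first_by_span:
            first_by_span[span] = player
        elif first_by_span[span] != player and span not in agreed:
            agreed[span] = (first_by_span[span], player)
    return agreed
```

Only exact matches on both start and end token count as agreement here, mirroring the exact-boundary matching used in the evaluation.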
When players have finished, they click the Done button, upon which they will not be able to make any more moves, but will see their opponent's moves. Their opponent is also notified they have finished and invited to click Done once they have finished. Once both players have clicked Done, the round is finished and both players are shown a round summary screen. This screen shows the moves that both players agreed on, and whether they won or lost the round.
Clicking Continue then takes the player to a leaderboard showing wins, losses and the current top fifteen players. From this page they may click the Next Game button to start another round. On the leaderboard, players are also offered the opportunity to sign up.

Opponents
Like all two-player GWAPs, TileAttack chooses an artificial agent as opponent for a player if no human opponent is available. An artificial agent is also used in crowdsourcing mode, as is the case in this study. The game uses three different artificial agents as opponents, selected in the following order of priority, descending to the next when the condition is not met:
Silver AI: replays the aggregated result of all player games so far.
Replay AI: replays a recorded previous game, if a previous game is available for that item.
Pipeline AI: plays the moves from the automated pipeline.
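The priority order of the three agents amounts to a simple fallback chain, sketched below. The function and argument names are hypothetical, and representing "condition not met" as a missing (`None`) set of moves is an assumption; in particular, the source does not spell out when the Silver AI is considered available.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # inclusive token indices (start, end)


def choose_opponent(silver_moves: Optional[List[Span]],
                    replay_moves: Optional[List[Span]],
                    pipeline_moves: List[Span]) -> Tuple[str, List[Span]]:
    """Pick the highest-priority agent that has moves for this sentence."""
    if silver_moves is not None:   # aggregated result of earlier games exists
        return "Silver AI", silver_moves
    if replay_moves is not None:   # a recorded previous game exists
        return "Replay AI", replay_moves
    return "Pipeline AI", pipeline_moves  # always available as a last resort
```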

Aggregating Mentions
The boundaries labelled by non-experts can be expected to be quite noisy compared to expert annotations; but we can also expect the quality of the aggregated judgements to be comparable to that obtained with experts, provided sufficient non-experts are consulted (Snow et al., 2008). We are not aware, however, of any previous proposal to aggregate such annotations when they are nested. In this Section we introduce the two methods we used: a baseline based on taking the most popular judgement among the annotators (majority voting), and a probabilistic approach. Both methods require a way of clustering together the mentions to be compared; we propose one such method in the first subsection.

Head-based mention boundary clustering
To apply aggregation, it is necessary to determine which judgements (boundary pairs) are competing. We do this by clustering all annotations sharing the same nominal head.
The head of a player-generated candidate mention is extracted from the dependency parse used by the DEP pipeline after aligning the candidate mention with the dependency tree as follows. Given a player-generated candidate mention, we first find the subtrees of the dependency tree that completely cover all the tokens in the candidate mention. The highest leftmost head of those subtrees is then considered as the head. If no nominal head is present in those subtrees, the candidate mention is not considered for aggregation.
Consider e.g. the sentence "John's car is red". Suppose the players proposed the candidate mentions "John's car", "John's", and the (incorrect) mention "John's car is". Suppose also that the (automatically computed) dependency tree is as in Figure 2. Then "John's car" can be aligned with the subtree whose head is "car"; "John's" can be aligned with a subtree with head "John". Both of these heads are nominal, so the two candidate mentions are considered for clustering. "John's car is" would be aligned with the two subtrees with the roots "car" and "is", shown in Figure 2 by the red box. The highest leftmost head, and therefore the head that would be used, is "car". Relaxing the alignment criteria in this way is important to allow the pipeline to guide the clustering while not constraining newly proposed boundaries to the pipeline's overall interpretation (which may be incorrect).
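A rough sketch of the clustering step on this example follows. Several things are assumptions made for illustration: the parse (both "car" and "is" attached to "red"), the set of POS tags treated as nominal, the simplification of taking as subtree roots the tokens in the span whose parent lies outside it, and the reading that the head is the highest leftmost nominal root.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # inclusive token indices (start, end)
NOMINAL_POS = {"NN", "NNS", "NNP", "NNPS", "PRP", "PRP$"}  # assumed tag set


def depth(tok: int, heads: List[Optional[int]]) -> int:
    """Number of arcs between a token and the root of the sentence."""
    d = 0
    while heads[tok] is not None:
        tok = heads[tok]
        d += 1
    return d


def span_head(span: Span, heads: List[Optional[int]],
              pos: List[str]) -> Optional[int]:
    """Highest leftmost nominal root among the subtrees covering the span."""
    start, end = span
    roots = [t for t in range(start, end + 1)
             if heads[t] is None or not (start <= heads[t] <= end)]
    nominal = [t for t in roots if pos[t] in NOMINAL_POS]
    if not nominal:
        return None  # no nominal head: not considered for aggregation
    return min(nominal, key=lambda t: (depth(t, heads), t))


def cluster_by_head(spans: List[Span], heads: List[Optional[int]],
                    pos: List[str]) -> Dict[int, List[Span]]:
    """Group competing boundary pairs by their shared nominal head."""
    clusters: Dict[int, List[Span]] = defaultdict(list)
    for span in spans:
        head = span_head(span, heads, pos)
        if head is not None:
            clusters[head].append(span)
    return dict(clusters)
```

With tokens John, 's, car, is, red and `heads = [2, 0, 4, 4, None]`, the spans for "John's car" and "John's car is" both land in the cluster headed by "car", while "John's" forms its own cluster headed by "John".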

Baseline: Majority Voting
Majority Voting was used as a baseline aggregation method. Following clustering, majority voting is applied to each cluster, choosing the boundary that has the highest number of votes among all those sharing the same nominal head. Ties are broken randomly; the process is rerun five times.
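As a small illustration of this baseline, the sketch below picks the most frequent boundary in one cluster and breaks ties at random. The function name and the use of a seeded `random.Random` instance (so that reruns can be reproduced) are choices made here, not details from the source.

```python
import random
from collections import Counter
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive token indices (start, end)


def majority_vote(cluster: List[Span], rng: random.Random) -> Span:
    """Return the boundary with the most votes; ties are broken randomly."""
    counts = Counter(cluster)
    top = max(counts.values())
    tied = sorted(span for span, n in counts.items() if n == top)
    return rng.choice(tied)
```

Calling it with a different seed on each rerun, for example `majority_vote(cluster, random.Random(run))`, gives one way to repeat the tie-breaking several times.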

A Probabilistic Approach
The majority vote baseline implicitly assumes equal expertise among annotators, an assumption shown to be false in practice (Dawid and Skene, 1979; Passonneau and Carpenter, 2014). A probabilistic model of annotation, on the other hand, can capture annotators' different levels of ability (Paun et al., 2018). This Section describes an application of the model proposed by Dawid and Skene (1979) to the boundary detection task.
Each of the clusters collected as discussed above contains a number of candidate boundaries supplied by the players. The goal is to identify the correct boundary for each cluster. A multi-class version of the Dawid and Skene model cannot be applied since the class space (the boundaries) is not consistent (i.e., not the same set) across the clusters. However, a binary version of the model can be applied after some careful data pre-processing. Concretely, for each boundary we obtain a series of binary decisions as a result of a "one vs. the others" encoding performed at cluster level. For example, given a cluster whose annotations are the boundaries "a, b, a, a", we have for the "a" boundary a collection of "1, 0, 1, 1" decisions, while for the "b" boundary we have "0, 1, 0, 0". We then train a Bayesian version of the binary Dawid and Skene model on these boundary decisions. The model infers for each boundary a decision indicator which can be interpreted as whether the boundary is correct or not. After some simple post-processing, we assign to each cluster the boundary whose posterior indicator has the most mass associated with the positive outcome.
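The encoding and selection steps can be sketched as below. Note the substitution: instead of the Bayesian model used in the paper, this sketch fits a plain binary Dawid and Skene model by expectation maximisation with add-one smoothing, which is enough to show the data flow. All function names and the input format (cluster id mapped to annotator and boundary pairs) are assumptions.

```python
import math
from typing import Dict, List, Tuple

Item = Tuple[str, str]        # (cluster id, boundary label)
Vote = Tuple[Item, str, int]  # (item, annotator, binary decision)


def one_vs_others(labels: List[str]) -> Dict[str, List[int]]:
    """Encode one cluster's annotations as binary decisions per boundary."""
    return {b: [1 if lab == b else 0 for lab in labels]
            for b in dict.fromkeys(labels)}


def fit_binary_dawid_skene(votes: List[Vote], n_iter: int = 50) -> Dict[Item, float]:
    """EM estimate of P(boundary is correct), with per-annotator accuracies."""
    items = sorted({item for item, _, _ in votes})
    annotators = sorted({ann for _, ann, _ in votes})
    post: Dict[Item, float] = {}
    for item in items:  # initialise with the mean vote for each item
        labs = [lab for it, _, lab in votes if it == item]
        post[item] = sum(labs) / len(labs)
    for _ in range(n_iter):
        # M-step: smoothed prevalence, sensitivity and specificity.
        prior = (sum(post.values()) + 1.0) / (len(items) + 2.0)
        sens: Dict[str, float] = {}
        spec: Dict[str, float] = {}
        for ann in annotators:
            mine = [(it, lab) for it, a, lab in votes if a == ann]
            pos = sum(post[it] for it, _ in mine)
            neg = sum(1.0 - post[it] for it, _ in mine)
            hit = sum(post[it] for it, lab in mine if lab == 1)
            rej = sum(1.0 - post[it] for it, lab in mine if lab == 0)
            sens[ann] = (hit + 1.0) / (pos + 2.0)
            spec[ann] = (rej + 1.0) / (neg + 2.0)
        # E-step: posterior that each boundary is correct.
        for item in items:
            lp1, lp0 = math.log(prior), math.log(1.0 - prior)
            for it, ann, lab in votes:
                if it == item:
                    lp1 += math.log(sens[ann] if lab == 1 else 1.0 - sens[ann])
                    lp0 += math.log(spec[ann] if lab == 0 else 1.0 - spec[ann])
            top = max(lp0, lp1)
            e1, e0 = math.exp(lp1 - top), math.exp(lp0 - top)
            post[item] = e1 / (e1 + e0)
    return post


def choose_boundaries(clusters: Dict[str, List[Tuple[str, str]]]) -> Dict[str, str]:
    """Pick, per cluster, the boundary with the highest posterior."""
    votes: List[Vote] = []
    for cid, pairs in clusters.items():
        encoded = one_vs_others([b for _, b in pairs])
        for boundary, decisions in encoded.items():
            votes += [((cid, boundary), ann, d)
                      for (ann, _), d in zip(pairs, decisions)]
    post = fit_binary_dawid_skene(votes)
    return {cid: max({b for _, b in pairs}, key=lambda b: (post[(cid, b)], b))
            for cid, pairs in clusters.items()}
```

Because annotator accuracies are estimated across all clusters, a boundary backed by fewer but more reliable annotators can win over a more popular one, which is the behaviour that majority voting cannot reproduce.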
In order to evaluate our approach, we tested the mention boundaries obtained using the two proposed pipelines and by aggregating the judgements collected using TileAttack in several different ways over datasets in different genres.

Experiment setup
As noted above, our approach to human checking of mentions produced by other players or by a system is to treat existing annotations as artificial agents that human players 'play against'. But we also pointed out that the mention detectors used for coreference resolution systems are optimised to achieve extremely high recall (the assumption being that the extra mentions will be filtered during coreference resolution proper), and that this choice may not be optimal when using an automatic mention detector for annotation, where in our case it is treated as an agent from which the other players will derive feedback. In this context, a mention detector optimised for high overall F may be preferable, as it may provide better feedback to the human players. We therefore tried two versions of the NN pipeline in this experiment: one optimized for high recall, and one for high F1. The two configurations are shown in Figure 3.

Participant recruitment and platform use
The regular players of TileAttack are typically experts in language or language puzzles, and many of them are linguists or computational linguists. As a result, the quality of the mentions they produce tends to be very high, as shown in Table 2, which reports the aggregated results of these players on the sentences from the 'Other Domains' dataset when playing against the 'High recall' pipeline. Our players obtain an aggregated F of 90.5, which is very high. However, collecting judgements from real players tends to be slower than using a crowdsourcing service. Given that in this paper we were not concerned with comparing the effectiveness of crowdsourcing platforms and GWAPs, we collected the headline results for this experiment using judgements from participants recruited through Amazon Mechanical Turk. This was done for purely practical reasons, namely ensuring we would collect sufficient data in a reasonably short time.

The participants' task
After completing the tutorial, participants are asked to annotate 3 sentences. At the end of each round, the participant is given feedback in the form of a comparison of their moves to those of the opponent. The participants are paid US$0.40 for completing the tutorial and three sentences on their first HIT, or five sentences on subsequent HITs.

Datasets
Most coreference datasets consist primarily of news text; for this reason, our first dataset, referred to below as "News", consists of 102 sentences from five randomly selected documents from the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993), annotated with coreference as part of the ARRAU corpus (Uryupina et al., 2019).
The second dataset, referred to below as "Other Domains", is 180 sentences from a collection of our own creation consisting of documents covering different genres, from simple language learning texts and student reports to Wikipedia pages and fiction from Project Gutenberg. We hand-labelled the mentions in those sentences ourselves.

News dataset
102 sentences were annotated by 131 participants. Each sentence was annotated at least 8 times (maximum of 11). A boundary was considered correct iff the start and end match exactly.
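The exact-match criterion translates directly into precision, recall and F1 over sets of boundaries, as in the following sketch. The function name is hypothetical, and representing a boundary as a (sentence id, start, end) triple is an assumption.

```python
from typing import Iterable, Tuple

Boundary = Tuple[int, int, int]  # (sentence id, start token, end token)


def boundary_prf(predicted: Iterable[Boundary],
                 gold: Iterable[Boundary]) -> Tuple[float, float, float]:
    """Precision, recall and F1, counting a boundary as correct only if
    both its start and its end match a gold boundary exactly."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under this criterion a boundary that is off by a single token, such as a missing post-modifier, counts both as a false positive and as a missed gold mention.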
Table 3 compares the results obtained using the four pipelines and by applying the two aggregation approaches to combinations of the user judgements (u), our DEP pipeline (d), the NN pipeline (High F1 and High Recall configurations) and the Stanford pipeline (s). The presence or absence of the annotations for the users or pipelines is indicated by a preceding + or − respectively. MV indicates application of the majority voting aggregation method, and P the probabilistic aggregation method. The table suggests, first of all, that the domain-trained pipelines outperform the domain-independent Stanford one, as expected; second, that in this genre human judgements only match the domain-dependent pipelines when probabilistic aggregation is used; and third, that by aggregating user judgements and domain-dependent pipelines we see an improvement in F1 of up to 2.5 percentage points, but only with probabilistic aggregation.
In Figures 4 and 5 we plot F1 to look at how many non-expert annotators are required to rival the performance of the pipelines using the respective aggregation methods. In Figure 4 only the participants are shown. The figure shows that in this genre the domain-specific automated pipeline (trained on this domain) outperforms the participants, but already at five annotators, aggregated with the probabilistic aggregation method, we are very close to the performance of the domain-specific pipeline. And in Figure 5, which shows the results of aggregating participants with the pipelines (and in which the first two participants are the two automated pipelines), we can see that we only need to aggregate 3 participants with the domain-specific pipeline to exceed its performance.

Other Domains
431 participants in the High Recall Group and 120 participants in the High F1 Group labelled 180 sentences.
Table 4 shows the results for both configurations of the pipeline (highest score marked in bold). We can see that, operating outside their original domains, the automated pipelines, while still outperforming the Stanford pipeline by around 4 percentage points, do not outperform aggregated users.
However, they do appear to serve well as agents to train participants to perform annotations, as participants annotate to a high level of accuracy.

Error Analysis
We analysed the errors produced both before and after aggregation. There were many errors to consider, so we took an approximate rule-driven approach to characterise as many as possible.
Before aggregation, by far the most common error (1254 cases) is participants marking individual nouns as noun phrases (e.g., marking the [cat] instead of [the cat]). This suggests a misunderstanding of what a noun phrase is that may possibly be addressed by improvements to the tutorial. Similarly, in 606 cases participants mark named entities or strings of proper nouns rather than the encapsulating noun phrase.
The next most common error (529 cases) is annotators neglecting to include post-modifiers when selecting noun phrase boundaries (e.g., marking [the cat] in the hat instead of [the cat in the hat]). This is often the most popular judgement and, as such, is chosen by MV. A real example of this is in Figure 6: whilst five annotators did identify the correct boundaries (in green), matching the gold standard (in gold), more (six) only marked the reduced boundaries (in red), "A consortium of private investors". This sequence, missing the post-modifier, was consequently chosen by majority voting. The probabilistic method (in silver), however, expressed more confidence in the five annotators and provided a correct final judgement.
[...] where "ten books" and "the library" should not be coordinated.
Another common problem for automated mention detectors was prepositional phrase attachment. Our automated mention detectors tend to prefer low attachment, as in So John and Caroline filled up a [green bin with mandarins].
The example above highlights another common error with the mention detectors: missing the determiner, most commonly quantifiers and indefinite articles.
Lastly, proper nouns near the start of sentences are often incorrectly grouped with the capitalized first token, which is incorrectly also identified as a proper noun (e.g. [First Art] sat in the car... rather than First [Art] sat in the car...).

Related Work

Gamifying all steps of a pipeline
The Groningen Meaning Bank project includes multiple gamified interfaces as part of a platform called Wordrobe. These gamified interfaces are supported by prior judgements provided by an automated NLP pipeline and the GMB Explorer (Basile et al., 2012). The Wordrobe suite of games (Bos et al., 2017) includes multiple games that go on to produce annotations similar to those of TileAttack (e.g. Named Entity Recognition). However, all tasks are represented by a single common multiple choice format. TileAttack targets a single yet core NLP annotation task (sequence labelling) with a broad set of applications. We do not constrain user input based on any prior judgement beyond tokenisation.

Aggregating markable annotations
Whilst there has been a great deal of work and evaluation on aggregating judgements from noisy crowdsourced data, this is generally focused on classification-based annotations (Sheshadri and Lease, 2013) and does not generalise to sequence labelling tasks like NER markable annotation. Dredze et al. (2009) proposed a "Multi-CRF" approach to aggregating noisy sequence labels in a NER task, which also includes judgements provided by an automated pipeline. Confidence in annotators is not modelled. However, it has been extended to incorporate the reliability of the annotator with a similar method that also combines Expectation Maximization with CRF in an NER and NP chunking task (Rodrigues et al., 2014). Nguyen et al. apply HMM and LSTM methods to aggregating judgements in NER and IE, including a crowd component in both models representing each annotator's ability for each label class (Nguyen et al., 2017). Whilst variations of CRF and HMM have demonstrated a great improvement over majority voting approaches, models to date have not taken into account the nested nature of sequences that occur in tasks such as markable identification.

Discussion and Conclusions
In this paper, we presented a hybrid mention detection method combining state-of-the-art automatic mention detectors with a gamified, two-player interface to collect markable judgements. The integration takes place by using the automatic mention detectors as 'players' in the game. Data from automatic mention detectors and players are then aggregated using a probabilistic aggregation method choosing the most likely interpretation in a nominal head-centered cluster.
We showed that using this combination we can achieve, in the news domain, an accuracy at mention identification that is almost three percentage points higher than that obtained with an automatic domain-trained mention detector, and over seven percentage points higher than that obtained with a domain-independent one. We also tested the approach in genres outside those in which the automatic pipelines were trained, showing that high accuracy can be achieved in these as well. These results suggest that it may be possible to gamify not just the task of annotating coreference, but also the prerequisite steps to that.
As a rule of thumb, of the two best-known forms of crowdsourcing, microtask crowdsourcing using platforms such as Amazon Mechanical Turk is best for labelling small to medium amounts of data in a short time, and for labelling data of no intrinsic interest, whereas crowdsourcing with games-with-a-purpose is best when the objective is to collect very large amounts of labels, so that the initial costs of setting up the game can be offset by the reduced costs of labelling (Poesio et al., 2013). A case in point is the Phrase Detectives annotation effort. The latest release of these data (Poesio et al., 2019) contains 2.2M judgments, around 4 times the number of judgments collected for ONTONOTES. The approach to mention detection proposed in this paper was developed in support of games such as Phrase Detectives, thus a GWAP or at least gamified approach as exemplified by TileAttack was deemed more appropriate, even if the judgments used in this paper were collected using Amazon Mechanical Turk for speed. About 5,000 sentences were annotated by regular (i.e., not paid) players in this initial development phase, but we expect the game will be able to collect an amount of judgments comparable to Phrase Detectives once it is fully operational and properly advertised. And a gamified interface such as TileAttack can be beneficial even for projects that just use microtask crowdsourcing, as it has been shown that gamified HITs are more popular (Morschheuser et al., 2017).

Figure 2: Finding a head for a proposed boundary

Figure 3: Experiment Setup

Figure 6: Example of post-modifier phrase

Table 2: Regular players' accuracy on 'Other Domains'

Table 3: Comparing pipeline and aggregation methods

Table 4: Results on the 'Other Domains' dataset (rounded to 3 dp)