Diversity-Aware Batch Active Learning for Dependency Parsing

While the predictive performance of modern statistical dependency parsers relies heavily on the availability of expensive expert-annotated treebank data, not all annotations contribute equally to the training of the parsers. In this paper, we attempt to reduce the number of labeled examples needed to train a strong dependency parser using batch active learning (AL). In particular, we investigate whether enforcing diversity in the sampled batches, using determinantal point processes (DPPs), can improve over their diversity-agnostic counterparts. Simulation experiments on an English newswire corpus show that selecting diverse batches with DPPs is superior to strong selection strategies that do not enforce batch diversity, especially during the initial stages of the learning process. Additionally, our diversity-aware strategy is robust under a corpus duplication setting, where diversity-agnostic sampling strategies exhibit significant degradation.


Introduction
Though critical to parser training, data annotations for dependency parsing are both expensive and time-consuming to obtain. Syntactic analysis requires linguistic expertise and even after extensive training, data annotation can still be burdensome. The Penn Treebank project (Marcus et al., 1993) reports that after two months of training, the annotators average 750 tokens per hour on the bracketing task; the Prague Dependency Treebank (Böhmová et al., 2003) cost over $600,000 and required 5 years to annotate roughly 90,000 sentences (over $5 per sentence). These high annotation costs present a significant challenge to developing accurate dependency parsers for under-resourced languages and domains.
Active learning (AL; Settles, 2009) is a promising technique to reduce the annotation effort required to train a strong dependency parser by intelligently selecting samples to annotate such that the return on each annotator hour is as high as possible. Popular selection strategies, such as uncertainty sampling, associate each instance with a quality measure based on the uncertainty or confidence level of the current parser, and higher-quality instances are selected for annotation. (* Work done during an internship at Bloomberg L.P.)
We focus on batch-mode AL, since it is generally more efficient for annotators to label in bulk. While early work in AL for parsing (Tang et al., 2002; Hwa, 2000, 2004) cautions against using individually-computed quality measures in the batch setting, more recent work demonstrates empirical success (e.g., Li et al., 2016) without explicitly handling intra-batch diversity. In this paper, we explore whether a diversity-aware approach can improve the state of the art in AL for dependency parsing. Specifically, we consider samples drawn from determinantal point processes (DPPs) as a query strategy to select batches of high-quality, yet dissimilar instances (Kulesza and Taskar, 2012).
In this paper, we (1) propose a diversity-aware batch AL query strategy for dependency parsing compatible with existing selection strategies, (2) empirically study three AL strategies with and without diversity factors, and (3) find that diversity-aware selection strategies are superior to their diversity-agnostic counterparts, especially during the early stages of the learning process, in simulation experiments on an English newswire corpus. This is critical in low-budget AL settings, which we further confirm in a corpus duplication setting.


Active Learning for Dependency Parsing

Dependency Parsing
Dependency parsing (Kübler et al., 2008) aims to find the syntactic dependency structure y for a length-n input sentence x = x_1, x_2, …, x_n, where y is a set of n arcs over the tokens and the dummy root symbol x_0, and each arc (h, m) ∈ y specifies the head h and the modifier word m. In this work, we adopt the conceptually simple edge-factored deep biaffine dependency parser (Dozat and Manning, 2017), which is competitive with the state of the art in terms of accuracy. The parser assigns a locally-normalized attachment probability P_att(head(m) = h | x) to each candidate attachment pair (h, m) based on a biaffine scoring function. Refer to Appendix A for architecture details.
We define the score of a candidate parse tree as s(y | x) = Σ_{(h,m)∈y} log P_att(head(m) = h | x). The decoder finds the best-scoring ŷ among all valid trees Y(x): ŷ = argmax_{y∈Y(x)} s(y | x).
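As an illustration, the edge-factored score and the decoding step can be sketched as follows. This is a toy example with hypothetical probabilities; a real decoder would use the Chu-Liu-Edmonds algorithm rather than brute-force enumeration of candidate trees.

```python
import math

def tree_score(heads, log_p):
    """Edge-factored score: sum of log P_att(head(m)=h | x) over arcs.
    heads[m] is the head of token m (tokens are 1-indexed, 0 is the root);
    log_p[m][h] is the log attachment probability of arc (h, m)."""
    return sum(log_p[m][h] for m, h in heads.items())

# Toy 2-token sentence: log-probability table with hypothetical numbers.
log_p = {
    1: {0: math.log(0.7), 2: math.log(0.3)},
    2: {0: math.log(0.2), 1: math.log(0.8)},
}

# Brute-force decoding over the valid single-root candidate trees.
candidates = [{1: 0, 2: 1}, {1: 2, 2: 0}]
best = max(candidates, key=lambda y: tree_score(y, log_p))
```

Here `best` attaches token 2 to token 1, since 0.7 × 0.8 exceeds 0.3 × 0.2 under the edge-factored decomposition.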

Active Learning (AL)
We consider the pool-based batch AL scenario, where we assume a large collection of unlabeled instances U from which we sample a small subset to annotate in each round, forming an expanding labeled training set L (Lewis and Gale, 1994). We use the superscript i to denote the pools U^i and L^i after the i-th round; L^0 is a small set of seed labeled instances used to initiate the process. Each iteration starts with training a model M^i on L^i. Next, all unlabeled instances in U^i are parsed by M^i and we select a batch B to annotate based on some criterion C. The definition of the selection criterion C is critical. A typical strategy associates each unlabeled instance U_i with a quality measure q_i based on, for example, the model uncertainty level when parsing U_i. A diversity-agnostic criterion sorts all unlabeled instances by their quality measures and takes the top-k as the batch for a budget k.
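The diversity-agnostic top-k criterion described above can be sketched in a few lines; the pool, scores, and budget here are illustrative placeholders.

```python
def select_topk(unlabeled, quality, k):
    """Diversity-agnostic criterion: rank unlabeled instances by their
    quality measure and take the top-k as the batch to annotate."""
    ranked = sorted(unlabeled, key=quality, reverse=True)
    return ranked[:k]

pool = ["s1", "s2", "s3", "s4"]
q = {"s1": 0.1, "s2": 0.9, "s3": 0.5, "s4": 0.7}  # toy quality scores
batch = select_topk(pool, lambda s: q[s], k=2)  # -> ["s2", "s4"]
```

Note that nothing here prevents the two top-ranked instances from being near-duplicates, which motivates the diversity-aware treatment in Section 3.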

Quality Measures
We consider three commonly-used quality measures adapted to the task of dependency parsing, including uncertainty sampling, Bayesian active learning, and a representativeness-based strategy.
Average Marginal Probability (AMP) measures parser uncertainty (Li et al., 2016):

AMP(x) = (1/n) Σ_{(h,m)∈ŷ} P_mar(head(m) = h | x),

where P_mar is the marginal attachment probability; sentences with low AMP, i.e., high parser uncertainty, are preferred for annotation. (For clarity, we describe unlabeled parsing throughout. In our experiments, we train labeled dependency parsers, which additionally predict a dependency relation label l for each arc.)
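A minimal sketch of AMP, under the simplifying assumption that each token's predicted arc is its highest-marginal head (the toy probability tables are hypothetical):

```python
def amp(marginals):
    """Average over tokens of the best head's marginal attachment
    probability; a LOW value indicates HIGH parser uncertainty."""
    return sum(max(p.values()) for p in marginals) / len(marginals)

# marginals[m] maps candidate heads to P_mar(head(m)=h | x) (toy values).
confident = [{0: 0.9, 2: 0.1}, {1: 0.8, 0: 0.2}]
uncertain = [{0: 0.5, 2: 0.5}, {1: 0.6, 0: 0.4}]
```

An AMP-based strategy would rank the `uncertain` sentence above the `confident` one for annotation.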
Bayesian Active Learning by Disagreement (BALD) measures the mutual information between the model parameters and the predictions. We adopt the Monte Carlo dropout-based variant (Gal et al., 2017; Siddhant and Lipton, 2018), which has previously been applied to active learning in NLP, and measure the disagreement among predictions from a neural model under K different dropout masks. We adapt BALD to dependency parsing by aggregating disagreement at the token level:

BALD(x) = 1 − (1/n) Σ_{m=1}^{n} count(mode(h^1_m, …, h^K_m)) / K,

where h^k_m denotes that the arc (h^k_m, m) appears in the prediction given by the k-th model.
Information Density (ID) mitigates the tendency of uncertainty sampling to favor outliers by weighting examples by how representative they are of the entire dataset (Settles and Craven, 2008):

ID(x) = q(x) × ((1/|U|) Σ_{x′∈U} sim(x, x′))^β,

where q(x) is a base uncertainty measure, β controls the influence of the density term, and the cosine similarity sim is computed from the averaged contextualized features (§3.2).
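A sketch of the information density computation, assuming cosine similarity over 2-dimensional toy feature vectors; the vectors and the constant uncertainty are illustrative only.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def information_density(x, pool, uncertainty, sim, beta=1.0):
    """Uncertainty weighted by average similarity to the pool:
    ID(x) = q(x) * (mean similarity)^beta (Settles and Craven, 2008)."""
    rep = sum(sim(x, u) for u in pool) / len(pool)
    return uncertainty(x) * rep ** beta

pool = [(1.0, 0.0), (0.9, 0.2), (0.8, 0.1)]
unc = lambda x: 1.0  # equal uncertainty, to isolate the density term
id_rep = information_density((1.0, 0.1), pool, unc, cosine)
id_out = information_density((0.0, 1.0), pool, unc, cosine)
```

With equal uncertainty, the representative instance receives a higher ID score than the outlier, which is exactly the intended correction.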

Learning from Partial Annotations
We follow Li et al. (2016) and select tokens for head annotation instead of annotating full sentences. We first pick the most informative sentences and then choose p% of the tokens within them based on token-level versions of the quality measures (e.g., the marginal probability of a token's predicted head instead of sentence-level AMP).

Selecting Diverse Samples
Near-duplicate examples are common in real-world data (Broder et al., 1997;Manku et al., 2007), but they provide overlapping utility to model training. In the extreme case, with a diversity-agnostic strategy for active learning, identical examples will be selected/excluded at the same time (Hwa, 2004). To address this issue and to best utilize the annotation budget, it is important to consider diversity. We adapt Bıyık et al. (2019) to explicitly model diversity using determinantal point processes (DPPs).

Determinantal Point Processes
A DPP defines a probability distribution over subsets of a ground set of elements (Kulesza, 2012). In AL, the ground set is the unlabeled pool U and a subset corresponds to a batch of instances drawn from U. DPPs provide an explicit mechanism for selecting high-quality yet diverse batches by modeling both the quality measures and the similarities among examples. We adopt the L-ensemble representation of DPPs with the quality-diversity decomposition (Kulesza and Taskar, 2012) and parameterize the matrix L as L_ij = q_i φ_i φ_j^T q_j, where q_i ∈ R is the quality measure for U_i and φ_i ∈ R^{1×d} is a d-dimensional vector representation of U_i, which we refer to as U_i's diversity features. The probability of selecting a batch B is given by P(B) ∝ det(L_B), where det(·) denotes the determinant and L_B is the submatrix of L indexed by the elements of B.
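The quality-diversity decomposition can be made concrete with a small sketch: building L from toy q values and unit-normalized φ vectors, the determinant of an identical pair collapses to zero while a dissimilar pair retains volume.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_kernel(q, phi):
    """Quality-diversity decomposition: L[i][j] = q_i <phi_i, phi_j> q_j
    (phi vectors are assumed unit-normalized)."""
    n = len(q)
    return [[q[i] * dot(phi[i], phi[j]) * q[j] for j in range(n)]
            for i in range(n)]

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

s = 1 / math.sqrt(2)
phi = [(1.0, 0.0), (1.0, 0.0), (s, s)]  # items 0 and 1 are identical
q = [1.0, 1.0, 1.0]
L = build_kernel(q, phi)

dup = det2([[L[0][0], L[0][1]], [L[1][0], L[1][1]]])  # identical pair
div = det2([[L[0][0], L[0][2]], [L[2][0], L[2][2]]])  # dissimilar pair
```

The identical pair has determinant 0, i.e., zero probability under the DPP, while the dissimilar pair spans nonzero volume.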
DPPs place high probability on diverse subsets of high-quality items. Intuitively, the determinant of L_B corresponds to the squared volume spanned by the set of vectors {q_i φ_i | i ∈ B}, and subsets with larger q values and near-orthogonal φ vectors span larger volumes than those with smaller q values or similar φ vectors. We follow Kulesza (2012) and adapt their greedy algorithm for finding the approximate mode argmax_B P(B); the algorithm is reproduced as Algorithm E1 in the appendix.
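A simplified sketch of the greedy approximate MAP inference, filling the size budget without the early-stopping check of the full algorithm. The determinant is computed by Laplace expansion, which is adequate for small illustrative batches; the kernel below is a hypothetical example with two identical high-quality items and one orthogonal lower-quality item.

```python
def det(M):
    """Determinant via Laplace expansion (fine for small matrices)."""
    if not M:
        return 1.0
    if len(M) == 1:
        return M[0][0]
    total = 0.0
    for j in range(len(M)):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1.0) ** j * M[0][j] * det(minor)
    return total

def submatrix(L, idx):
    return [[L[i][j] for j in idx] for i in idx]

def greedy_map(L, budget):
    """Greedily grow the batch, at each step adding the item that
    maximizes det(L_B) over the already-selected items."""
    selected, remaining = [], list(range(len(L)))
    for _ in range(budget):
        if not remaining:
            break
        best = max(remaining,
                   key=lambda i: det(submatrix(L, selected + [i])))
        selected.append(best)
        remaining.remove(best)
    return selected

# Items 0 and 1 are identical near-duplicates; item 2 is orthogonal
# with slightly lower quality (q^2 = 0.64).
L = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 0.64]]
batch = greedy_map(L, 2)
```

With a budget of 2, the greedy procedure picks one of the duplicates and the orthogonal item rather than both duplicates, since the duplicate pair has determinant 0.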

Diversity Features
We consider two possibilities for the diversity features φ. Each feature vector is unit-normalized.
Averaged Contextualized Features are defined as (1/n) Σ_i x_i, where x_i is the contextualized vector for x_i from the feature extractor used by the parser. Under this definition, instances are considered similar when the neural feature extractor returns similar features, in which case the parser is likely to predict similar structures for them.
Predicted Subgraph Counts explicitly represent the predicted tree structure. To balance richness and sparsity, we count the labeled but unlexicalized subgraphs formed by each token, its parent, and its grandparent. Specifically, for each token m, we extract a subgraph denoted (r1, r2), where r1 is the predicted dependency relation between m's grandparent g and its parent h, and r2 is the relation between h and m. The parse tree of a length-n sentence contains n such subgraphs. We apply tf-idf weighting to discount subgraphs that are common across the corpus.
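The subgraph extraction and tf-idf weighting can be sketched as follows; the encoding (heads as a 0-rooted index list, a special "<root>" grandparent relation for root attachments, and the plain log(N/df) idf) is an assumption for illustration.

```python
import math
from collections import Counter

def subgraphs(heads, rels):
    """For each token m, extract (r1, r2): r1 is the relation between m's
    grandparent and parent, r2 the relation between its parent and m.
    Tokens attached to the dummy root get a '<root>' grandparent relation.
    heads[0] and rels[0] are placeholders for the root symbol."""
    feats = Counter()
    for m in range(1, len(heads)):
        h = heads[m]
        r1 = rels[h] if h != 0 else "<root>"
        feats[(r1, rels[m])] += 1
    return feats

def tfidf(corpus_feats):
    """Weight counts by inverse document frequency to discount subgraphs
    that appear in many sentences."""
    n = len(corpus_feats)
    df = Counter(k for feats in corpus_feats for k in feats)
    return [{k: v * math.log(n / df[k]) for k, v in feats.items()}
            for feats in corpus_feats]

# Toy tree: token 1 is the root's child; tokens 2 and 3 attach to token 1.
feats = subgraphs([None, 0, 1, 1], [None, "root", "nsubj", "obj"])
```

A length-3 sentence yields exactly 3 subgraphs, as stated above; the tf-idf pass then zeroes out subgraphs present in every sentence.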

Setting
We perform experiments by simulating the annotation process using treebank data. We sample 128 sentences uniformly for the initial labeled pool and each following round selects 500 tokens for partial annotation. We run each setting five times using different random initializations and report the means and standard deviations of the labeled attachment scores (LAS). Appendix B has unlabeled attachment score (UAS) results.
Baselines While we construct our own baselines for self-contained comparisons, the diversity-agnostic AMP (w/o DPP) largely replicates the state-of-the-art selection strategy of Li et al. (2016).
Implementation We finetune a pretrained multilingual XLM-RoBERTa base model (Conneau et al., 2020) as our feature extractor. See Appendix E for implementation details.


Results

Table 1 compares LAS after 5 and 10 rounds of annotation. Our dependency parser reaches 95.64 UAS and 94.06 LAS when trained on the full dataset (more than one million tokens). Training data collected over 30 annotation rounds (≈ 17,500 tokens) correspond to roughly 2% of the full dataset, but already support an LAS of up to 92 through AL. We find that diversity-aware strategies generally improve over their diversity-agnostic counterparts. Even for a random selection strategy, ensuring diversity with a DPP is superior to simple random selection. With AMP and BALD, our diversity-aware strategy sees a larger improvement earlier in the learning process. ID models the representativeness of instances, and our diversity-aware strategy adds less utility compared with the other quality measures, although we do notice a large improvement after the first annotation round for ID: 82.40 ±.48 vs. 83.36 ±.54 (w/ DPP), a similar trend to AMP and BALD, but at an earlier stage of AL. Figure 1 compares our two definitions of diversity features: predicted subgraph counts provide stronger performance than averaged contextualized features. We hypothesize that this is because the subgraph counts represent structures more explicitly, making them more useful for maintaining structural diversity in AL.

Intra-Batch Diversity
To quantify intra-batch diversity among the set of sentences B picked by a selection strategy, we adapt the measures used by Chen et al. (2018) and define the intra-batch average distance (IBAD) and intra-batch minimal distance (IBMD) as follows:

IBAD(B) = mean_{i≠j∈B} d(φ_i, φ_j),   IBMD(B) = min_{i≠j∈B} d(φ_i, φ_j),

where d(φ_i, φ_j) = 1 − cos(φ_i, φ_j) is the cosine distance between diversity features. A higher value on these measures indicates better intra-batch diversity. Figure 2 compares diversity-agnostic and diversity-aware sampling strategies using the two different diversity features. We confirm that DPPs indeed promote diverse samples in the selected batches, while intra-batch diversity naturally increases over rounds even for the diversity-agnostic strategies. Additionally, we observe that the benefits of DPPs are more prominent when using predicted subgraph counts compared with averaged contextualized features, which helps explain the relative success of the former diversity features.
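The two measures can be computed directly from pairwise cosine distances; this sketch assumes the batch is given as a list of feature vectors.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def batch_diversity(batch):
    """IBAD: mean pairwise cosine distance within the batch;
    IBMD: minimum pairwise cosine distance (worst-case redundancy)."""
    dists = [1.0 - cosine(a, b)
             for i, a in enumerate(batch) for b in batch[i + 1:]]
    return sum(dists) / len(dists), min(dists)

orthogonal = batch_diversity([(1.0, 0.0), (0.0, 1.0)])
redundant = batch_diversity([(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```

A batch containing an exact duplicate drives IBMD to zero even when IBAD stays moderate, which is why we report both.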
Corpus Duplication Setting In our qualitative analysis (Appendix C), we find that diversity-agnostic selection strategies tend to select near-duplicate sentences. To examine this phenomenon in isolation, we repeat the training corpus twice and observe the effect of diversity-aware strategies; the corpus duplication technique has previously been used to probe semantic models (Schofield et al., 2017). Figure 3 shows learning curves for strategies under the original and corpus duplication settings. As expected, diversity-aware strategies consistently outperform their diversity-agnostic counterparts across both settings, while some diversity-agnostic strategies (e.g., AMP) even underperform uniform random selection in the duplicated setting.
Interpreting the Effectiveness of Diversity-Agnostic Models Figure 4 visualizes the density distributions of the top 200 data instances by AMP over the diversity feature space, reduced to two dimensions through t-SNE (van der Maaten and Hinton, 2008). During the initial stage of active learning, the data with the highest quality measures are concentrated within a small neighborhood, so a diversity-agnostic strategy will sample similar examples for annotation. After a few rounds of annotation and model training, the distribution of high-quality examples spreads out, and an AMP selection strategy is likely to sample a diverse set of examples even without explicitly modeling diversity. Our analysis corroborates previous findings (Thompson et al., 1999) that small annotation batches are effective early in uncertainty sampling, avoiding the selection of many near-duplicate examples when intra-batch diversity is low, while a larger batch size is more efficient later in training once intra-batch diversity increases.


Related Work

Due to the high annotation cost, AL is a popular technique for parsing and parse selection (Osborne and Baldridge, 2004). Recent advances focus on reducing full-sentence annotations to a subset of tokens within a sentence (Sassano and Kurohashi, 2010; Mirroshandel and Nasr, 2011; Majidi and Crane, 2013; Flannery and Mori, 2015; Li et al., 2016). Bıyık et al. (2019) model batch diversity in AL with DPPs; building on their approach, we flesh out a DPP treatment of AL for a structured prediction task, dependency parsing. Previously, Shen et al. (2018) considered named entity recognition, but reported negative results for a diversity-inducing variant of their sampling method. We show that AL for parsing can further benefit from diversity-aware sampling strategies.
DPPs have previously been successfully applied to the tasks of extractive text summarization (Cho et al., 2019a,b) and modeling phoneme inventories (Cotterell and Eisner, 2017). In this work, we show that DPPs also provide a useful framework for understanding and modeling quality and diversity in active learning for NLP tasks.

Conclusion
We show that, compared with their diversity-agnostic counterparts, diversity-aware sampling strategies not only lead to higher data efficiency, but are also more robust under corpus duplication settings. Our work invites future research into the methods, utility, and success conditions for modeling diversity in active learning for NLP tasks.

Appendix A Dependency Parser
We adopt the deep biaffine dependency parser proposed by Dozat and Manning (2017). The parser is conceptually simple yet competitive with state-of-the-art dependency parsers. It has three components: feature extraction, unlabeled parsing, and relation labeling.
Feature Extraction For a length-n sentence x = x_0, x_1, x_2, …, x_n, where x_0 is the dummy root symbol, we extract a contextualized feature vector x_i at each word position. In our experiments, we use a pretrained multilingual XLM-RoBERTa base model (Conneau et al., 2020) and fine-tune the feature extractor along with the rest of our parser.

Unlabeled Parser The parser uses a deep biaffine attention mechanism to derive locally-normalized attachment probabilities for all potential head-dependent pairs:

P_att(head(m) = h | x) ∝ exp([h_h^arc-head; 1]^T U^arc [h_m^arc-dep; 1]),

where h_h^arc-head = MLP^arc-head(x_h) and h_m^arc-dep = MLP^arc-dep(x_m); MLP^arc-head and MLP^arc-dep are two multi-layer perceptrons (MLPs) projecting x vectors into d_arc-dimensional h vectors, [·; 1] appends an element of 1 at the end of a vector, and U^arc ∈ R^(d_arc+1)×(d_arc+1) is a bilinear scoring matrix. This component is trained with the cross-entropy loss of the gold-standard attachments. During inference, we use the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) to find the spanning tree with the highest product of locally-normalized attachment probabilities.
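The biaffine scoring and softmax normalization can be sketched with toy dimensions (here d_arc = 1, so U is 2×2 after appending the bias element; all numbers are hypothetical):

```python
import math

def biaffine_score(h_head, h_dep, U):
    """[h_head; 1]^T U [h_dep; 1] for one candidate head-dependent pair."""
    hh = list(h_head) + [1.0]
    hd = list(h_dep) + [1.0]
    return sum(hh[i] * U[i][j] * hd[j]
               for i in range(len(hh)) for j in range(len(hd)))

def attachment_probs(head_reprs, dep_repr, U):
    """Softmax over candidate heads gives P_att(head(m)=h | x)."""
    scores = [biaffine_score(h, dep_repr, U) for h in head_reprs]
    z = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - z) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two candidate heads for one dependent; higher bilinear score wins.
probs = attachment_probs([[2.0], [0.5]], [1.0], [[1.0, 0.0], [0.0, 0.0]])
```

The probabilities are locally normalized per dependent, matching the equation above; tree-level decoding then combines them across tokens.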

Relation Labeler
The relation labeling component employs a similar deep biaffine scoring function as the unlabeled parsing component:

P_rel(rel(h, m) = r | x) ∝ exp([h_h^rel-head; 1]^T U_r^rel [h_m^rel-dep; 1]),

where each U_r^rel ∈ R^(d_rel+1)×(d_rel+1), and there are as many such matrices as the size of the dependency relation label set |R|. The relation labeler is trained using the cross-entropy loss on the gold-standard head-dependent pairs. During inference, the labeling decision for each arc is made independently given the predicted unlabeled parse tree.

Table B1: UAS after 5 and 10 rounds of annotation (roughly 5,000 and 7,000 training tokens respectively), comparing strategies with and without modeling diversity through DPP.

Appendix B Results with UAS Evaluation
We also evaluate the different learning strategies based on unlabeled attachment scores (UAS); the results are shown in Table B1. In line with the LAS-based experiments, we find that modeling diversity is more helpful during the initial stages of learning. For ID, we observe this effect even earlier than the fifth round of annotation: 86.57 ±.44 vs. 87.40 ±.51 (w/ DPP) after the first annotation round.

Appendix C Sentence Selection Examples
In Table C2 we compare batches sampled by a diversity-aware selection strategy with those of a diversity-agnostic one. We observe that by modeling diversity in the sample selection process, DPPs avoid selecting duplicate or near-duplicate sentences, so that the annotation budget can be maximally utilized.

Sentences selected by AMP (highest-quality ones first):

Downgraded by Moody 's were Houston Lighting 's first -mortgage bonds and secured pollution -control bonds to single -A -3 from single -A -2 ; unsecured pollution -control bonds to Baa -1 from single -A -3 ; preferred stock to single -A -3 from single -A -2 ; a shelf registration for preferred stock to a preliminary rating of single -A -3 from a preliminary rating of single -A -2 ; two shelf registrations for collateralized debt securities to a preliminary rating of single -A -3 from a preliminary rating of single -A -2 , and the unit 's rating for commercial paper to Prime -2 from Prime -1 .
For a while in the 1970s it seemed Mr. Moon was on a spending spree , with such purchases as the former New Yorker Hotel and its adjacent Manhattan Center ; a fishing / processing conglomerate with branches in Alaska , Massachusetts , Virginia and Louisiana ; a former Christian Brothers monastery and the Seagram family mansion ( both picturesquely situated on the Hudson River ) ; shares in banks from Washington to Uruguay ; a motion picture production company , and newspapers , such as the Washington Times , the New York City Tribune ( originally the News World ) , and the successful Spanish -language Noticias del Mundo .
→ LONDON LATE EURODOLLARS : 8 11/16 % to 8 9/16 % one month ; 8 5/8 % to 8 1/2 % two months ; 8 5/8 % to 8 1/2 % three months ; 8 9/16 % to 8 7/16 % four months ; 8 1/2 % to 8 3/8 % five months ; 8 1/2 % to 8 3/8 % six months .
→ LONDON LATE EURODOLLARS : 8 3/4 % to 8 5/8 % one month ; 8 3/4 % to 8 5/8 % two months ; 8 11/16 % to 8 9/16 % three months ; 8 9/16 % to 8 7/16 % four months ; 8 1/2 % to 8 3/8 % five months ; 8 7/16 % to 8  .
When a RICO TRO is being sought , the prosecutor is required , at the earliest appropriate time , to state publicly that the government 's request for a TRO , and eventual forfeiture , is made in full recognition of the rights of third parties -that is , in requesting the TRO , the government will not seek to disrupt the normal , legitimate business activities of the defendant ; will not seek through use of the relation -back doctrine to take from third parties assets legitimately transferred to them ; will not seek to vitiate legitimate business transactions occurring between the defendant and third parties ; and will , in all other respects , assist the court in ensuring that the rights of third parties are protected , through proceeding under RICO and otherwise .
COMMERCIAL PAPER placed directly by General Motors Acceptance

Sentences selected by AMP with diversity-inducing DPP:

Downgraded by Moody 's were Houston Lighting 's first -mortgage bonds and secured pollution -control bonds to single -A -3 from single -A -2 ; unsecured pollution -control bonds to Baa -1 from single -A -3 ; preferred stock to single -A -3 from single -A -2 ; a shelf registration for preferred stock to a preliminary rating of single -A -3 from a preliminary rating of single -A -2 ; two shelf registrations for collateralized debt securities to a preliminary rating of single -A -3 from a preliminary rating of single -A -2 , and the unit 's rating for commercial paper to Prime -2 from Prime -1 .
When a RICO TRO is being sought , the prosecutor is required , at the earliest appropriate time , to state publicly that the government 's request for a TRO , and eventual forfeiture , is made in full recognition of the rights of third parties -that is , in requesting the TRO , the government will not seek to disrupt the normal , legitimate business activities of the defendant ; will not seek through use of the relation -back doctrine to take from third parties assets legitimately transferred to them ; will not seek to vitiate legitimate business transactions occurring between the defendant and third parties ; and will , in all other respects , assist the court in ensuring that the rights of third parties are protected , through proceeding under RICO and otherwise .
COMMERCIAL PAPER placed directly by General Motors Acceptance
Moreover , the process is n't without its headaches .
For a while in the 1970s it seemed Mr. Moon was on a spending spree , with such purchases as the former New Yorker Hotel and its adjacent Manhattan Center ; a fishing / processing conglomerate with branches in Alaska , Massachusetts , Virginia and Louisiana ; a former Christian Brothers monastery and the Seagram family mansion ( both picturesquely situated on the Hudson River ) ; shares in banks from Washington to Uruguay ; a motion picture production company , and newspapers , such as the Washington Times , the New York City Tribune ( originally the News World ) , and the successful Spanish -language Noticias del Mundo .
Within the paper sector , Mead climbed 2 3/8 to 38 3/4 on 1.3 million shares , Union Camp rose 2 3/4 to 37 3/4 , Federal Paper Board added 1 3/4 to 23 7/8 , Bowater gained 1 1/2 to 27 1/2 , Stone Container rose 1 to 26 1/8 and Temple -Inland jumped 3 3/4 to 62 1/4 .
We finally rendezvoused with our balloon , which had come to rest on a dirt road amid a clutch of Epinalers who watched us disassemble our craft -another half -an -hour of non-flight activity -that included the precision routine of yanking the balloon to the ground , punching all the air out of it , rolling it up and cramming it and the basket into the trailer .
It is the stuff of dreams , but also of traumas .
An inquiry into his handling of Lincoln S&L inevitably will drag in Sen. Cranston and the four others , Sens. Dennis DeConcini ( D. , Ariz. ) , John McCain ( R. , Ariz. ) , John Glenn ( D. , Ohio ) and Donald Riegle ( D. , Mich . ) .
Five officials of this investment banking firm were elected directors : E. Garrett Bewkes III , a 38 -year -old managing director in the mergers and acquisitions department ; Michael R. Dabney , 44 , a managing director who directs the principal activities group which provides funding for leveraged acquisitions ; Richard Harriton , 53 , a general partner who heads the correspondent clearing services ; Michael Minikes , 46 , a general partner who is treasurer ; and William J. Montgoris , 42 , a general partner who is also senior vice president of finance and chief financial officer .
But as they hurl fireballs that smolder rather than burn , and relive old duels in the sun , it 's clear that most are there to make their fans cheer again or recapture the camaraderie of seasons past or prove to themselves and their colleagues that they still have it -or something close to it .
It is no coincidence that from 1844 to 1914 , when the Bank of England was an independent private bank , the pound was never devalued and payment of gold for pound notes was never suspended , but with the subsequent nationalization of the Bank of England , the pound was devalued with increasing frequency and its use as an international medium of exchange declined .
The $ 4 billion in bonds break down as follows : $ 1 billion in five -year bonds with a coupon rate of 8.25 % and a yield to maturity of 8.33 % ; $ 1 billion in 10 -year bonds with a coupon rate of 8.375 % and a yield to maturity of 8.42 % ; $ 2 billion in 30 -year bonds with five -year call protection , a coupon rate of 8.75 % and a yield to maturity of 9.06 % .
Hecla Mining rose 5/8 to 14 ; Battle Mountain Gold climbed 3/4 to 16 3/4 ; Homestake Mining rose 1 1/8 to 16 7/8 ; Lac Minerals added 5/8 to 11 ; Placer Dome went up 7/8 to 16 3/4 , and ASA Ltd. jumped 3 5/8 to 49 5/8 .

Table C2: Sentences picked by a diversity-agnostic (top) and a diversity-aware (bottom) selection strategy from the same unlabeled pool after the initial round of model training on the seed sentences. The diversity-agnostic strategy selects many near-duplicate sentences (near-duplicate clusters are marked by →), effectively wasting the annotation budget, whereas DPPs largely alleviate this issue by enforcing diversity.

Appendix D BALD under High Duplication Setting

Figure D1 shows the learning curves for BALD-based selection strategies under a high corpus duplication setting where the corpus is repeated five times. In this extreme setting, the diversity-agnostic strategy significantly underperforms the diversity-aware one. We posit that the relative success of BALD compared to AMP in the twice-duplicated setting is due to the fact that BALD randomly draws dropout masks to estimate model uncertainty, so that identical examples can still receive different quality measures.

Algorithm E1: Greedy MAP inference for DPP with a size budget, adapted from Kulesza (2012).
Input: candidate item set X (sentences or tokens), DPP represented by matrix L, size budget b
U ← X; Y ← ∅
while |Y| < b and U ≠ ∅ do
  i* ← argmax_{i∈U} det(L_{Y ∪ {i}})
  if det(L_{Y ∪ {i*}}) ≤ det(L_Y) then break
  Y ← Y ∪ {i*}; U ← U \ {i*}
end
Output: selected items Y