Dissecting Span Identification Tasks with Performance Prediction

Span identification (in short, span ID) tasks, such as chunking, NER, or code-switching detection, ask models to identify and classify relevant spans in a text. Although these tasks are a staple of NLP and share a common structure, there is little insight into how their properties influence their difficulty, and thus little guidance on which model families work well on span ID tasks, and why. We analyze span ID tasks via performance prediction, estimating how well neural architectures do on different tasks. Our contributions are: (a) we identify key properties of span ID tasks that can inform performance prediction; (b) we carry out a large-scale experiment on English data, building a model to predict performance for unseen span ID tasks that can support architecture choices; (c) we investigate the parameters of the meta-model, yielding new insights into how model and task properties interact to affect span ID performance. We find, e.g., that span frequency is especially important for LSTMs, and that CRFs help when spans are infrequent and boundaries non-distinctive.


Introduction
Span identification is a family of analysis tasks that make up a substantial portion of applied NLP. Span identification (in short, span ID) tasks have in common that they identify and classify contiguous spans of tokens within a running text. Examples are named entity recognition (Nadeau and Sekine, 2007), chunking (Tjong Kim Sang and Buchholz, 2000), entity extraction (Etzioni et al., 2005), quotation detection (Pareti, 2016), keyphrase detection (Augenstein et al., 2017), and code switching (Pratapa et al., 2018). In terms of complexity, span ID tasks form a middle ground between simpler analysis tasks that predict labels for single linguistic units (such as lemmatization (Porter, 1980) or sentiment polarity classification (Liu, 2012)) and more complex analysis tasks such as relation extraction, which combines span ID with relation identification (Zelenko et al., 2002; Adel et al., 2018).
Due to the rapid development of deep learning, an abundance of model architectures is available for the implementation of span ID tasks. These include isolated token classification models (Berger et al., 1996; Chieu and Ng, 2003), probabilistic models such as hidden Markov models (Rabiner, 1989), maximum entropy Markov models (McCallum et al., 2000), and conditional random fields (Lafferty et al., 2001), recurrent neural networks such as LSTMs (Hochreiter and Schmidhuber, 1997), and transformers such as BERT (Devlin et al., 2019).
Though we have some understanding of what each of these models can and cannot learn, there is, to our knowledge, little work on systematically understanding how different span ID tasks compare: are there model architectures that work well generally? Can we identify properties of span ID tasks that can help us select suitable model architectures on a task-by-task basis? Answers to these questions could narrow the scope of architecture search for these tasks, and could help with comparisons between existing methods and more recent developments.
In this work, we address these questions by applying meta-learning to span identification (Vilalta and Drissi, 2002; Vanschoren, 2018). Meta-learning means "systematically observing how different machine learning approaches perform [...] to learn new tasks much faster" (Vanschoren, 2018), with examples such as architecture search (Elsken et al., 2019) and hyperparameter optimization (Bergstra and Bengio, 2012). Our specific approach is to apply performance prediction for span ID tasks, using both task properties and model architectures as features, in order to obtain a better understanding of the differences among span ID tasks.
Concretely, we collect a set of English span ID tasks, quantify key properties of the tasks (such as how distinct the spans are from their context, and how clearly their boundaries are marked) and formulate hypotheses linking properties to performance (Section 2). Next, we describe relevant neural model architectures for span ID (Section 3). We then train a linear regressor as a meta-model to predict span ID performance based on model features and task metrics in an unseen-task setting (Section 4). We find the best of these architectures perform at or close to the state of the art, and their success can be relatively well predicted by the meta-model (Section 5). Finally, we carry out a detailed analysis of the regression model's parameters (Section 6), gaining insight into the relationship between span ID tasks and different neural model architectures. For example, we establish that spans that are not very distinct from their context are consistently difficult to identify, but that CRFs are specifically helpful for this class of span ID tasks.

Datasets, Span Types, and Hypotheses
We work with five widely used English span ID datasets. All of them have non-overlapping spans from a closed set of span types. In the following, we discuss (properties of) span types, assuming that each span type maps onto one span ID task.

Datasets
Quotation Detection: PARC and RIQUA. The Penn Attribution Relation Corpus (PARC) version 3.0 (Pareti, 2016) and the Rich Quotation Attribution Corpus (RIQUA, Papay and Padó, 2020) are two datasets for quotation detection: models must identify direct and indirect quotation spans in text, which can be useful for social network construction (Elson et al., 2010) and coreference resolution (Almeida et al., 2014). The corpora cover articles from the Penn Treebank (PARC) and 19th-century English novels (RIQUA), respectively. Within each text, quotations are identified, along with each quotation's speaker (or source), and its cue (an introducing word, usually a verb like "said"). We model detection of quotations as well as cues. As speaker and addressee identification are relation extraction tasks, we exclude these span types.
[...] Krallinger et al., 2015). OntoNotes, a general language NER corpus, is our largest dataset, with over 2.2 million tokens. The NER layer comprises 18 span types, both typical entity types such as 'Person' and 'Organization' as well as numerical value types such as 'Date' and 'Quantity'. We use all span types. ChemDNer is a NER corpus specific to chemical and drug names, comprising titles and abstracts from 10,000 PubMed articles. It labels names of chemicals and drugs and assigns them to eight classes, corresponding to chemical name nomenclatures. We use seven span types: 'Abbreviation', 'Family', 'Formula', 'Identifier', 'Systematic', 'Trivial', and 'Multiple'. We exclude the class 'No class' as infrequent (<100 instances).

Span Type Properties and Hypotheses
While quotation detection, chunking, and named entity recognition are all span ID tasks, they vary quite widely in their properties. As mentioned in the introduction, we know of little work on quantifying the similarities and differences of span types, and thus, span ID tasks. We now present four metrics which we propose to capture the relevant characteristics of span types, and make concrete our hypotheses regarding their effect on model performance. Table 1 reports frequency-weighted averages for each metric on each dataset. See Table 7 in the Appendix for all span-type-specific values.
Frequency is the number of spans for a span type in the dataset's training corpus. It is well established that the performance of a machine learning model benefits from higher amounts of training data (Halevy et al., 2009). Thus, we expect this property to be positively correlated with performance. However, some architectural choices, such as the use of transfer learning, are purported to reduce the data requirements of machine learning models (Pan and Yang, 2009), so we expect a smaller correlation for architectures which incorporate transfer learning.
Span length is the geometric mean of spans' lengths, in tokens. Scheible et al. (2016) note that traditional CRF models perform poorly at the identification of long spans due to the strict Markov assumption they make (Lafferty et al., 2001). Thus, we expect architectures which rely on such assumptions and which have no way to model long distance dependencies to perform poorly on span types with a high average span length, while LSTMs or transformers should do better on long spans (Khandelwal et al., 2018; Vaswani et al., 2017).
Span distinctiveness is a measure of how distinctive the text that comprises spans is compared to the overall text of the corpus. Formally, we define it as the KL divergence $D_{\mathrm{KL}}(P_{\text{span}} \,\|\, P)$, where $P$ is the unigram word distribution of the corpus, and $P_{\text{span}}$ is the unigram distribution of tokens within a span. A high span distinctiveness indicates that different words are used inside spans compared to the rest of the text, while a low span distinctiveness indicates that the word distribution is similar inside and outside of spans.
We expect this property to be positively correlated with model performance. Furthermore, we hypothesize that span types with a high span distinctiveness should be able to rely more heavily on local features, as each token carries strong information about span membership, while low span distinctiveness calls for sequence information. Consequently, we expect that architectures incorporating sequence models such as CRFs, LSTMs, and transformers should perform better at low-distinctive span types.
Boundary distinctiveness is a measure of how distinctive the starts and ends of spans are. We formalize this in terms of a KL divergence as well, namely as $D_{\mathrm{KL}}(P_{\text{boundary}} \,\|\, P)$ between the unigram word distribution ($P$) and the distribution of boundary tokens ($P_{\text{boundary}}$), where boundary tokens are those which occur immediately before the start of a span, or immediately after the end of a span. A high boundary distinctiveness indicates that the start and end points of spans are easy to spot, while low distinctiveness indicates smooth transitions.
We expect boundary distinctiveness to be positively correlated with model performance, based on studies that obtained improvements from specifically modeling the transition between span and context (Todorovic et al., 2008; Scheible et al., 2016). As sequence information is required to utilize boundary information, high boundary distinctiveness should improve performance more for LSTMs, CRFs, or transformers.
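To make the four metrics concrete, the following is a minimal Python sketch of how they could be computed for a single span type. The function names, the input format (token lists plus `(sentence index, start, end)` triples), the use of natural logarithms, and the choice to skip boundary tokens that fall outside a sentence are all assumptions for illustration, not a description of the released code.

```python
import math
from collections import Counter

def unigram_dist(tokens):
    """Relative frequency of each token in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q[w] > 0 wherever p[w] > 0."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items())

def span_type_metrics(sentences, spans):
    """sentences: list of token lists; spans: (sentence index, start, end)
    triples for ONE span type, with `end` exclusive."""
    corpus = unigram_dist([tok for sent in sentences for tok in sent])
    inside, boundary, lengths = [], [], []
    for sent_idx, start, end in spans:
        sent = sentences[sent_idx]
        inside.extend(sent[start:end])
        lengths.append(end - start)
        if start > 0:            # token immediately before the span
            boundary.append(sent[start - 1])
        if end < len(sent):      # token immediately after the span
            boundary.append(sent[end])
    return {
        "frequency": len(spans),
        # geometric mean of span lengths, in tokens
        "span_length": math.exp(sum(math.log(n) for n in lengths) / len(lengths)),
        "span_distinctiveness": kl_divergence(unigram_dist(inside), corpus),
        "boundary_distinctiveness": kl_divergence(unigram_dist(boundary), corpus),
    }
```

Because the corpus distribution is estimated over all tokens, every span or boundary token has non-zero corpus probability, so no smoothing is needed in this sketch.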
Task Profiles. As Table 1 shows, the metrics we propose appear to capture the task structure of the datasets well: quotation corpora have long spans with low span distinctiveness (anything can be said) but high boundary distinctiveness (punctuation, cues). Chunking has notably low boundary distinctiveness, due to the syntactic nature of the span types, and NER spans show high distinctiveness (semantic classes) but are short and have somewhat indistinct boundaries as well.

Model Architectures
For span identification, we use the BIO framework (Ramshaw and Marcus, 1999), framing span identification as a sequence labeling task. As each span type has its own B and I labels, and there is one O label, a dataset with n span types leads to a (2n + 1)-label classification problem for each token.
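As an illustration of the BIO framing, the sketch below builds the label set for a list of span types and converts non-overlapping spans into a tag sequence. The function names and the `(start, end, span_type)` input format with exclusive `end` are hypothetical choices made for this example.

```python
def bio_label_set(span_types):
    """One B- and one I- label per span type, plus a single O label."""
    labels = ["O"]
    for span_type in span_types:
        labels += [f"B-{span_type}", f"I-{span_type}"]
    return labels

def spans_to_bio(num_tokens, spans):
    """spans: non-overlapping (start, end, span_type) triples, end exclusive."""
    tags = ["O"] * num_tokens
    for start, end, span_type in spans:
        tags[start] = f"B-{span_type}"       # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{span_type}"       # remaining tokens of the span
    return tags
```

With three span types, `bio_label_set` yields 2 · 3 + 1 = 7 labels, matching the count given above.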
We investigate a set of sequence labeling models, ranging from baselines to state-of-the-art architectures. We group our models by common components, and build complex models through combination of simpler models. Except for the models using BERT, all architectures assume one 300-dimensional GloVe embedding (Pennington et al., 2014) per token as input.
Baseline. As a baseline model, we use a simple token-level classifier. This architecture labels each token using a softmax classifier without access to sequence information (neither at the label level nor at the feature level).
CRF. This model uses a linear-chain conditional random field (CRF) to predict token label sequences (Lafferty et al., 2001). It can access neighboring labels in the sequence of predictions.
LSTM and LSTM+CRF. These architectures incorporate bidirectional LSTMs (biLSTMs) (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) as components. The simplest architecture, LSTM, passes the inputs through a 2-layer biLSTM network, and then predicts token labels using a softmax layer. The LSTM+CRF architecture combines the biLSTM network with a CRF layer, training all weights simultaneously. These models can learn to combine sequential input and labeling information.
BERT and BERT+CRF. These architectures include the pre-trained BERT language model (Devlin et al., 2019) as a component. The simplest architecture in this category, BERT, comprises a pre-trained BERT encoder and a softmax output layer, which is trained while the BERT encoder is fine-tuned. BERT+CRF combines a BERT encoder with a linear-chain CRF output layer, which directly uses BERT's output embeddings as inputs. In this architecture, the CRF layer is first trained to convergence while BERT's weights are held constant, and then both models are jointly fine-tuned to convergence. As BERT uses WordPiece tokenization (Wu et al., 2016), the input must be retokenized for BERT architectures.
BERT+LSTM+CRF. This architecture combines all components previously mentioned. It first uses a pre-trained BERT encoder to generate a sequence of contextualized embeddings. These embeddings are projected to 300 dimensions using a linear layer, yielding a sequence of vectors, which are then used as input for an LSTM+CRF network. As with BERT+CRF, we first train the non-BERT parameters to convergence while holding BERT's parameters fixed, and subsequently fine-tune all parameters jointly.
Handcrafted Features. Some studies have shown marked increases in performance by adding hand-crafted features (e.g. Shimaoka et al., 2017). We develop such features for our tasks and treat these as an additional architecture component. For architectures with this component, a bag of features is extracted for each token (the exact features used for each dataset are enumerated in Table 5 in the Appendix). For each feature, we learn a 300-dimensional feature embedding which is averaged with the GloVe or BERT embedding to obtain a token embedding. Handcrafted features can be used with the Baseline, LSTM, LSTM+CRF, and BERT+LSTM+CRF architectures. BERT and BERT+CRF cannot utilize manual features, as they have no way of accepting token embeddings as input.

Meta-learning Model
Recall that our meta-learning model is a model for predicting the performance of the model architectures from Section 3 when applied to span identification tasks from Section 2. We model this task of performance prediction as linear regression, a well established framework for the statistical analysis of language data (Baayen, 2008). The predictors are task properties, model architecture properties, and their interactions, and the dependent variable is (scaled) F1 score. While a linear model is not powerful enough to capture the full range of interactions, its weights are immediately interpretable, it can be trained on limited amounts of data, and it does not overfit easily (see Section 5.1). All three properties make it a reasonable choice for meta-learning.
Predictors and Interactions. As predictors for our performance prediction task, we use the span type properties described above and a number of binary model properties. For the span type properties [freq] and [span length], we use the logarithms of these values as predictors. The two distinctiveness properties are already logarithms, so we use them as-is. For model properties, we use four binary predicates: the presence of handcrafted features, of a CRF output layer, of a biLSTM layer, and of a BERT layer.
In addition to main effects of properties of models and corpora on performance (does a CRF layer help?), we are also interested in interactions of these properties (does a CRF layer help in particular for longer spans?). As such interactions are not captured automatically in a linear regression model, we encode them as predictors. We include interactions between span type and model properties, as well as among model properties.
All predictors (including interactions) are standardized so as to have a mean of zero and standard deviation of one.
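The construction of the predictor set can be sketched as follows. This is an illustrative reading rather than the exact feature set of our model: the predictor names are hypothetical, interactions are formed as simple products of the raw predictors before standardization, and interactions among model properties are assumed to be pairwise.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

TASK = ["log_freq", "log_span_length", "span_dist", "boundary_dist"]
MODEL = ["feat", "crf", "lstm", "bert"]

def raw_predictors(task, model):
    """task: dict with raw span type metrics; model: dict of 0/1 flags."""
    row = {
        "log_freq": math.log(task["frequency"]),
        "log_span_length": math.log(task["span_length"]),
        "span_dist": task["span_distinctiveness"],        # already logarithmic
        "boundary_dist": task["boundary_distinctiveness"],
    }
    row.update({m: float(model[m]) for m in MODEL})
    for t in TASK:                        # span type x model interactions
        for m in MODEL:
            row[f"{m}*{t}"] = row[m] * row[t]
    for a, b in combinations(MODEL, 2):   # pairwise model x model interactions
        row[f"{a}*{b}"] = row[a] * row[b]
    return row

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    out = [dict() for _ in rows]
    for name in rows[0]:
        column = [r[name] for r in rows]
        mu, sigma = mean(column), pstdev(column)
        for o, value in zip(out, column):
            # constant columns carry no information; map them to 0
            o[name] = (value - mu) / sigma if sigma > 0 else 0.0
    return out
```

Standardizing after forming the products keeps all coefficients on a comparable scale, which is what permits comparing their magnitudes later on.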
Scaling the Predicted Performance. Instead of directly predicting the F1 score, we make our predictions in a logarithmic space, which eases the linearity requirements of linear regression. We cannot directly use the logit function to transform F1 scores into $F = \operatorname{logit}(F_1 / 100)$, since the presence of zeros in our F1 scores makes this process ill-defined. Instead, we opted for a "padded" logit transformation with a hyperparameter α ∈ [0, 0.5). This rescales the argument of the logit function from [0, 1] to the smaller interval [α, 1−α], avoiding the zero problem of a bare logit. Through cross-validation (cf. Section 5.1), we set α = 0.2. We use the inverse of this transformation to scale the output of our prediction as an F1 score, clamping the result to [0, 100].
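The padded logit can be sketched as below. The explicit form used here, logit(α + (1 − 2α) · F1/100), is a reconstruction from the description above: it is the increasing affine map that sends [0, 1] onto [α, 1 − α], followed by the logit. The function names are hypothetical.

```python
import math

def padded_logit(f1, alpha=0.2):
    """Map an F1 score in [0, 100] to logit space, padded by alpha."""
    x = alpha + (1 - 2 * alpha) * f1 / 100    # rescale [0, 1] -> [alpha, 1 - alpha]
    return math.log(x / (1 - x))

def inverse_padded_logit(y, alpha=0.2):
    """Map a regression output back to an F1 score, clamped to [0, 100]."""
    # numerically stable logistic function
    if y >= 0:
        x = 1 / (1 + math.exp(-y))
    else:
        x = math.exp(y) / (1 + math.exp(y))
    f1 = 100 * (x - alpha) / (1 - 2 * alpha)
    return min(100.0, max(0.0, f1))
```

The clamping matters because an unconstrained regression output can map to values outside [α, 1 − α], which would otherwise correspond to F1 scores below 0 or above 100.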

Experimental Procedure
Our meta-learning experiment comprises two steps: span ID model training, and meta-model training.
Step 1: Span ID model training. We train and subsequently evaluate each model architecture on each dataset five times, using different random initializations. With 12 model architectures and 5 datasets under consideration, this procedure yields 12 × 5 × 5 = 300 individual experiments.
For each dataset, we use the established train/test partition. Since RIQUA does not come with such a partition, we use cross-validation, partitioning the dataset by its six authors and holding out one author per cross-validation step.
We use early stopping for regularization, stopping training once (micro-averaged) performance on a validation set reaches its maximum. To prevent overfitting, all models utilize feature dropout: during training, each feature in a token's bag of input features is dropped with a probability of 50%. At evaluation time, all features are used.
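Feature dropout as described above can be sketched in a few lines. The function name and signature are hypothetical, and the sketch leaves open which inputs count as part of the bag of features.

```python
import random

def feature_dropout(features, training, drop_prob=0.5, rng=random):
    """Drop each feature of a token's bag independently during training."""
    if not training:
        return list(features)          # evaluation: all features are used
    return [f for f in features if rng.random() >= drop_prob]
```

Passing a seeded `random.Random` instance as `rng` makes the dropout pattern reproducible across runs.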
Step 2: Meta-learning model training. This step involves training our performance prediction model on the F1 scores obtained from the first step. For each architecture-span-type pair of the 12 model architectures and 36 span types, we already obtained 5 F1 scores. This yields a total of 12 × 36 × 5 = 2160 input-output pairs to train our performance prediction model.
We investigate both L1 and L2 regularization in an elastic net setting (Zou and Hastie, 2005) but consistently find best cross-validation performance with no regularization whatsoever. Thus, we use ordinary least squares regression.
To ensure that our performance predictions generalize, we use a cross-validation setup when generating model predictions. To generate performance predictions for a particular span type, we train our meta-model on data from all other span types, holding out the span type for which we want a prediction. We repeat this for all 36 span types, holding out a different span type each time, in order to collect performance predictions for each span type.
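The held-out-span-type procedure can be sketched with a small, dependency-free least squares fit. The solver via normal equations, the function names, and the input layout (one predictor row, one scaled score, and one span type label per data point) are assumptions for illustration; the sketch also assumes the training predictors have full column rank.

```python
def fit_ols(X, y):
    """Ordinary least squares with intercept via the normal equations."""
    A = [[1.0] + list(row) for row in X]           # prepend intercept column
    k = len(A[0])
    # augmented system (A^T A | A^T y)
    M = [[sum(a[i] * a[j] for a in A) for j in range(k)]
         + [sum(a[i] * t for a, t in zip(A, y))] for i in range(k)]
    for col in range(k):                            # Gauss-Jordan elimination
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

def predict(weights, row):
    """Intercept plus weighted sum of predictors."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))

def leave_one_span_type_out(X, y, span_types):
    """Predict each point with a model trained on all other span types."""
    predictions = [None] * len(y)
    for held_out in sorted(set(span_types)):
        train = [i for i, s in enumerate(span_types) if s != held_out]
        weights = fit_ols([X[i] for i in train], [y[i] for i in train])
        for i, s in enumerate(span_types):
            if s == held_out:
                predictions[i] = predict(weights, X[i])
    return predictions
```

Holding out all data points of a span type at once, rather than individual data points, is what makes the resulting predictions estimates for an unseen task.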

Span Identification Results
Step 1 yields 5 evaluation F1 scores for each architecture-span-type pair. This section summarizes the main findings. Detailed average scores for each pair are reported in Table 8 in the Appendix. Table 2 lists the micro-averaged performance of each model architecture on each dataset. Unsurprisingly, BERT+Feat+LSTM+CRF, the model with the most components, performs best on three of the five datasets. This provides strong evidence that this architecture can perform well across many tasks. However, note that this architecture's dominance is somewhat overstated by only looking at average dataset results. Our analysis permits us to look more closely at results for individual span types, where we find that BERT+Feat+LSTM+CRF performs best on 16 of the 36 total span types, BERT+CRF on 7 span types, Feat+LSTM+CRF on 7 span types, and BERT+LSTM+CRF on 6 span types. Thus, 'bespoke' modeling of span types can evidently improve results.
Even though our architectures are task-agnostic, and not tuned to particular tasks or datasets, our best architectures still perform quite competitively. For instance, on CoNLL'00, our BERT+Feat+LSTM+CRF model comes within 0.12 F1 points of the best published model's F1 score of 97.62 (Akbik et al., 2018). For PARC, existing literature does not report micro-averaged F1 scores, but instead focuses only on F1 scores for content span detection. In this case, we find that our BERT+Feat+LSTM+CRF model beats the existing state of the art on this span type, achieving an F1 score of 78.1, compared to the score of 75 reported in Scheible et al. (2016).

Meta-learning Results
The result of Step 2 is our performance prediction model. We report the mean absolute error (MAE), i.e., the mean difference between predicted and actual F1 score for each data point, and $r^2$, which provides the amount of variance accounted for by the model. The full performance prediction model, including both span type and model architecture features, accounts for 73% of the variance, with an MAE of about 11. We see this as an acceptable model fit.
Table 2: Average architecture results on datasets. BL=Baseline, Feat=Hand-crafted features. For each dataset, we micro-average performance over all span types, and average these micro-averages across five trials. For comparability with existing work, we include all span types in these micro-averages, even those which we exclude from our performance prediction. Full performance results for each span type can be found in Table 8.
To validate the usefulness of the predictor groups and interaction terms, we carry out ablation experiments wherein these are excluded, including a model with no interaction terms, a model with only span type predictors, a model with only architecture predictors, and an empty model, which only predicts the average of all F1 scores. The reduced models do better than the empty model (for which $r^2$ is undefined because the variance of the predictions is zero), but show marked increases in MAE and corresponding drops in $r^2$ compared to the full model. While the usefulness of the architecture predictors is expected, this also constitutes strong evidence for the usefulness of the span type predictors we have proposed in Section 2.
Figure 1 shows a scatterplot of predicted and actual F1 scores. Our meta-learning model generally predicts high performances better than low performances. The largest cluster of errors occurs for experiments with an actual F1 score of exactly zero, arguably an uninteresting case. Thus, we believe that the overall MAE underestimates rather than overestimates the quality of the performance prediction for practical purposes.
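The two evaluation measures can be sketched as follows. Reading $r^2$ as the squared Pearson correlation between predicted and actual scores is an assumption, though it is consistent with the measure being undefined for a model whose predictions have zero variance; the function names are hypothetical.

```python
from statistics import mean

def mean_absolute_error(actual, predicted):
    """Mean absolute difference between actual and predicted scores."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

def r_squared(actual, predicted):
    """Squared Pearson correlation; None if either side has zero variance."""
    ma, mp = mean(actual), mean(predicted)
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    va = sum((a - ma) ** 2 for a in actual)
    vp = sum((p - mp) ** 2 for p in predicted)
    if va == 0 or vp == 0:
        return None                     # e.g. a model predicting a constant
    return cov ** 2 / (va * vp)
```

Note that under this reading a model can reach a high $r^2$ while still having a sizeable MAE, which is why both measures are reported.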

Analysis
We now investigate the linear regression coefficients of our performance prediction model to assess our hypotheses from Section 2. To obtain a single model to analyze, we retrain our regression model on all data points, with no cross-validation. Table 4 shows the resulting coefficients. Using Bonferroni correction at α = 0.05, we consider a coefficient significant if p < 0.002. Non-significant coefficients are shown in parentheses. Due to the scaling of F1 scores described in Section 4, the coefficients cannot be directly interpreted in terms of linear change on the F1 scale. However, as we standardized all predictors, we can compare coefficients with one another. Coefficients with a greater magnitude have larger effects on F1 score, with positive values indicating an increase, and negative values a decrease.
When analyzing these coefficients, one must consider main effects and interactions together. E.g., the main effect coefficient for LSTMs is negative, which seems to imply that adding an LSTM will hurt performance. However, the LSTM × [freq] and LSTM × [boundary distinctness] interactions are both strongly positive, so LSTMs should help on frequent span types with high boundary distinctiveness. Our main observations are the following:
Frequency helps, length hurts. The main effects of our span type predictors show mostly an expected pattern. Frequency has a strong positive effect (frequent span types are easier to learn), while length has an even stronger negative effect (long span types are difficult). More distinct boundaries help performance as well. More surprising is the negative sign of the span distinctiveness predictor, which would mean that more distinct spans are more difficult to recognize. However, this might be due to the negative correlation between span distinctiveness and frequency (r = −0.46 in standardized predictors): less frequent spans are, by virtue of their rarity, more distinctive.
BERT is good for performance, especially with few examples. The presence of a BERT component is the highest-impact positive predictor for model performance, with a positive coefficient of 1. This finding is not entirely surprising, given the recent popularity of BERT-based models for span identification problems (Li et al., 2020; Hu et al., 2019). Furthermore, the strong negative value of the (BERT × [freq]) predictor shows that BERT's benefits are strongest when there are few training examples, validating our hypothesis about transfer learning. BERT is also robust: it is largely independent of span or boundary distinctiveness effects.
LSTMs require a lot of data. While the main effect of LSTMs is negative, this effect is again modulated by the high positive coefficient of the (LSTM × [freq]) interaction. This means that their performance is highly dependent on the amount of training data. Also, LSTMs lead to improvements for long span types and those with distinct boundaries, properties that LSTMs arguably can pick up well but that other models struggle with.
CRFs help. After BERT, the presence of a CRF shows the second-most positive main effect on model performance. Given the strong correlation between adjacent tags in a BIO label sequence, it makes sense that a model capable of enforcing correlations in its output sequence would perform well. CRFs can also exploit span distinctiveness well, presumably by the same mechanism. Surprisingly, CRFs show reduced effectiveness for highly frequent spans with distinct boundaries. We believe that this pattern is best considered as a relative statement: for frequent, well-separated span types CRFs gain less than other model types.
Handcrafted features do not matter much. We find neither a significant main effect of handcrafted features, nor any significant interactions with span type predictors. Interactions with model predictors are significant, but rather small. While a detailed analysis of architecture-wise F1 scores does show that some architectures, such as pure CRFs, do seem to benefit more from hand-crafted features (see Table 8 in the Appendix), this effect diminishes considerably when model components are mixed.
Combining model components shows diminishing returns. All interactions between LSTM, CRF, and BERT are negative. This demonstrates an overlap in these components' utility. Thus, a simple "maximal" combination does not always perform best, as Table 2 confirms.

Related Work
Meta-learning and performance prediction are umbrella terms which comprise a variety of approaches and formalisms in the literature. We focus on the literature most relevant to our work and discuss the relationship.

Performance Prediction for Trained Models.
In NLP, a number of studies investigate predicting the performance of models that have been trained previously on novel input. An example is Chen (2009) which develops a general method to predict the performance of a family of language models. Similar ideas have been applied more recently to machine translation (Bojar et al., 2017), and automatic speech recognition (Elloumi et al., 2018), among others. While these approaches share our goal of performance prediction, they predict performance for the same task and model on new data, while we generalize across tasks and architectures.
Thus, these approaches are better suited to estimating confidence at prediction time, while our meta-learning approach can predict a model's performance before it is trained. Other works seek to explain and summarize how models perform across an entire dataset. This can be achieved e.g. through comparison of architecture performances, as in Nguyen and Guo (2007), or through meta-modeling of trained models, as was done in Weiss et al. (2018). Our present work falls into this category, including both a comparison of architectures across datasets and a meta-learning task of model performance.

Meta-learning for One- and Few-shot Learning.
A recent trend is the application of meta-learning to models for one- or few-shot learning. In this setting, a meta-learning approach is used to train models on many distinct tasks, such that they can subsequently be rapidly fine-tuned to a particular task (Finn et al., 2017; Santoro et al., 2016). While such approaches use the same meta-learning framework as we do, their task and methodology are substantially different. They focus on learning with very few training examples, while we focus on optimizing performance with normally sized corpora. Additionally, these models selectively train preselected model architectures, while we are concerned with comparisons between architectures.

Model and Corpus Comparisons in Survey Papers.
In a broad sense, our goal of comparison between existing corpora and modeling approaches is shared with many existing survey papers. Surveys include quantitative comparisons of existing systems' performances on common tasks, producing a results matrix very similar to ours (Li et al., 2020; Yadav and Bethard, 2018; Bostan and Klinger, 2018, i.a.). However, most of these surveys limit themselves to collecting results across models and datasets without performing a detailed quantitative analysis of these results to identify recurring patterns, as we do with our performance prediction approach.

Conclusion
In this work, we considered the class of span identification tasks. This class contains a number of widely used NLP tasks, but no comprehensive analysis beyond the level of individual tasks is available. We took a meta-learning perspective, predicting the performance of various architectures on various span ID tasks in an unseen-task setup. Using a number of 'key metrics' that we developed to characterize the span ID tasks, a simple linear regression model was able to do so at a reasonable accuracy. Notably, even though BERT-based architectures perform very well, as expected, we find that different variants are optimal for different tasks. We explain such patterns by interpreting the parameters of the regression model, which yields insights into how the properties of span ID tasks interact with properties of neural model architectures. Such patterns can be used for manual fine-grained model selection, but our meta-learning model could also be incorporated directly into AutoML systems. Our current study could be extended in various directions. First, the same meta-learning approach could be applied to other classes of tasks beyond span ID. Second, a larger range of span type metrics could presumably improve model fit, albeit at the cost of interpretability. Third, we only predict within-corpus performance, and corpus-level similarity metrics could be added to make predictions about performance in transfer learning.

A Training Models
All code used for training span identification and performance prediction models is available for download at our project website: https://www.ims.uni-stuttgart.de/data/span-id-meta-learning. All text logs generated during training of span identification models are included.

A.1 Hardware
All span identification models were trained using GeForce GTX 1080 Ti GPUs. Training time varied considerably across architectures; exact training times for individual experiments are found in the corresponding training logs. The performance prediction model was trained on a CPU in a few seconds.

A.2 Tokenization
PARC, OntoNotes, and CoNLL'00 include tokenization information, and we use the datasets' tokenizations directly. For RIQUA, we use spaCy (Honnibal and Montani, 2017) to word-tokenize the text. We found that spaCy's tokenization performed particularly poorly for ChemDNer, and so for this corpus we treated all sequences of alphabetic characters as a token, all sequences of numbers as a token, and all other characters as single-character tokens. For ChemDNer, we found that some spans within the corpus still did not align with token boundaries. In these cases, we excluded the spans entirely from the training data, and treated them as an automatic false-negative for evaluation purposes.
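The ChemDNer tokenization rule can be sketched with a single regular expression. This is an illustrative approximation rather than the released implementation: treating Unicode letters as alphabetic, skipping whitespace instead of emitting it as tokens, and the function names are all assumptions.

```python
import re

# alphabetic run | digit run | any single other non-whitespace character
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")

def tokenize(text):
    """Return (token, start, end) triples with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]

def span_aligns(tokens, start, end):
    """True if a character span starts and ends on token boundaries."""
    starts = {s for _, s, _ in tokens}
    ends = {e for _, _, e in tokens}
    return start in starts and end in ends
```

For example, `tokenize("2-amino-3H")` splits the string into the six tokens `2`, `-`, `amino`, `-`, `3`, and `H`; a span for which `span_aligns` returns `False` corresponds to the non-aligning case described above.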
For models including a BERT component, tokens were sub-tokenized using word-piece tokenization (Wu et al., 2016) so as to be compatible with BERT. The same bag of token features was given to each word piece. Models predicted BIO sequences for these sub-tokens, and spans were only evaluated as correct when their boundaries matched exactly with the originally-tokenized corpus.

A.3 Hyperparameters
Due to the large number of experiments run, it was infeasible to do a full grid search for hyperparameters for each architecture-dataset combination. As such, we tried to pick reasonable values for hyperparameters, motivated by existing literature, prior [...]
[...] architectures, an initial learning rate of 0.001 was used. For BERT, and for the second training phase in the BERT+CRF and BERT+LSTM+CRF architectures, an initial learning rate of $2 \times 10^{-5}$ was used.

A.5 Early Stopping
To guide early stopping, micro-averaged F1 scores on the development set were computed after every epoch. These were computed for all span types, including those which were subsequently excluded from our meta-model. For datasets which had no dedicated development partition, a portion of the training set was held out for this purpose. After each epoch, model parameters were saved to disk if the development-set F1 score exceeded the best seen so far. An exponential moving average of these F1 scores was kept, and training terminated when an epoch's F1 score fell below this average. For BERT+CRF and BERT+LSTM+CRF, this same early stopping procedure was used for both training phases. The training logs list development set performance at each epoch for each experiment.
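The stopping rule can be sketched as a function over a sequence of per-epoch development scores. The smoothing factor of 0.5 is a hypothetical value, and comparing each new score against the average of the preceding epochs (before the update) is an assumed reading of the rule.

```python
def early_stopping_epoch(dev_scores, smoothing=0.5):
    """Return (stop_epoch, best_epoch) for per-epoch development F1 scores.

    stop_epoch is None if the stopping rule never triggers.
    """
    ema = None
    best_score, best_epoch = float("-inf"), None
    for epoch, score in enumerate(dev_scores):
        if score > best_score:                  # checkpoint would be saved here
            best_score, best_epoch = score, epoch
        if ema is not None and score < ema:     # fell below the running average
            return epoch, best_epoch
        ema = score if ema is None else smoothing * score + (1 - smoothing) * ema
    return None, best_epoch
```

For the scores 60, 70, 75, 74, 71, the running average after four epochs is 72, so training stops at the fifth epoch (index 4) and the checkpoint from the third epoch (index 2) is the one kept.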

A.6 Features
Table 5 lists all manual features that were used in models with the "Feat" component.