HARE: a Flexible Highlighting Annotator for Ranking and Exploration

Exploration and analysis of potential data sources is a significant challenge in the application of NLP techniques to novel information domains. We describe HARE, a system for highlighting relevant information in document collections to support ranking and triage, which provides tools for post-processing and qualitative analysis for model development and tuning. We apply HARE to the use case of narrative descriptions of mobility information in clinical data, and demonstrate its utility in comparing candidate embedding features. We provide a web-based interface for annotation visualization and document ranking, with a modular backend to support interoperability with existing annotation tools. Our system is available online at https://github.com/OSU-slatelab/HARE.


Introduction
As natural language processing techniques become useful for an increasing number of new information domains, it is not always clear how best to identify information of interest, or how to evaluate the output of automatic annotation tools. This can be especially challenging when target data takes the form of long strings or narratives with complex structure, e.g., financial data (Fisher et al., 2016) or clinical data (Rosenbloom et al., 2011).
We introduce HARE, a Highlighting Annotator for Ranking and Exploration. HARE includes two main components: a workflow for supervised training of automated token-wise relevancy taggers, and a web-based interface for visualizing and analyzing automated tagging output. It is intended to serve two main purposes: (1) triage of documents when analyzing new corpora for the presence of relevant information, and (2) interactive analysis, post-processing, and comparison of output from different annotation systems.
In this paper, we demonstrate an application of HARE to information about individuals' mobility status, an important aspect of functioning concerned with changing body position or location. This is a relatively new type of health-related narrative information with largely uncharacterized linguistic structure, and high relevance to overall health outcomes and to work disability programs. In experiments on a corpus of 400 clinical records, we show that with minimal tuning, our tagger is able to produce a high-quality ranking of documents based on their relevance to mobility, and to capture mobility-relevant document segments with high fidelity. We further demonstrate the use of the post-processing and qualitative analytic components of our system to compare the impact of different feature sets and to tune processing settings to improve relevance tagging quality.

Related work
Corpus annotation tools are plentiful in NLP research: brat (Stenetorp et al., 2012) and Knowtator (Ogren, 2006) being two heavily used examples among many. However, the primary purpose of these tools is to streamline manual annotation by experts, and to support review and revision of manual annotations. Some tools, including brat, support automated pre-annotation, but analysis of these annotations and corpus exploration is not commonly included. Other tools, such as SciKnowMine, use automated techniques for triage, but for routing documents to experts for curation rather than for ranking and model analysis. Document ranking and search engines such as Apache Lucene, by contrast, can be overly full-featured for early-stage analysis of new datasets, and do not directly offer tools for annotation and post-processing.
Early efforts towards extracting mobility information have illustrated that it is often syntactically and semantically complex, and difficult to extract reliably (Newman-Griffis and Zirikly, 2018; Newman-Griffis et al., 2019). Some characterization of mobility-related terms has been performed as part of larger work on functioning (Skube et al., 2018), but a lack of standardized terminologies limits the utility of vocabulary-driven clinical NLP tools such as CLAMP (Soysal et al., 2018).

System Description
Our system has three stages for analyzing document sets, illustrated in Figure 1. First, data annotated by experts for token relevance can be used to train relevance tagging models, and trained models can be applied to produce relevance scores on new documents (Section 3.1). Second, we provide configurable post-processing tools for cleaning and smoothing relevance scores (Section 3.2). Finally, our system includes interfaces for reviewing detailed relevance output, ranking documents by their relevance to the target criterion, and analyzing qualitative outcomes of relevance scoring output (Sections 3.3-3.5); all of these interfaces allow interactive re-configuration of post-processing settings and switching between output relevance scores from different models for comparison.
For our experiments on mobility information, we use an extended version of the dataset described by Thieu et al. (2017), which consists of 400 English-language Physical Therapy initial assessment and reassessment notes from the Rehabilitation Medicine Department of the NIH Clinical Center. These text documents have been annotated at the token level for descriptions and assessments of patient mobility status. Further information on this dataset is given in Table 1. We use ten-fold cross validation for our experiments, splitting into folds at the document level.

Relevance tagging workflow
All hyperparameters discussed in this section were tuned on held-out development data in cross-validation experiments. We report the best settings here, and provide a full comparison of hyperparameter settings in the online supplements.

Preprocessing
Different domains exhibit different patterns in token and sentence structure that affect preprocessing. In clinical text, there is no consensus on tokenization, and a variety of different tokenizers are in regular use (Savova et al., 2010; Soysal et al., 2018). As mobility information is relatively unexplored, we relied on general-purpose tokenization with spaCy (Honnibal and Montani, 2017) as our default tokenizer, and WordPiece (Wu et al., 2016) for experiments using BERT. We did not apply sentence segmentation, as clinical toolkits often produced short segments that interrupted mobility information in our experiments.

Feature extraction
Our system supports feature extraction for individual tokens in input documents using both static and contextualized word embeddings.

Static embeddings Using static (i.e., non-contextualized) embeddings, we calculate input features for each token as the mean embedding of the token and the 10 words on each side (truncated at sentence/line breaks). We used FastText (Bojanowski et al., 2017) embeddings trained on a 10-year collection of physical and occupational therapy records from the NIH Clinical Center.
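A minimal sketch of this windowed-mean feature computation, assuming embeddings are available as a token-to-vector dictionary (the function name and the zero-vector fallback for out-of-vocabulary tokens are our own illustration, not HARE's actual API):

```python
import numpy as np

def window_mean_features(tokens, embeddings, dim=300, window=10):
    """Mean of each token's embedding and up to `window` neighbors on
    each side, truncated at the bounds of the token list (e.g., a line).
    Unknown tokens fall back to a zero vector (an assumption here)."""
    features = []
    for i in range(len(tokens)):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        vecs = [embeddings.get(t, np.zeros(dim)) for t in tokens[lo:hi]]
        features.append(np.mean(vecs, axis=0))
    return np.stack(features)
```

Truncating the window at line breaks, as described above, is handled by calling this per line rather than over the whole document.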
ELMo (Peters et al., 2018) ELMo features are calculated for each token by taking the hidden states of the two biLSTM layers and the token layer, multiplying each vector by learned weights, and summing to produce a final embedding. Combination weights are trained jointly with the token annotation model. We used a 1024-dimensional ELMo model pretrained on PubMed data [4] for our mobility experiments.

[Figure 2: Precision, recall, and F-2 when varying the binarization threshold from 0 to 1, using ELMo embeddings. The threshold corresponding to the best F-2 is marked with a dotted vertical line.]
BERT (Devlin et al., 2019) For BERT features, we take the hidden states of the final k layers of the model; as with ELMo embeddings, these outputs are multiplied by a learned weight vector, and the weighted layers are summed to create the final embedding vectors. [5] We used the 768-dimensional clinicalBERT (Alsentzer et al., 2019) model [6] in our experiments, extracting features from the last 3 layers.
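The layer-combination step used for both ELMo and BERT features can be sketched as follows. The softmax normalization of the learned weights is an assumption borrowed from the ELMo convention (the text above says only "learned weights"), and the function name is ours:

```python
import numpy as np

def combine_layers(layer_outputs, layer_logits):
    """Weighted sum of per-layer token representations.

    layer_outputs: (k, seq_len, dim) hidden states of the last k layers.
    layer_logits:  length-k vector of weights, trained jointly with the
                   annotation model (softmax-normalized here)."""
    w = np.exp(layer_logits - np.max(layer_logits))
    w = w / w.sum()                                 # softmax over layers
    return np.tensordot(w, layer_outputs, axes=1)   # -> (seq_len, dim)
```

With equal logits this reduces to an unweighted average of the k layers.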

Automated token-level annotation
We model the annotation process of assigning a relevance score to each token using a feedforward deep neural network that takes embedding features as input and produces a binomial softmax distribution as output. For mobility information, we used a DNN with three 300-dimensional hidden layers, ReLU activations, and 60% dropout.
As shown in Table 1, our mobility dataset is considerably imbalanced between relevant and irrelevant tokens. To adjust for this imbalance, for each epoch of training we used all of the relevant tokens in the training documents, and sampled irrelevant tokens at a 75% ratio to produce a more balanced training set; negative samples were re-drawn at each epoch. As token predictions are conditionally independent of one another given the embedding features, we did not maintain any sequence in the samples drawn. Relevant samples were weighted at a 2:1 ratio during training.
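A per-epoch sampling routine consistent with the description above might look like the following. We read "75% ratio" as negatives drawn at 0.75 times the number of positives; that reading, and all names here, are our own assumptions:

```python
import random

def sample_epoch(relevant_idx, irrelevant_idx, neg_ratio=0.75, pos_weight=2.0):
    """Build one epoch's training set: every relevant token, plus
    irrelevant tokens sampled at `neg_ratio` of the relevant count.
    Negatives are re-drawn each epoch; positives carry a 2:1 weight.
    Returns shuffled (token_index, label, weight) triples."""
    n_neg = min(int(len(relevant_idx) * neg_ratio), len(irrelevant_idx))
    negatives = random.sample(irrelevant_idx, n_neg)
    batch = [(i, 1, pos_weight) for i in relevant_idx]
    batch += [(i, 0, 1.0) for i in negatives]
    random.shuffle(batch)   # no sequence order is maintained
    return batch
```

Because predictions are conditionally independent given the features, shuffling the sampled tokens loses nothing.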
After each epoch, we evaluate the model on all tokens in a held-out 10% of the documents, and calculate F-2 score (preferring recall over precision) using 0.5 as the binarization threshold on model output. We use an early stopping threshold of 1e-05 on this F-2 score, with a patience of 5 epochs and a maximum of 50 epochs of training.

[4] https://allennlp.org/elmo
[5] Note that as BERT is constrained to use WordPiece tokenization, it may use slightly longer token sequences than the other methods.
[6] https://github.com/EmilyAlsentzer/clinicalBERT

Post-processing methods
Given a set of token-level relevance annotations, HARE provides three post-processing techniques for analyzing and improving annotation results.
Decision thresholding The threshold for binarizing token relevance scores is configurable between 0 and 1, to support more or less conservative interpretation of model output; this is akin to exploring the precision/recall curve. Figure 2 shows precision, recall, and F-2 at different threshold values from our mobility experiments, using scores from ELMo embeddings.
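The threshold sweep behind this analysis can be sketched as below (helper names are ours; tokens are predicted relevant when their score meets the threshold):

```python
def precision_recall_fbeta(scores, labels, threshold, beta=2.0):
    """Precision, recall, and F-beta for token scores binarized at
    `threshold` (predict relevant iff score >= threshold)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    f = (1 + b2) * p * r / (b2 * p + r) if p + r else 0.0
    return p, r, f

def best_threshold(scores, labels, steps=100, beta=2.0):
    """Scan thresholds in (0, 1] and return the one maximizing F-beta."""
    candidates = [(i + 1) / steps for i in range(steps)]
    return max(candidates,
               key=lambda t: precision_recall_fbeta(scores, labels, t, beta)[2])
```

With beta=2, recall is weighted four times as heavily as precision, matching the F-2 criterion used throughout.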
Collapsing adjacent segments We consider any contiguous sequence of tokens with scores at or above the binarization threshold to be a relevant segment. As shown in Figure 3, multiple segments may be interrupted by irrelevant tokens such as punctuation, or by noisy relevance scores falling below the binarization threshold. As multiple adjacent segments may inflate a document's overall relevance, our system includes a setting to collapse any adjacent segments that are separated by k or fewer tokens into a single segment.
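A minimal sketch of segment extraction and gap-collapsing over binarized token decisions (function names are our own illustration):

```python
def relevant_segments(binary):
    """(start, end) spans, end-exclusive, of contiguous relevant tokens."""
    segments, start = [], None
    for i, b in enumerate(binary):
        if b and start is None:
            start = i
        elif not b and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(binary)))
    return segments

def collapse_segments(binary, k=1):
    """Merge relevant segments separated by k or fewer irrelevant tokens."""
    merged = []
    for start, end in relevant_segments(binary):
        if merged and start - merged[-1][1] <= k:
            merged[-1] = (merged[-1][0], end)   # absorb the gap
        else:
            merged.append((start, end))
    return merged
```

Collapsing in this way keeps a segment interrupted by, say, a single punctuation token from counting twice toward a document's relevance.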
Viterbi smoothing By modeling token-level decisions as conditionally independent of one another given the input features, we avoid assumptions of strict segment bounds, but introduce some noisy output, as shown in Figure 4. To reduce some of this noise, we include an optional smoothing component based on the Viterbi algorithm.
We model the "relevant"/"irrelevant" state sequence discriminatively, using annotation model outputs as state probabilities for each timestep, and calculate the binary transition probability matrix by counting transitions in the training data. We use these estimates to decode the most likely relevance state sequence R for a tokenized line T in an input document, along with the corresponding path probability matrix W, where W_{j,i} denotes the likelihood of being in state j at time i given r_{i-1} and t_i. In order to produce continuous scores for each token, we then backtrace through R and assign score s_i to token t_i as the conditional probability that r_i is "relevant", given r_{i-1}. Let Q_{j,i} be the likelihood of transitioning from state R_{i-1} to j, conditioned on t_i:

Q_{j,i} = P(j | R_{i-1}) * P(j | t_i)

The final conditional probability s_i is calculated by normalizing over the possible states at time i:

s_i = Q_{relevant,i} / (Q_{relevant,i} + Q_{irrelevant,i})

These smoothed scores can then be binarized using the configurable decision threshold.
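A NumPy sketch of this smoothing pass, decoding the state sequence and then re-scoring each token by normalizing over states; the log-space decode, the epsilon for numerical safety, and the function name are our additions, and the re-scoring follows our reading of the normalization step:

```python
import numpy as np

def viterbi_smooth(state_probs, trans):
    """Smooth per-token relevance scores via a Viterbi decode.

    state_probs: (n, 2) model outputs P(state | token), columns
                 ordered (irrelevant, relevant).
    trans:       (2, 2) transition matrix, trans[a, b] = P(b | a),
                 estimated by counting transitions in training data.
    Returns s_i = P(relevant at i | decoded state at i-1)."""
    n = state_probs.shape[0]
    logp = np.log(state_probs + 1e-12)   # log space for stability
    logt = np.log(trans + 1e-12)
    back = np.zeros((n, 2), dtype=int)
    w = logp[0].copy()
    for i in range(1, n):
        cand = w[:, None] + logt         # cand[a, b]: from a into b
        back[i] = cand.argmax(axis=0)
        w = cand.max(axis=0) + logp[i]
    # Backtrace the most likely state sequence R.
    states = np.zeros(n, dtype=int)
    states[-1] = int(w.argmax())
    for i in range(n - 1, 0, -1):
        states[i - 1] = back[i, states[i]]
    # Q_{j,i} = P(j | R_{i-1}) * P(j | t_i), normalized over states j.
    scores = np.empty(n)
    scores[0] = state_probs[0, 1] / state_probs[0].sum()
    for i in range(1, n):
        q = trans[states[i - 1]] * state_probs[i]
        scores[i] = q[1] / q.sum()
    return scores
```

The output remains a continuous score per token, so the configurable decision threshold applies unchanged after smoothing.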

Annotation viewer
Annotations on any individual document can be viewed using a web-based interface, shown in Figure 5. All tokens with scores at or above the decision threshold are highlighted in yellow, with each contiguous segment shown in a single highlight. Configuration settings for post-processing methods are provided, and update the displayed annotations when changed. On click, each token will display the score assigned to it by the annotation model after post-processing. If the document being viewed is labeled with gold annotations, these are shown in bold red text. Additionally, document-level summary statistics and evaluation measures, with current post-processing, are displayed next to the annotations.

Ranking methods
Relevance scoring methods are highly task-dependent, and may reflect different priorities such as information density or diversity of the information returned. In this system, we provide three general-purpose relevance scorers, each of which operates after any post-processing.

Segments+Tokens Documents are scored by multiplying their number of relevant segments by a large constant and adding the number of relevant tokens, so that ties in segment count are broken by token count. As relevant information may be sparse, no normalization by document length is used.
SumScores Documents are scored by summing the continuous relevance scores assigned to all of their tokens. As with the Segments+Tokens scorer, no adjustment is made for document length.
Density Document scores are the ratio of binarized relevant tokens to total number of tokens.
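The three scorers can be sketched as follows (the function names and the particular large constant are illustrative, not HARE's actual values):

```python
def segments_tokens_score(segments, n_relevant_tokens, big=10**6):
    """Segments+Tokens: segment count dominates; relevant-token count
    breaks ties. No length normalization."""
    return len(segments) * big + n_relevant_tokens

def sum_scores_score(token_scores):
    """SumScores: sum of continuous per-token relevance scores,
    again without length adjustment."""
    return sum(token_scores)

def density_score(binary):
    """Density: fraction of tokens binarized as relevant."""
    return sum(binary) / len(binary) if binary else 0.0
```

Note that only Density rewards concentration of relevant material; the other two favor documents with more relevant material in absolute terms.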
The same scorer can be used to rank gold annotations and model annotations, or different scorers can be chosen. Ranking quality is evaluated using Spearman's ρ, which ranges from -1 (exact opposite ranking) to +1 (same ranking), with 0 indicating no correlation between rankings. We use Segments+Tokens as the default; a comparison of ranking methods is in the online supplements.
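Spearman's ρ is available as scipy.stats.spearmanr; a dependency-free sketch of the same computation (Pearson correlation of average-rank vectors; helper names are ours) is:

```python
def _ranks(xs):
    """1-based average ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)
```

Comparing the model ranking's document scores against the gold ranking's scores with this function reproduces the ρ evaluation described above.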

Ranking interface
Our system also includes a web-based ranking interface, which displays the scores and corresponding ranking assigned to a set of annotated documents, as shown in Figure 6. For ease of visual distinction, we include colorization of rows based on configurable score thresholds. Ranking methods used for model scores and gold annotations (when present) can be adjusted independently, and our post-processing methods (Section 3.2) can also be adjusted to affect the ranking.

Qualitative analysis tools
We provide a set of three tools for performing qualitative analysis of annotation outcomes. The first measures lexicalization of each unique token in the dataset with respect to relevance score, by averaging the assigned relevance score (with or without smoothing) for each instance of each token. Tokens with a frequency below a configurable minimum threshold are excluded.
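The lexicalization measure reduces to averaging scores per token type with a frequency cutoff; a minimal sketch (names are our own illustration):

```python
from collections import defaultdict

def token_lexicalization(tokens, scores, min_freq=5):
    """Mean relevance score per unique token type, excluding types
    whose corpus frequency falls below `min_freq`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tok, s in zip(tokens, scores):
        totals[tok] += s
        counts[tok] += 1
    return {t: totals[t] / counts[t]
            for t in totals if counts[t] >= min_freq}
```

Sorting the resulting dictionary by value surfaces the token types the model treats as most (or least) indicative of relevance.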
Our other tools analyze the aggregate relevance score patterns in an annotation set. For labeled data, as shown in Figure 2, we provide a visualization of precision, recall, and F-2 when varying the binarization threshold, including identifying the optimal threshold with respect to F-2. We also include a label-agnostic analysis of patterns in output relevance scores, illustrated in Figure 7, as a way to evaluate the confidence of the annotator. Both of these tools are provided at the level of an annotation set and individual documents.

Implementation details
Our automated annotation, post-processing, and document ranking algorithms are implemented in Python, using the NumPy and TensorFlow libraries. Our demonstration interface is implemented using the Flask library, with all backend logic handled separately in order to support modularity of the user interface.

Results on mobility

Table 2 shows the token-level annotation and document ranking results for our experiments on mobility information. Static and contextualized embedding models performed equivalently well on token-level annotations; BERT embeddings actually underperformed static embeddings and ELMo on both precision and recall. Interestingly, static embeddings yielded the best ranking performance of ρ = 0.862, compared to 0.771 with ELMo and 0.689 with BERT. Viterbi smoothing makes a minimal difference in token-level tagging, but increases ranking performance considerably, particularly for the contextualized models. It also produces a qualitative improvement by trimming extraneous tokens from the start of several segments, as reflected by the improvements in precision.

The distribution of token scores from each model (Figure 7) shows that all three embedding models yielded a roughly bimodal distribution, with most scores in the ranges [0, 0.2] or [0.7, 1.0].

Discussion
Though our system is designed to address different needs from other NLP annotation tools, components such as annotation viewing are also addressed in other established systems. Our implementation decouples the backend analysis from the front-end interface; in future work, we plan to add support for integrating our annotation and ranking systems into existing platforms such as brat. Our tool can also easily be extended to multi-class and multi-label applications; for a detailed discussion, see the online supplements.
In terms of document ranking methods, it may be preferred to rank documents jointly instead of independently, in order to account for challenges such as duplication of information (common in clinical data; Taggart et al. (2015)) or subtopics.
However, these decisions are highly task-specific, and are an important focus for designing ranking utility within specific domains.

Conclusions
We introduced HARE, a supervised system for highlighting relevant information and interactively exploring model outcomes. We demonstrated its utility in experiments with clinical records annotated for narrative descriptions of mobility status. We also provided qualitative analytic tools for understanding the outcomes of different annotation models. In future work, we plan to extend these analytic tools to provide rationales for individual token-level decisions. Additionally, given the clear importance of contextual information in token-level annotations, the static transition probabilities used in our Viterbi smoothing technique likely limit its benefit to the output. Adding support for dynamic, contextualized estimation of transition probabilities will provide more fine-grained modeling of relevance, as well as more powerful options for post-processing.
Our system is available online at https:// github.com/OSU-slatelab/HARE/. This research was supported by the Intramural Research Program of the National Institutes of Health and the US Social Security Administration.