A Prioritization Model for Suicidality Risk Assessment

We reframe suicide risk assessment from social media as a ranking problem whose goal is to maximize the detection of severely at-risk individuals in the time available. Building on measures developed for resource-bounded document retrieval, we introduce a well-founded evaluation paradigm, and we demonstrate on an expert-annotated test collection that an approach that jointly ranks individuals and their social media posts achieves meaningful improvements over plausible cascade model baselines.


Introduction
Mental illness is one of the most significant problems in healthcare: in economic terms alone, mental illness worldwide is projected to cost more by 2030 than cardiovascular disease, and more than cancer, chronic respiratory diseases, and diabetes combined (Bloom et al., 2012). Suicide takes a terrible toll: in 2016 it became the second leading cause of death in the U.S. among those aged 10-34, and the fourth among those aged 35-54 (Hedegaard et al., 2018). Prevalence statistics suggest that roughly 141 of the 3,283 people who attended ACL 2019 have since had serious thoughts of suicide, 42 have made a plan, and 19 have actually made attempts. 1 The good news is that NLP and machine learning are showing strong promise for impact in mental health, just as they are having large impacts everywhere else. Traditional methods for predicting suicidal thoughts and behaviors have failed to make progress for fifty years (Franklin et al., 2017), but with the advent of machine learning approaches (Linthicum et al., 2019), including text analysis methods for psychology (Chung and Pennebaker, 2007) and the rise of research on mental health using social media (Choudhury, 2013), algorithmic classification can now dramatically outstrip the performance of prior, more traditional prediction methods (Linthicum et al., 2019; Coppersmith et al., 2018). Further progress is on the way as the community shows increasing awareness of, and enthusiasm for, this problem space (e.g., Milne et al., 2016; Losada et al., 2020; Zirikly et al., 2019).
The bad news is that moving these methods from the lab into practice will create a major new challenge: identifying larger numbers of people who may require clinical assessment and intervention will increase stress on a severely resource-limited mental health ecosystem that cannot easily scale up. 2 This motivates a reformulation of the technological problem from classification to prioritization of individuals who might be at risk, for clinicians or other suitably trained staff as downstream users.
Perhaps the most basic way to do prioritization is with a single priority queue that the user scans from top to bottom. This "ranked retrieval" paradigm is common for Information Retrieval (IR) tasks such as document retrieval. The same approach has been applied to ranking people based on their expertise (Balog et al., 2012), or more generally to ranking entities based on their characteristics (Balog, 2018). Rather than evaluating categorical accuracy, ranked retrieval systems are typically evaluated by some measure of search quality that rewards placing desired items closer to the top (Voorhees, 2001). Most such measures use only item position, but we find it important to also model the time it takes to recognize desired items, since in our setting the time of qualified users is the most limited resource.
Figure 1: Illustration of an assessment framework in which individuals are ranked by predicted suicide risk based on social media posts, posts are ranked by expected usefulness for downstream review by a clinician, and word-attention highlighting helps foreground important information for risk assessment. Real Reddit posts, obfuscated and altered for privacy.
Our evaluation paradigm builds on Time-Biased Gain (TBG, Smucker and Clarke, 2012), an IR evaluation measure that models the expected number of relevant items a user can find in a ranked list given a time budget. We observe that in many risk assessment settings (e.g., Yates et al. (2017); Coppersmith et al. (2018); Zirikly et al. (2019)), the available information comprises a (possibly large and/or longitudinal) set of documents, e.g. social media posts, associated with each individual, of which possibly only a small number contain a relevant signal. 3 This gives rise to a formulation of our scenario as a nested, or hierarchical, ranking problem, in which individuals are ordered by priority, but each individual's documents must also be ranked (Figure 1). Accordingly, we introduce hierarchical Time-Biased Gain (hTBG), a variant of TBG in which individuals are the top-level ranked items, and expected reading time is modeled for the ranked list of documents that provides evidence for each individual's assessment. In addition, we introduce a prioritization model that uses a three-level hierarchical attention network to jointly optimize the nested ranking task; this model also addresses the fact that in our scenario, as in many other healthcare-related scenarios, relevance obtains at the level of individuals rather than individual documents (cf. Shing et al., 2019). Using a test collection of Reddit-posting individuals who have been assessed for suicide risk by clinicians based on their posts (Shing et al., 2018), we use hTBG to model prioritization of individuals and demonstrate that our joint model substantially outperforms cascade model baselines in which the nested rankings are produced independently.
3 Our dataset, for example, has one severe risk individual with 1,326 postings, of which only two are "signal" posts identified by the experts. See Table 2 for detailed statistics.
Related Work

NLP for Risk Assessment. Calvo et al. (2017) survey NLP for mental health applications using non-clinical texts such as social media. Several recent studies and shared tasks focus on risk assessment of individuals in social media using a multi-level scale (Milne et al., 2016; Yates et al., 2017; Losada et al., 2020). Shing et al. (2018) introduce the dataset we use, and Zirikly et al. (2019) describe a shared task in which 11 teams tackled the individual-level classification that feeds into our prioritization model (their Task B). Our work contributes by modeling the downstream users' prioritization task, taking a key step closer to the real-world problem.
Hierarchical Attention. Attention, especially in the context of NLP, has two main advantages: it allows the network to attend to likely-relevant parts of the input (either words or sentences), often leading to improved performance, and it provides insight into which parts of the input are being used to make the prediction. These characteristics have made attention mechanisms a popular choice for deep learning that requires human investigation, such as automatic clinical coding (Baumel et al., 2018; Mullenbach et al., 2018; Shing et al., 2019). Although concerns exist about using attention for interpretation (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019; Wallace, 2019), Shing et al. (2019) show that hierarchical document attention can align well with human-provided ground truth.
Our prediction model, 3HAN, is a variant of Hierarchical Attention Networks (HAN, Yang et al., 2016). Yang et al. use a two-level attention mechanism that learns to pay attention to specific words in a sentence to form a sentence representation, and at the next higher level to weight specific sentences in a document in forming a document representation. Adapting this approach to suicide assessment of at-risk individuals, our model moves a level up the representational hierarchy, learning also to weight documents to form representations of individuals. This allows us to jointly model ranking individuals and ranking their documents as potentially relevant evidence, without document-level annotations.
Evaluating rankings. There is an extensive IR literature on quality measures for ranked lists (Järvelin and Kekäläinen, 2002;Chapelle et al., 2009;Smucker and Clarke, 2012;Sakai, 2019), which generally reward placing highly relevant items near the top of the list, and are often relatively insensitive to mistakes made near the bottom.
In the setting of suicidality risk assessment, we care about how much gain (number of at-risk individuals found) can be achieved for a given time budget. Time-biased gain (TBG, Smucker and Clarke, 2012) measures this by assuming a determined user working down a ranked list, with the discount being a function of the time it takes to reach that position. However, neither TBG nor other ranking measures, to the best of our knowledge, can measure the hierarchical ranking found in the scenario that motivates our work: ranking items (i.e. individuals) when each item itself contains a ranked list of potential evidence (their posts). In this paper, we design a new metric, hierarchical time-biased gain (hTBG), to measure the hierarchical ranking by incorporating the cascading user model found in Expected Reciprocal Rank (ERR, Chapelle et al., 2009) into TBG.

A Measure for Risk Prioritization
Section 1 argued for formulating risk assessment as a prioritization process where the assessor has a limited time budget. This leads to four desired properties in an evaluation measure: 4

• Risk-based: Individuals with high risk should be ranked above others.
• Head-weighted: Ranking quality near the top of the list, where assessors are more likely to assess, should matter more than near the bottom.
• Speed-biased: For equally at-risk individuals, the measure should reward ranking the one who can be assessed more quickly closer to the top, so that more people at risk can be identified within a given time budget.
• Interpretable: The evaluation score assigned to a system should be meaningful to assessors.

4 Throughout, assessor or user signify a clinician or other human assessor, and individual is someone being assessed.
Among the many rank-based measures that satisfy the risk-based and head-weighted criteria, TBG directly accounts for assessment time in a way that also satisfies the speed-biased criterion (see Theorem 3.1). Furthermore, the numeric value of TBG is a lower bound on the expected number of relevant items (in our case, high-risk individuals) found in a given time budget (Smucker and Clarke, 2012), making it interpretable. After introducing TBG, in Section 3.2 we develop hierarchical Time-Biased Gain (hTBG), an extension of TBG, to account for specific properties of risk assessment using social media posts. 5

Time-Biased Gain
TBG was originally developed in IR for the case of a user seeking to find a relevant document, but here we frame it in the context of risk assessment (Figure 2). TBG assumes a determined user (say a clinician) examining a ranked list of individuals in the order presented by the system. For each individual, the clinician first examines a summary and then decides whether to check relevance via more detailed examination, or to move on. Checking requires more time to make an assessment of whether the individual is indeed at-risk. TBG is a weighted sum of gain, g_k, and discount, D(·), a function of time:

    TBG = \sum_{k=1}^{\infty} g_k \, D(T(k))    (1)

where T(k) is the expected amount of time it takes a user to reach position k:

    T(k) = \sum_{i=1}^{k-1} t(i)    (2)

and t(i) is the expected time spent at position i. Breaking down t(i),

    t(i) = T_s + P_{check}(rel_i) \, E_i    (3)

where T_s is the time it takes to read a summary and decide whether to check the individual; if yes (with probability P_{check}(rel_i)), E_i is the expected time for detailed assessment, calculated as a function of the individual's total word count W_i:

    E_i = T_\alpha W_i + T_\beta    (4)

where T_\alpha and T_\beta scale words to time. The discount function D(t) decays exponentially with half-life h:

    D(t) = \exp(-t \ln 2 / h)    (5)

where h is the time at which half of the clinicians will stop, on average. The expected stop time (or mean-life) is h / \ln 2. Finally, the gain g_k is:

    g_k = P_{check}(rel_k) \, P_{flag}(rel_k)    (6)

where P_{check}(rel_k) is the probability of checking the individual after reading the summary at position k, and P_{flag}(rel_k) is the probability of then flagging that individual as high risk. Gain thus accrues only if a clinician actually finds a high-risk individual. The decay function in Equation 5 monotonically decreases with increasing time (and thus rank), so TBG satisfies the head-weighted criterion. Table 1 shows the parameters used in Smucker and Clarke (2012), which were estimated from user studies using data from the TREC 2005 Robust track.
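For concreteness, the TBG computation can be sketched in a few lines of code. The default parameter values below are illustrative placeholders, not the actual Smucker and Clarke (2012) estimates from Table 1; p_check and p_flag are indexed by binary relevance.

```python
import math

def tbg(relevance, word_counts,
        T_s=4.4, T_alpha=0.018, T_beta=7.8,
        p_check=(0.27, 0.64), p_flag=(0.0, 0.77),
        h=3600.0):
    """Time-Biased Gain for a ranked list of individuals.

    relevance[k]   -- 1 if the individual at rank k is at risk, else 0
    word_counts[k] -- total word count for the individual at rank k
    All times are in seconds; h is the half-life of the decay (Eq. 5).
    """
    gain, elapsed = 0.0, 0.0
    for rel, w in zip(relevance, word_counts):
        g_k = p_check[rel] * p_flag[rel]                  # expected gain g_k at this rank
        discount = math.exp(-elapsed * math.log(2) / h)   # D(T(k)), Eq. 5
        gain += g_k * discount                            # accumulate g_k * D(T(k))
        e_i = T_alpha * w + T_beta                        # detailed assessment time, Eq. 4
        elapsed += T_s + p_check[rel] * e_i               # expected time spent at this rank
    return gain

# A ranking that puts the at-risk individual first scores higher:
better = tbg([1, 0, 0], [200, 500, 300])
worse  = tbg([0, 0, 1], [500, 300, 200])
assert better > worse
```

The key property is visible in the final assertion: the same individuals, merely reordered, yield different scores because the discount grows with elapsed time.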
Of particular interest in a time-limited assessment, we can prove that TBG is speed-biased: Theorem 3.1 (TBG satisfies the speed-biased criterion). Swapping an at-risk individual of longer assessment time ranked at k with an equally at-risk individual of shorter assessment time ranked at k + r, where r > 0, always increases TBG.
Proof. See Appendix B.1

Hierarchical Time-Biased Gain
TBG assumes that detailed assessment involves looking at all available evidence (Equation 4). However, in our setting, an individual may have a large or even overwhelming number of social media posts. One severe risk individual in the SuicideWatch dataset, for example, has 1,326 posts on Reddit, the vast majority of which would provide the assessor with no useful information. We therefore need to prioritize the documents to be read, and a way of estimating when the user will have read enough to make a decision.
In general, clinicians engage in a sensemaking process as they examine evidence, and modeling the full complexity of that process would be difficult. We therefore make two simplifying assumptions: (1) that there is a high-signal document that suffices, once read, to support a positive relevance judgment, and (2) that the clinician will not read more than some maximum number of documents. These assumptions align well with those of Expected Reciprocal Rank (ERR), whose cascading user model assumes that as the user works down a ranked list (in our case, the ranked documents posted by a single individual), they are more likely to stop after viewing a highly relevant document than after viewing an irrelevant one, as their information need is more likely to have been satisfied (Chapelle et al., 2009). This results in a cascade model of user behavior:

    ERR = \sum_{k=1}^{\infty} \frac{1}{k} R_k \prod_{i=1}^{k-1} (1 - R_i)

where R_k is the probability of stopping at position k as a function of relevance. This suggests replacing Equation 4 with the following expected time estimate for detailed assessment of an individual:

    E_i = T_\beta + T_\alpha \sum_{l=1}^{\infty} W_{i,l} \prod_{j=1}^{l-1} (1 - R_{i,j})    (7)

where R_{i,l} is the probability of stopping at the l-th document for individual i, and W_{i,l} > 0 is the cost (in our case, word count) of reading the l-th document for individual i. Note that for the special case of \forall i, l \in N, R_{i,l} = 0, hTBG reduces to TBG. See Figure 3 for an illustration of E_i in hTBG. For the derivation of Equation 7 from ERR's cascading user model, see Appendix B.3.
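Equation 7 can be sketched directly, truncating the sum at the maximum number of documents assumed above; the parameter defaults and the inputs are illustrative.

```python
def expected_assessment_time(stop_probs, word_counts,
                             T_alpha=0.018, T_beta=7.8, max_docs=50):
    """Expected detailed-assessment time E_i under the cascade model (Eq. 7).

    stop_probs[l]  -- R_{i,l}: probability of stopping after reading document l
    word_counts[l] -- W_{i,l}: word count of document l
    Documents are assumed to be in the order the assessor reads them.
    """
    expected_words, p_reach = 0.0, 1.0
    for R, W in zip(stop_probs[:max_docs], word_counts[:max_docs]):
        expected_words += p_reach * W   # document l is read iff no earlier stop occurred
        p_reach *= (1.0 - R)
    return T_beta + T_alpha * expected_words

# With all stopping probabilities zero, Eq. 7 reduces to Eq. 4:
assert expected_assessment_time([0, 0, 0], [100, 200, 300]) == 7.8 + 0.018 * 600
```

Putting a high-signal (high stopping probability) document first shrinks the expected time, which is exactly the behavior hTBG rewards.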

Optimal Values for TBG and hTBG
Calculation of the optimal value for a measure is often important for normalization, though it is not always easy; in some cases it can be NP-hard (Agrawal et al., 2009, ERR-IA). Another popular approach is to normalize by calculating the metric on an ideal collection. For example, Smucker and Clarke (2012) calculate the normalization factor of TBG by assuming a collection with an infinite number of relevant documents, each of which lacks any content. In our case, however, we are actually interested in the optimal value achievable for a given test collection: the optimal values of TBG and hTBG are properties of the bottleneck that occurs due to the user's limited time budget. We find that: Theorem 3.2 (Optimal TBG). The optimal value of TBG under binary relevance is obtained if and only if (1) all at-risk individuals are ranked above not-at-risk individuals, and (2) the at-risk individuals are sorted by time spent in ascending order.
Proof. See Appendix B.1.

Theorem 3.2 makes sense, as any time spent assessing a not-at-risk individual is time not spent assessing other potentially at-risk individuals. Preferring individuals with shorter assessment times also increases the chance of assessing more individuals in the given time budget.
Minimum Individual Assessment Time. To calculate optimal hTBG, we need to minimize individual assessment time. A natural question to ask, then, is whether a result similar to Theorem 3.2 holds for the individual assessment time of hTBG in Equation 7. By swapping paired documents, we can use proof by contradiction to show that:

Theorem 3.3. Minimum individual assessment time is obtained if the documents are sorted in descending order by R_{i,l} / W_{i,l}.

Theorem 3.3 shows a surprisingly intuitive tradeoff between how relevant a document might be and how much time (proportional to word count) the expert needs to read it: highly relevant documents with short reading times are preferred.
Observe that Theorem 3.1 (the speed-biased criterion) and Theorem 3.2 both apply to hTBG, as the two theorems only concern the ranking of individuals, not documents, and hTBG is an extension of TBG that also measures the document ranking. Using Theorem 3.3 and Theorem 3.2, calculating optimal TBG and hTBG values is simply a matter of sorting. For TBG, time complexity is O(n log n), where n ≤ K is the number of at-risk individuals in the test collection. For hTBG, worst-case time complexity is O(n log n + n m log m), where m ≤ L is the maximum number of relevant documents per individual.
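The two sorting steps can be sketched as follows. This is a minimal sketch, assuming (per Theorem 3.3) that documents are ordered by the ratio of stopping probability to reading cost; the tuple-based data structures are hypothetical.

```python
def optimal_order(individuals):
    """individuals: list of (at_risk, assessment_time) pairs.

    Theorem 3.2: rank all at-risk individuals above not-at-risk ones,
    and sort the at-risk individuals by assessment time, ascending.
    """
    return sorted(individuals, key=lambda x: (-x[0], x[1]))

def optimal_doc_order(docs):
    """docs: list of (stop_prob, word_count) pairs for one individual.

    Theorem 3.3: sort documents in descending order of R / W
    (stopping probability per unit of reading cost).
    """
    return sorted(docs, key=lambda d: -d[0] / d[1])

# At-risk individuals first, faster assessments earlier:
assert optimal_order([(1, 900), (0, 10), (1, 100)]) == [(1, 100), (1, 900), (0, 10)]
# A likelier stop wins over an equally long document:
assert optimal_doc_order([(0.1, 100), (0.5, 100)]) == [(0.5, 100), (0.1, 100)]
```

Both functions are O(n log n) in the length of their input, matching the complexity claims above.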

Classification Model
We began by motivating risk assessment via social media as a person-centered, time-limited prioritization problem, in which the technological goal is to support downstream clinicians or other assessors in identifying as many people at risk as possible. This led to the conclusion that systems should not only rank individuals but, for each individual, rank their posts, and we introduced an evaluation framework that involves an abstraction of the user's process of identifying people at risk given a nested ranking.
Next, we need a system that can produce such nested rankings of individuals and their posts. Ideally such a system should be able to train on only individual-level, not document-level, labels, since suicide risk is a property of individuals, not documents, and document labels are more difficult to obtain. In addition, such a system should ideally produce additional information to help the downstream user -if not justification of its output, then at least highlighting potentially useful information.
To address this need, we introduce 3HAN, a hierarchical attention network (Yang et al., 2016) that extends up to the level of individuals, who are represented as sequences of documents. This architecture is similar to the network we proposed in Shing et al. (2019) for coding clinical encounters; it obtained good predictive performance, and we also showed that, despite concerns about the interpretation of network attention (Jain and Wallace, 2019), hierarchical document-level attention succeeded in identifying documents containing relevant evidence. The architecture here differs in that it builds representations hierarchically from the word level, as opposed to pre-extracted conceptual features, and takes document ordering into account using a bi-directional GRU (Bahdanau et al., 2015). Specifically, our model has five layers (Figure 4). The first is a word-embedding layer that turns a one-hot word vector into a dense vector. The second to fourth layers are three Seq2Vec layers with attention that learn to aggregate, respectively, a sequence of word vectors into a sentence vector, a sequence of sentence vectors into a document vector, and a sequence of document vectors into an individual vector (hence 3HAN). The final layer is a fully connected layer followed by a softmax.
We detail our Seq2Vec layer in the context of aggregating a sequence of document vectors into an individual's vector, though the three Seq2Vec layers are the same. See Figure 4b for an illustration. Document vectors {d_{i,j}}_{j=1}^{m} are first passed through a bi-directional GRU layer:

    h_{i,j} = BiGRU(d_{i,j}),  j \in [1, m]    (8)

The outputs, after passing through a fully-connected layer and a non-linearity, are then compared to a learnable attention vector, v_{attention}. Specifically,

    r_{i,j} = \tanh(W_d h_{i,j} + b_d)    (9)

    a_{i,j} = \frac{\exp(r_{i,j}^\top v_{attention})}{\sum_{j'=1}^{m} \exp(r_{i,j'}^\top v_{attention})}    (10)

    u_i = \sum_{j=1}^{m} a_{i,j} h_{i,j}    (11)

where a_{i,j} is the normalized document attention score for the j-th vector, and u_i is the final aggregated individual vector. As shown in Equation 10, the transformed vector r_{i,j} is compared with the learnable attention vector v_{attention} using a dot product, and further normalized for the weighted averaging step in Equation 11.
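Ignoring the BiGRU for brevity, the attention-pooling step (Equations 10 and 11) can be sketched in NumPy. Here W_d, b_d, and v_attention stand in for parameters that would be learned in the real model; they are random for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 16                      # documents per individual, hidden size

h = rng.normal(size=(m, d))       # stand-ins for the BiGRU outputs h_{i,j}
W_d = rng.normal(size=(d, d))     # fully-connected layer (learned in practice)
b_d = rng.normal(size=d)
v_attention = rng.normal(size=d)  # learnable attention vector

r = np.tanh(h @ W_d + b_d)                    # transformed vectors r_{i,j}
scores = r @ v_attention                      # dot product with v_attention (Eq. 10)
a = np.exp(scores) / np.exp(scores).sum()     # softmax-normalized attention weights
u = (a[:, None] * h).sum(axis=0)              # weighted average of documents (Eq. 11)

assert np.isclose(a.sum(), 1.0) and u.shape == (d,)
```

The same pooling is reused at the word and sentence levels; only the inputs and the learned parameters differ.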
Once we have the individual vector u_i, we can predict the risk label of the individual by passing it through a fully-connected layer and a softmax:

    p_i = softmax(W_c u_i + b_c)    (12)

Finally, we compare with the ground truth label y_i of individual i using the negative log-likelihood to calculate a loss:

    L = -\sum_{i} \log p_{i, y_i}    (13)
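A minimal NumPy sketch of this prediction layer and loss; W_c and b_c are hypothetical names for the final layer's learned parameters, and the inputs are random stand-ins.

```python
import numpy as np

def predict_and_loss(u, W_c, b_c, y):
    """Softmax prediction over risk levels and negative log-likelihood loss.

    u   -- individual vector from the last Seq2Vec layer
    y   -- ground-truth risk label (an index into the four classes)
    W_c, b_c -- parameters of the final fully-connected layer
    """
    logits = W_c @ u + b_c
    logits = logits - logits.max()              # shift for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()   # softmax over risk levels
    return p, -np.log(p[y])                     # NLL of the true label

rng = np.random.default_rng(1)
p, loss = predict_and_loss(rng.normal(size=8),
                           rng.normal(size=(4, 8)), rng.normal(size=4), y=3)
assert np.isclose(p.sum(), 1.0) and loss > 0
```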

Experimentation
We first introduce the test collection and then show how we can evaluate 3HAN and the cascade model baselines on the test collection using hTBG.
To demonstrate the effectiveness of the 3HAN model, which jointly learns to rank individuals and, within each individual, their posts as evidence, we compare it with different combinations of individual-level rankers and document-level rankers. Training details for all the models can be found in Appendix C.

Test Collection
In our experimentation, we use the University of Maryland Reddit Suicidality Dataset, v.2 (Shing et al., 2018; Zirikly et al., 2019). 6 This English-language dataset, derived from the 2015 Full Reddit Submission Corpus (2006-2015), includes 11,129 potentially at-risk individuals who posted on r/SuicideWatch (a subreddit dense in self-reports about suicidality, henceforth SW), as well as 11,129 control individuals who never posted on any mental-health related subreddit. Entire posting histories (not just from SW, but from all Reddit forums) were collected. 7 An individual's number of posts ranges from 10 to 1,326. See Table 2 for a detailed breakdown of the number of posts per individual across datasets and risk categories.
The full dataset has three subsets with disjoint individuals. The first, which we term the WEAK SUPERVISION dataset, includes 10,263 individuals who posted in SW and 10,263 control individuals who did not; they are respectively considered to be indirectly positively and negatively labeled, very noisily, since posting on SW does not necessarily imply suicidal ideation. 8 The second is the CROWDSOURCE dataset, including 621 individuals annotated by crowdworkers with four risk levels: No Risk, Low Risk, Moderate Risk, and Severe Risk. The last is the EXPERT dataset, including 242 individuals with the same four-level annotation, by four suicide risk assessment experts. 9 Along with the level of risk for each individual, the expert annotators also designated the single post that most strongly supported each of their low, moderate, or severe risk labels.

Evaluating with hTBG
As TBG and hTBG are measures designed for binary relevance judgements, we map the Severe Risk category to at-risk, and everything else to not-at-risk. 10 For word counts, we directly use the token counts in documents. We use the parameters that Smucker and Clarke (2012) estimated for TBG in user studies (Table 1). As discussed in Section 3.2, we assume there exists a maximum number of documents the clinician can read for each individual.

9 Shing et al. (2018) report reliable expert annotation, Krippendorff's α = .81. The original EXPERT dataset had 245 individuals; we exclude three owing to errors in processing.

10 Since the label definitions distinguish severe from moderate by focusing on the risk of an attempt in the near future, this binary distinction is aligned with recent work in suicidology that focuses specifically on characterizing "the acute mental state that is associated with near-term suicidal behavior" (Schuck et al., 2019).
We set that number to 50 for the calculation of hTBG; if no relevant document exists in the top 50 documents, we consider that individual a miss and set the gain to zero. 11 To rank individuals using our classification models, we use a standard conversion method to turn a four-class probability into a single score:

    score(rel_i) = \sum_{r \in R} P(rel_i = r) \, score(r)    (14)

where R is {No, Low, Moderate, Severe}, and score(rel_i) is the real number that maps to the risk level of individual i. We use {No = 0, Low = 1, Moderate = 2, Severe = 4} as our mapping: No Risk can plausibly be treated the same as a post with no annotation (e.g. a control individual), and exponential scaling also seems plausible, although it is just one of many possibilities, which we leave for future work. The hTBG metric also requires a stopping probability for each document, R_{i,l}. Assuming that the more severe the risk associated with a document is, the more likely the assessor is to stop and flag the individual, on the EXPERT dataset, where we have document-level annotations, we can estimate the expected stopping probability as:

    R_{i,l} = \frac{1}{|C|} \sum_{c \in C} \frac{score(rel_{i,l,c})}{score_{max}}    (15)

where C is the set of annotators, score(rel_{i,l,c}) is a mapping from the document-level risk assigned by annotator c (in particular, those who annotated the post as most strongly supporting their judgment) to a real number, with the same mapping used in Equation 14, and score_max = 4 is the maximum in that mapping.
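Both conversions can be sketched as follows. The exact treatment of annotators who did not flag a given post is our reading of the stopping-probability estimate (they contribute a score of zero), so treat that detail as an assumption.

```python
RISK_SCORE = {"No": 0, "Low": 1, "Moderate": 2, "Severe": 4}
SCORE_MAX = 4

def individual_score(class_probs):
    """Eq. 14: expected risk score from four-class probabilities."""
    return sum(class_probs[r] * s for r, s in RISK_SCORE.items())

def stopping_prob(annotator_labels):
    """Estimated stopping probability for one document, averaging the
    rescaled risk levels assigned by each annotator.  An annotator who
    did not flag this post contributes "No" (score 0) -- an assumption.
    """
    return sum(RISK_SCORE[l] for l in annotator_labels) / (
        SCORE_MAX * len(annotator_labels))

probs = {"No": 0.1, "Low": 0.2, "Moderate": 0.3, "Severe": 0.4}
assert individual_score(probs) == 0.1*0 + 0.2*1 + 0.3*2 + 0.4*4
assert stopping_prob(["Severe", "Severe", "No", "No"]) == 0.5
```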
To reflect different time budgets, we report results with the half-life parameter ranging from 1 to 6 hours, which corresponds to expected reading time budgets from 1.4 to 8.7 hours.

Models for Ranking Individuals
3HAN. 3HAN is first pretrained on the binary WEAK SUPERVISION dataset. The model is then further tuned on the four-class CROWDSOURCE dataset by transferring the weights (except the last fully-connected prediction layer). We initialized and fixed the word embeddings using the 200-dimensional GloVe embeddings trained on Twitter (Pennington et al., 2014). 12

3HAN Av. 3HAN Average is trained the same way as 3HAN, except that the last Seq2Vec layer (the layer that aggregates a sequence of document vectors into an individual vector) averages instead of using attention, which can be achieved by fixing a_{i,j} = 1/m in Equation 10. This is similar to the HN-AVE baseline in Yang et al. (2016). Note that 3HAN AV. cannot rank documents, as it lacks document attention.

LR. A logistic regression model is trained on the CROWDSOURCE dataset. The feature vector for an individual is computed by converting documents into document-level feature vectors, and then averaging them to obtain an individual-level feature vector. For each document, we concatenate four feature sets: (1) bag-of-words for vocabulary with count larger than three, (2) GloVe embeddings summed over words, (3) 194 features representing emotional topics from Empath (Fast et al., 2016), and (4) seven scores measuring document readability. 13 This model is included as a conventional baseline in suicide risk assessment, similar to the baseline found in Shing et al. (2018).

Models for Ranking Documents
3HAN Att. Document attention learned jointly with 3HAN. As a side effect of training our 3HAN model, we learn document attention scores (see Equation 10). These scores can then be used to rank documents in terms of their relevance to the judgement. This availability of document ranking, despite a lack of document annotations, is a significant advantage of hierarchical attention networks, since fine-grained document annotations are difficult to obtain at scale. Sentence- and word-level attention are a further advantage, in terms of potentially facilitating user review (see Figure 1), although exploring that awaits future work.
Forward and Backward. Ranking an individual's documents in either chronological or reverse chronological order is an obvious default in the absence of a trained model for document ranking; these are important baselines for testing whether a document ranking model actually adds value.

Results and Discussion
Our model, 3HAN+3HAN ATT, the only joint model, achieves the best hTBG performance compared to all other combinations of individual rankers and document rankers across three different time budgets (Table 3). The improvement is statistically significant except when compared to 3HAN AV+3HAN ATT. 14 However, using 3HAN ATT to rank documents implies that 3HAN has already been trained. A more reasonable combination to compare against is therefore 3HAN AV+BACKWARD, which we outperform by a significant margin.
Overall, the effect of document ranking is larger than the effect of individual ranking. Notably, the FORWARD document ranker always yields the worst performance. BACKWARD, on the other hand, is surprisingly competitive. We hypothesize that this may be an indication that suicidal ideation worsens over time, or perhaps of the unfortunate event of suicide attempts following the posting of a Severe Risk document. This underscores the importance of prioritizing the reading order of documents: being able to find evidence early in suicide assessment leaves more time for other individuals and reduces the probability of misses. Document ranking alone does not decide everything, as 3HAN+BACKWARD outperforms LR+3HAN ATT. It is the combination of 3HAN and its document attentions that produces our best model. This makes sense, as 3HAN, while learning to predict the level of risk, also learns which documents are important for making the prediction. Figure 1 shows the top 3 documents in a summary-style view for each of the 3 highest-ranked individuals, with word-level attention shown using shading. Words without attention are obfuscated; others are altered to preserve privacy.
Previously Existing Measures. For previously existing measures, e.g. TBG and NDCG@20, document ranking has no effect, and thus they are not suitable measures in our scenario. However, we include results here for reference (Table 4). Since 3HAN AV. and LR cannot rank documents, hTBG cannot be computed for them directly, so we report their results using the chronologically backward ranking strategy. NDCG@20 is the NDCG score cut off at 20, chosen based on the optimal hTBG value.

Conclusions and Future Work
We introduced hTBG, a new evaluation measure, as a step toward moving beyond risk classification to a paradigm in which prioritization is the focus, and where time matters. Like TBG, the hTBG score is interpretable as a lower bound on the expected number of relevant items found in a ranking, given a time budget. In our experiment, a "relevant item" is a person classified by experts as being at risk of attempting suicide in the near future. Measured at an expected reading time budget of about half a day (4hr20min, half-life 3hrs), our joint ranking approach achieved an hTBG of 12.49, compared with 11.70 for a plausible baseline from prior art: using logistic regression to rank individuals and then looking at an individual's posts in backward chronological order. That increase is just a bit short of identifying one more person in need of immediate help in the experiment's population of 242 individuals. There are certainly limitations in our study, and miles to go before validating our approach in the real world, but our framework should make it easy to integrate and explore other individual rankers, document rankers, and explanation mechanisms, and to actually build user interfaces like the schematic in Figure 1.

A Ethical Considerations

Reddit is intended for anonymous posting. However, since anonymity is official policy but not enforced on the site, the dataset has undergone aggressive automatic de-identification using named entity recognition to identify and mask potential personally identifiable information such as personal names and organizations, in order to create an additional layer of protection (Zirikly et al., 2019). In an assessment of de-identification quality, we manually reviewed a sample of 200 randomly selected posts (100 from the SuicideWatch subreddit and 100 from other subreddits), revealing zero instances of personally identifiable information.
Following Benton et al. (2017), we treat the data (even though de-identified) as sensitive and restrict access to it, we use obfuscated and minimal examples in papers and presentations, and we do not engage in linkage with other datasets.
The dataset is available to other researchers via an application process put in place with the American Association of Suicidology that requires IRB or equivalent ethical review, a commitment to appropriate data management, and, since ethical research practice is not just a matter of publicly available data or even IRB approval (Zimmer, 2010; Benton et al., 2017; Chancellor et al., 2019), a commitment to following additional ethical guidelines. Interested researchers can find information at http://umiacs.umd.edu/~resnik/umd_reddit_suicidality_dataset.html.

B.1 Time-Biased Gain
In order to prove that TBG satisfies the speed-biased criterion, consider two individuals ranked at consecutive positions k and k + 1; if we swap the two individuals, the change in the TBG score is

∆TBG = g_{k+1}(D(T(k)) − D(T(k) + t(k))) − g_k(D(T(k)) − D(T(k) + t(k+1)))   (16)

This leads to Lemmas B.1-B.3:

Lemma B.1. Swapping a not-at-risk individual ranked at k with an at-risk individual ranked at k + 1 always increases TBG.
Proof. Let g_k = 0 and g_{k+1} > 0. Equation 16 simplifies to

∆TBG = g_{k+1}(D(T(k)) − D(T(k) + t(k)))   (17)

which is always positive because the decay function monotonically decreases and each assessment of an individual requires at least T_s seconds.
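To make the effect of the swap concrete, here is a minimal numeric sketch (not the paper's code) of TBG under an assumed exponential decay D(t) = 2^(−t/h) with a 3-hour half-life, as used in the experiments; the gains and assessment times are hypothetical. It illustrates Lemma B.1: promoting an at-risk individual above a not-at-risk one increases the score.

```python
def tbg(gains, times, half_life=3 * 3600):
    """TBG = sum_k g(k) * D(T(k)), where T(k) is the time spent before
    reaching rank k and D(t) = 2**(-t / half_life) is an assumed
    exponential decay."""
    total, elapsed = 0.0, 0.0
    for g, t in zip(gains, times):
        total += g * 2 ** (-elapsed / half_life)
        elapsed += t
    return total

# Lemma B.1: promoting the at-risk individual (g=1, 300s assessment)
# above the not-at-risk one (g=0, 600s) increases TBG.
assert tbg([1, 0], [300, 600]) > tbg([0, 1], [600, 300])
```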
Lemma B.2 (Risk-based Criterion). The optimal value of TBG under binary relevance is obtained only if all not-at-risk individuals are ranked below all at-risk individuals.
Proof. Let π be a ranking of individuals that yields the optimal value of TBG, and assume for contradiction that in π some not-at-risk individual is ranked above an at-risk individual. Let k be the lowest-ranked not-at-risk individual that appears above at least one at-risk individual; then the individual at k + 1 must be at-risk (otherwise that individual would be a lower-ranked not-at-risk individual above an at-risk one), and we can apply Lemma B.1 to increase TBG, a contradiction.

Lemma B.3. Swapping an at-risk individual with longer assessment time ranked at k with an at-risk individual with shorter assessment time ranked at k + n, where k + n is the closest at-risk individual ranked below k, always increases TBG.
Proof. Let g_k = g_{k+n} > 0, and g_i = 0 for all i with k < i < k + n. Writing s = Σ_{i=k+1}^{k+n−1} t(i) for the total assessment time of the intermediate not-at-risk individuals, the change in TBG from the swap is

∆TBG = g_k (D(T(k) + t(k+n) + s) − D(T(k) + t(k) + s))   (18)

since the gain collected at rank k itself is unchanged and the intermediate individuals contribute zero gain. This is always positive because the decay function monotonically decreases and t(k+n) < t(k) by the assumption that the individual at k + n has the shorter assessment time.
Lemma B.3 naturally leads to a proof of the speed-biased property of TBG.

Proof of Theorem 3.1. By Lemma B.3, swapping k and k + r yields a positive change in the gain collected from those two individuals. Now consider the at-risk individuals ranked between k and k + r: for every u such that k < u < k + r, the change in the gain collected at u is

∆_u = g_u (D(T(u) − t(k) + t(k+r)) − D(T(u)))   (19)

which is always greater than or equal to zero because the decay function monotonically decreases and t(k + r) < t(k). The net difference is therefore always positive, satisfying the speed-biased criterion.
Finally, combining the previous results, we can easily show:

Proof of Theorem 3.2. A direct consequence of Theorem 3.1 is that if the at-risk individuals are sorted by assessment time in ascending order, no swap between any two of them can increase TBG. Combined with Lemma B.2, which places all at-risk individuals above all not-at-risk individuals, this establishes the necessary condition. Because swaps among the not-at-risk individuals do not change TBG when no at-risk individual is ranked below them, ranking according to Theorem 3.2 yields a unique optimal value, which establishes the sufficient condition of Theorem 3.2.
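Theorem 3.2 can be verified by brute force on a small example. The sketch below (again assuming an exponential decay with 3-hour half-life, with hypothetical gains and assessment times) searches all permutations of five individuals and confirms that the TBG-optimal ranking places the at-risk individuals first, sorted by ascending assessment time.

```python
import itertools

def tbg(items, half_life=3 * 3600):
    """items: (gain, assessment_time) pairs in rank order; assumed decay
    D(t) = 2**(-t / half_life)."""
    total, elapsed = 0.0, 0.0
    for g, t in items:
        total += g * 2 ** (-elapsed / half_life)
        elapsed += t
    return total

# Five hypothetical individuals: (gain, assessment time in seconds).
people = [(1, 900), (0, 300), (1, 300), (0, 1200), (1, 600)]
best = max(itertools.permutations(people), key=tbg)

# At-risk individuals come first, sorted by ascending assessment time;
# the order of the not-at-risk individuals afterward does not matter.
assert all(g == 1 for g, _ in best[:3])
assert [p for p in best if p[0] == 1] == [(1, 300), (1, 600), (1, 900)]
```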

B.2 Hierarchical Time-Biased Gain
The assessment time of an individual ranked at k, t(k), is monotonic in E_i, so it suffices to show that E_i is minimized. Recall that E_i is calculated as

E_i = T_s + T_α Σ_{l=1}^{L} W_{i,l} Π_{j=1}^{l−1} (1 − R_{i,j})   (20)

Consider, again, swapping a document at rank l with the document at rank l + 1 belonging to the same individual i. The change in E_i is

∆E_i = κ_{i,l} (W_{i,l+1} R_{i,l} − W_{i,l} R_{i,l+1})   (21)

where κ_{i,l} = T_α Π_{j=1}^{l−1} (1 − R_{i,j}) ≥ 0 is a fixed term that is not affected by the swap.
Equation 21 also points to an important observation: Lemma B.4. If W i,l+1 R i,l − W i,l R i,l+1 < 0 and R i,j < 1 for all j < l, then swapping document l with document l + 1 will decrease E i .
Proof. This follows directly from Equation 21.
Lemma B.5. If R_{i,j} < 1 for all j, then the minimum individual assessment time is obtained if and only if the documents are sorted in descending order by R_{i,l}/W_{i,l}.

Proof. Let τ be a document ranking that yields the minimum individual assessment time and, for the sake of contradiction, is not a ranking obtainable by sorting in descending order by R_{i,l}/W_{i,l}. We can then find two neighboring documents, without loss of generality l and l + 1, such that

R_{i,l}/W_{i,l} < R_{i,l+1}/W_{i,l+1}   (22)

which leads to

W_{i,l+1} R_{i,l} − W_{i,l} R_{i,l+1} < 0   (23)

since all W > 0. Lemma B.4, together with the prerequisite that R_{i,j} < 1 for all j, then implies that swapping the two documents decreases E_i. This contradicts the assumption that τ is an optimal ranking, and proves that sorting by R_{i,l}/W_{i,l} is necessary to achieve the minimum individual assessment time. The sufficient condition follows from the fact that swapping tied documents does not change E_i, as shown in Equation 21.

Proof of Theorem 3.3. Let τ be a document ranking sorted in descending order by R_{i,l}/W_{i,l}. Let m be the document such that R_{i,m} = 1 and m is ranked closer to the top than any other document with R_{i,:} = 1 (i.e., the one with the smallest W_{i,:}). Now use m to cut the documents into two partitions: the first partition consists of the documents ranked above m. By Lemma B.5, this partition is already in optimal sorted order, since it contains no document with R_{i,:} = 1. For the second partition, the documents ranked below m, the ranking simply does not matter: as Equation 20 shows, the (1 − R_{i,m}) term zeroes out everything after m. Now consider moving a document from the second partition into the first. Since every document in the second partition has a ratio R_{i,j}/W_{i,j} no larger than that of any document in the first partition, the optimal ranking for the first partition would place the moved document at the bottom, immediately above m. And since R_{i,m}/W_{i,m} ≥ R_{i,j}/W_{i,j} by the original ordering, we can apply Lemma B.4 to swap the document back below m without increasing E_i.
Next, consider moving the lowest-ranked document of the first partition (the one ranked at m − 1) into the second partition. This always increases E_i, as shown by Lemma B.4. Moving any other document of the first partition increases E_i at least as much, since the move is equivalent to swapping it past each intermediate document in between, with each swap potentially increasing E_i.
Combining these two results, we have shown that E_i attains its minimum value when the documents are sorted in descending order by R_{i,l}/W_{i,l}.
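The sorting result above can be checked numerically. Below is a minimal sketch (not the paper's implementation) of the expected assessment time E_i in the product form of Equation 20, with arbitrary assumed values for T_s and T_α; a brute-force search over document orderings confirms that sorting by R/W in descending order minimizes E_i.

```python
import itertools

def expected_time(docs, t_s=30.0, t_alpha=0.25):
    """E_i = T_s + T_alpha * sum_l W_l * prod_{j<l} (1 - R_j), for docs given
    as (R, W) pairs in rank order; the T_s and T_alpha values are arbitrary."""
    total, keep_reading = 0.0, 1.0
    for r, w in docs:
        total += keep_reading * w
        keep_reading *= 1.0 - r
    return t_s + t_alpha * total

# Four hypothetical documents (stopping probability R, word count W), all R < 1.
docs = [(0.2, 120), (0.7, 80), (0.1, 300), (0.5, 50)]
best = min(itertools.permutations(docs), key=expected_time)

# Lemma B.5: the minimizing order sorts by R/W, descending.
assert list(best) == sorted(docs, key=lambda d: d[0] / d[1], reverse=True)
```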

B.3 Relationship between ERR and hTBG
Here we show the derivation from the cascade user model in ERR to the individual assessment time estimation (E_i) in hTBG. ERR assumes a stopping probability (written in hTBG terms)

P(stop at l) = R_{i,l} Π_{j=1}^{l−1} (1 − R_{i,j})

The expected number of words read can then be calculated as

Σ_{l=1}^{L} (Σ_{m=1}^{l} W_{i,m}) R_{i,l} Π_{j=1}^{l−1} (1 − R_{i,j}) = Σ_{l=1}^{L} W_{i,l} Π_{j=1}^{l−1} (1 − R_{i,j})

by letting R_{i,L} = 1 (the user has to stop reading at the last document). To show this, observe that W_{i,1} appears in all L terms of the summation, so its coefficient is simply Σ_{l=1}^{L} R_{i,l} Π_{j=1}^{l−1} (1 − R_{i,j}) = 1, both by simple manipulation and by the fact that we are summing over a probability distribution. Similarly, W_{i,2} appears in all terms except the one with l = 1, so its coefficient is 1 − R_{i,1}. For W_{i,3} it is (1 − R_{i,1}) − R_{i,2}(1 − R_{i,1}) = Π_{j=1}^{2} (1 − R_{i,j}). The rest follows.
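The identity can also be confirmed numerically. This sketch computes the expected words read both ways, in cascade form (cumulative words weighted by the stopping probability at each rank) and in the simpler product form, with the last stopping probability forced to 1; the hypothetical R and W values are arbitrary.

```python
import math

def expected_words_cascade(rs, ws):
    """sum_l (sum_{m<=l} W_m) * R_l * prod_{j<l} (1 - R_j)."""
    total, keep, read = 0.0, 1.0, 0.0
    for r, w in zip(rs, ws):
        read += w
        total += read * r * keep
        keep *= 1.0 - r
    return total

def expected_words_simple(rs, ws):
    """sum_l W_l * prod_{j<l} (1 - R_j)."""
    total, keep = 0.0, 1.0
    for r, w in zip(rs, ws):
        total += w * keep
        keep *= 1.0 - r
    return total

rs = [0.3, 0.6, 0.2, 1.0]  # last stopping probability set to 1
ws = [100, 40, 250, 80]
assert math.isclose(expected_words_cascade(rs, ws), expected_words_simple(rs, ws))
```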

C Appendix: Training Details
All models are built using AllenNLP (Gardner et al., 2018). Tokenization and sentence splitting are done using spaCy (Honnibal and Johnson, 2015).
The CROWDSOURCE dataset is split into a training set (80%) and a validation set (20%) during model development. We did not test on the EXPERT dataset until all parameters of the models were fixed. Cross-validation on the training set is used for hyperparameter tuning. For 3HAN, we used Adam with learning rate 0.003, trained for 100 epochs with early stopping on the validation set, with patience set to 30. For 3HAN AV, the same hyperparameters are used. For LR, we used SGD with learning rate 0.003, trained for 100 epochs with early stopping on the validation set, with patience set to 30.
Both 3HAN and 3HAN AV's Seq2Vec layers use bi-directional GRUs with attention. The word-to-sentence layer has an input dimension of 200, a hidden dimension of 50, and an output dimension of 100, since the GRU is bi-directional. The sentence-to-document and document-to-individual layers, similarly, have an input dimension of 100, a hidden dimension of 50, and an output dimension of 100. Hyperparameters were selected using cross-validation on the training-set split of the CROWDSOURCE dataset.
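For concreteness, here is a minimal PyTorch sketch, not the authors' AllenNLP implementation, of one such Seq2Vec layer: a bi-directional GRU followed by additive attention pooling. The class name and attention parameterization are assumptions; only the dimensions (input 200, hidden 50 per direction, output 100) follow the text.

```python
import torch
import torch.nn as nn

class BiGRUAttentionSeq2Vec(nn.Module):
    """Hypothetical word-to-sentence Seq2Vec layer: bi-GRU + attention pooling."""

    def __init__(self, input_dim=200, hidden_dim=50):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True,
                          bidirectional=True)
        # One attention score per position over the 2*hidden GRU outputs.
        self.attn = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):                             # (batch, seq_len, input_dim)
        h, _ = self.gru(x)                            # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # (batch, seq_len, 1)
        return (weights * h).sum(dim=1)               # (batch, 2*hidden) == 100

layer = BiGRUAttentionSeq2Vec()
out = layer(torch.randn(4, 12, 200))  # 4 sentences of 12 tokens each
assert out.shape == (4, 100)
```

The sentence-to-document and document-to-individual layers would be instances of the same pattern with `input_dim=100`.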