Writing Strategies for Science Communication: Data and Computational Analysis

Communicating complex scientific ideas without misleading or overwhelming the public is challenging. While science communication guides exist, they rarely offer empirical evidence for how their strategies are used in practice. Writing strategies that can be automatically recognized could greatly support science communication efforts by enabling tools to detect and suggest strategies for writers. We compile a set of writing strategies drawn from a wide range of prescriptive sources and develop an annotation scheme allowing humans to recognize them. We collect a corpus of 128K science writing documents in English and annotate a subset of this corpus. We use the annotations to train transformer-based classifiers and measure the strategies' use in the larger corpus. We find that the use of strategies, such as storytelling and emphasizing the most important findings, varies significantly across publications with different reader audiences.


Introduction
Communicating scientific discoveries to a general audience of readers is difficult. A researcher or writer interested in doing so is faced with the challenging task of translating complex scientific ideas in an engaging manner without misleading or overwhelming their audience. There are many guides to science communication (e.g., Blum et al., 2006), but they rarely offer empirical evidence for how their advice is used, or proven effective, in practice. The potential science communicator is then confronted with the additional hurdle of understanding how to implement these guidelines in their writing.
Effective science communication requires understanding the unique needs and expectations of different audiences and stakeholders in science (Nisbet and Scheufele, 2009). We envision natural language processing technologies that help science writers communicate more effectively. These technologies might automatically classify common strategies in a writer's own text, help writers adapt their language to specific readers, or guide readers through personalized article recommendations.
As a first step, we compile a set of strategies from a wide range of prescriptive science writing sources in English and develop an annotation scheme allowing humans to recognize these strategies in texts about science. We introduce a new corpus of 128K university press releases, science blogs, and science magazines and annotate a subset of 337 texts. We use the annotations to train transformer-based classifiers to explore the communicative goals of science writing by analyzing variations in the strategies' use across several scientific communication forums.
Our paper is the first computational analysis of writing strategies driven by science communication theory. We find that most strategies are prevalent throughout our corpus and that publication venues with varying audiences use strategies differently. For example, press releases emphasize the impacts of science more than magazine articles, which instead tell more stories about the science. We also find that higher quality newspaper articles, as rated by expert journalists, use more storytelling and analogies than lower quality articles.

Defining Science Communication Writing Strategies
The goal of general science communication is to increase public awareness, enjoyment, interest, and understanding about science (Burns et al., 2003). Based on the idea of compositionality in discourse theory (Bender and Lascarides, 2019), we can think of the communicative intent of science writing as being made up of smaller communication goals represented in particular passages of an article (Grosz and Sidner, 1986; Louis and Nenkova, 2013a). Our computational approach builds on this theoretical assumption by annotating sentences and letting an article inherit the attributes we find in its sentences. Past work on science communication has taken a similar view (Louis and Nenkova, 2013a) by using syntactic relations to characterize an article's communicative goals, allowing them to emerge inductively rather than from a theory of science communication. Our complementary approach starts with science communication guides to construct theory-driven communicative goals (referred to as "writing strategies" and consisting of lexical to multi-sentence features), and explores their use in a diverse range of science communication text.
To define our writing strategies, we categorized and grouped advice from style guides for science communicators. These guides were a mix of online resources, books, and academic articles (see Table 5 in the appendix). We selected the guides based on discussions with three expert science communicators at a large research university's press department and through online searches. We stopped adding resources when we reached saturation (Holton, 2007), meaning that each new resource had fewer new strategies and suggesting that our resources provided good coverage.
Two authors open-coded (Holton, 2007) the suggestions from each guide by assigning each piece of advice in a resource a code that represented its high-level strategy (such as "avoid jargon"). The authors then looked at other resources to see whether the same advice appeared there. Each new piece of advice was added with a new code. After coding all resources, the authors grouped the codes into a set of 10 suggested writing strategies. Appendix A.1 provides additional details on the coding and categorization. The strategies are as follows (examples of each are given in Table 9 in the appendix):
LEDE A few sentences at the beginning of an article, called a lede (spelled "lede" for easier differentiation from its homograph "lead"), that draw a reader in and make them want to read more.
MAIN Sentences describing the main findings being reported by the original paper in order to not overwhelm the reader with details.
IMPACT Writing about the real-world impact of the science or findings being reported in order to excite readers. This can include future technologies, breakthroughs the findings might enable, or their societal implications.
EXPLANATION Explanations about scientific subjects to improve reader understanding. This could be explaining a certain topic or word, or what researchers did in a study and what the findings mean.
ANALOGY Analogies or metaphors used as a way to explain concepts or make ideas in the article more relatable.
STORY Stories to engage readers and make the reported science more interesting. This can include short story snippets, or coming back to an underlying story throughout an article.
PERSONAL Including personal details about researchers in order to make them more approachable and add depth to the story.
JARGON Avoiding specialized terminology or jargon as much as possible as it can overwhelm readers.
ACTIVE Using the active voice to make the writing more lively and engaging.
PRESENT Similar to ACTIVE, using present tense verbs also to make the writing more lively and engaging.
Some of these strategies are specific to science writing, such as emphasizing the real world impact of the findings (IMPACT), while others are often thought of as general rules for good writing, such as using the active voice (ACTIVE). Both types of strategies were commonly referenced in the resources we analyzed, which suggests that engaging science writing shares traits with engaging writing in other disciplines while also containing its own set of unique strategies.

Dataset
In order to study the use of these strategies by science writers and build classifiers for automatic identification, we collected a corpus of 128K documents from a variety of science communication sources. We focused on four major types of U.S.-based venues, representing a broad spectrum of science communication for different audiences: blog sites, popular science magazines, university press releases, and scientific journal magazines. Past work has shown that blog sites usually write for scientifically literate and engaged readers (Ranger and Bultitude, 2016), while university press releases often write for other science journalists (Bratton et al., 2019; Sumner et al., 2014). We selected popular science magazines since they target a more general audience, and scientific journal magazines as they often write for those involved in research, though not necessarily in the same domain (Nielsen and Schmidt Kjaergaard, 2011). The choices of website or publication we collected from each venue category were based either on previous work covering those categories (e.g., blog posts; Vadapalli et al., 2018) or on a convenience sample of what was widely available. Note that while past work has used the blog sites we selected (sciencedaily.com and phys.org) as sources of high-quality science blogs, these sites also source a large portion of their content from press releases, often changing only headlines and lede sentences.
We scraped articles from each of these sources for all of 2016-2019 using the Wayback Machine, resulting in 137,828 articles. Appendix A.2 provides more details on site selection.

Filtering
To focus specifically on science communication, we removed articles matching U.S.-centric political keywords such as Trump, democrats, and Senate. Appendix A.3 lists all filter keywords. We also removed all articles over 15,000 or under 1,500 characters, since these represented either multiple articles on the same page, article previews, or scraper errors. After filtering we had a total of 128,253 articles. In total 7% of documents were filtered (3.5% removed for political keywords and 3.5% for length). Table 1 details the sites for each venue and the number of articles after filtering. URLs for all scraped articles are available at https://github.com/talaugust/scientific-writing-strategies.
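The filtering step can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it uses the keyword list and the rule from Appendix A.3 (a keyword in the title and at least 4 keywords in the body) together with the length limits above. The function names, the whole-word matching with an optional plural "s", the counting of distinct keywords, and the measurement of length on the body text are all assumptions.

```python
import re

# Keywords listed in Appendix A.3 (all lower-cased).
POLITICAL_KEYWORDS = [
    "trump", "president", "republican", "refugee", "congress", "country",
    "obama", "senate", "white house", "democrat", "political", "epa",
    "attorney", "politics",
]

# Assumption: whole-word matching with an optional plural "s", so that
# "democrats" matches "democrat" but "separate" does not match "epa".
KEYWORD_PATTERNS = [
    re.compile(r"\b" + re.escape(keyword) + r"s?\b")
    for keyword in POLITICAL_KEYWORDS
]

MIN_CHARS, MAX_CHARS = 1_500, 15_000  # length limits from the text


def is_political(title, body, min_body_keywords=4):
    """Title contains any keyword AND body contains >= 4 distinct keywords."""
    title_l, body_l = title.lower(), body.lower()
    title_hit = any(p.search(title_l) for p in KEYWORD_PATTERNS)
    body_hits = sum(1 for p in KEYWORD_PATTERNS if p.search(body_l))
    return title_hit and body_hits >= min_body_keywords


def keep_article(title, body):
    """Return True if the article survives both the length and keyword filters."""
    if not MIN_CHARS <= len(body) <= MAX_CHARS:
        return False
    return not is_political(title, body)
```

Requiring both a title match and several body matches keeps the filter conservative: a science article that mentions a single agency or politician in passing is retained.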

Annotation
Recall that our goal is to measure the use of strategies from Section 2 in our corpus. We sample 337 articles stratified across sites to gather a spread of articles and balance the articles across venues. Each article was given to two annotators who were trained on the writing strategies and instructed to annotate the use of strategies at the sentence level. Concretely, each "annotation" corresponds to a contiguous chunk of one or more sentences labeled with one of the seven strategies. A sentence can be labeled with multiple strategies. Three of the strategies, JARGON, ACTIVE, and PRESENT, were not annotated because we believe they can be reliably detected using existing methods based on lexical and syntactic features; see Section 5.1. Figure 5 (in the appendix) presents an example of an annotated excerpt and the task interface.
We conducted annotation in sets of 50 articles. After each set, one author measured agreement and manually evaluated a subset of annotations by both annotators. This author then acted as a coordinator for the annotators, providing suggestions or revisions to annotation guidelines. Additionally, annotators were able to look at the other's annotations after completing an article. Figure 4 in the appendix plots Krippendorff's α after each batch of 50 articles.

Annotator Agreement
While our strategies emerged from prescriptive advice in guides, our annotators had to interpret these strategies in the context of real-world scientific writing. This, to our knowledge, has not been done systematically before, though we suspect editors do it frequently. Because writing strategies are somewhat subjective and we wanted our categorizations to be flexible to the different good-faith interpretations of each strategy, we were not aiming to obtain perfect agreement on each strategy. Table 2 reports on simple agreement rates for the annotated strategies. Previous work annotating spans of text for communicative goals, such as framing (Card et al., 2015), propaganda techniques (Da San Martino et al., 2019), hate speech (Sap et al., 2020), statement strength (Tan and Lee, 2014), and agency (Sap et al., 2017), has shown that reaching high agreement is difficult. While agreement measures differ across these annotation tasks due to differences in how spans were annotated (e.g., preselected sentences or open selection), past work has reported Krippendorff's α agreement levels between 0.3 and 0.7; our agreement falls at the lower to moderate end of this range (0.3 < α < 0.5). We discuss the use of α in Appendix A.4, where Table 6 reports α agreement for each strategy.
The annotators identified a total of 10,843 sentences (316,263 tokens) with one of the seven strategies in 337 articles. Table 3 details the number of sentences and average number of words for each strategy annotation span.

Abandoned Strategies
Two categories that achieved low agreement (as measured by α) were EXPLANATION and PERSONAL, which we drop from further analyses for the following reasons. We found that annotator 1 (a1) annotated many more EXPLANATION strategies than a2 (3,402 vs. 1,275). While the majority of a2's EXPLANATION annotations agreed with a1's (886 out of 1,275), a2 highlighted fewer in general, suggesting that this lower agreement was due to the annotators having different thresholds for the EXPLANATION strategy.
For PERSONAL, our discussions with both annotators revealed that a2 had assumed PERSONAL strategies were any reference to the researchers in an article (e.g., "Professor X, head of the mutation lab at the Academy") while a1 focused on personal aspects of a researcher (e.g., "Professor X, who has been in a wheelchair since birth"). While we tried to resolve this difference early on, annotators still had difficulty reaching agreement on PERSONAL strategies. This discrepancy lowered agreement but highlighted an interesting nuance in the PERSONAL strategy. Because both strategies are still important for science communication, we report on their use in the corpus in Figures 6 and 7 in the appendix.

Hypotheses
Our annotated corpus allowed us to begin to explore how strategies relate to the communicative goals of different science communication venues. To do this, we introduce hypotheses informed by existing literature.
Hypotheses H1-H4 are based on our expectations for how strategies can differentiate science writing venues in our corpus based on their underlying goals. For these hypotheses, we evaluate strategy use across our corpus.
Hypotheses H5 and H6 focus on how strategies might relate to other important issues in science communication. They build on past research in science communication exploring article quality (Louis and Nenkova, 2013b) and sensationalism (Sumner et al., 2014). These issues are introduced with their own annotated datasets, and since we have no strategy annotations for these datasets, we report only on the aggregated predictions of our classifiers.
H1: LEDE is used once or twice within an article, but consistently across our entire corpus. Because the LEDE strategy is well adopted as a common strategy in general journalism (Pöttker, 2003), and LEDE sentences are only used at the beginning of an article, we expect low but consistent use of LEDE across the corpus.
H2: Press releases use higher IMPACT than other venues. One goal of press releases is to encourage other science writers to pick up a story. Because a key component to selecting stories is impactful findings (Hayden et al., 2013), we expect that press releases will emphasize this more.
H3: Magazines use lower JARGON, higher ACTIVE and PRESENT and higher STORY than other venues. Magazines target a broader readership compared to other venues, making it likely they use these strategies that are common in prescriptive guides for general good writing (Strunk, 2007) to relate to a wider audience.
H4: Blog sites use higher JARGON and MAIN, and lower IMPACT compared to other sites. Blog sites predominantly focus on other science educated or interested readers (e.g., phys.org reports a readership of "5 million scientists, researchers, and engineers every month"; see https://sciencex.com/help/about-us/). This suggests that blogs' focus is less on attracting a broad readership (higher JARGON) or encouraging news uptake (lower IMPACT) and more on communicating the main point of a journal article (higher MAIN).
H5: Higher quality science news articles use more STORY and ANALOGY than lower quality articles. Past work on science news quality has suggested that features related to storytelling and figurative language (e.g., coherence and descriptive language) are associated with higher quality articles (Louis and Nenkova, 2013b).
H6: Press releases that sensationalize the claims of the original journal paper use higher IMPACT and MAIN than press releases that do not. While emphasizing the impact of findings is a useful tool in engaging readers, work has shown that press releases will sometimes sensationalize the claims of the original journal article (Sumner et al., 2014). Sumner et al. (2014) categorized sensationalism into three categories: exaggerated advice (suggesting actions the original paper did not), exaggerated causal claims (making causal claims the paper did not), and inference to humans from animal research. These categories most closely relate to our strategies on the findings of a paper, IMPACT and MAIN, and we hypothesize that overuse of these strategies is related to sensationalism.

Strategy Classification
We used our collected corpus and annotations to automate recognition of writing strategies and to evaluate our hypotheses. We describe our methods for classifying strategies with rules and with human annotations (Sections 5.1 and 5.2). We then discuss methods for using these classifiers to estimate the use of strategies in our corpus and overall classifier performance (Section 5.3).

Rule-Based Strategies
As discussed earlier, three of the strategies could be reasonably identified using rules and were not annotated.
JARGON We used common science jargon word lists drawn from Rakedzon et al. (2017) and Gardner and Davies (2013) to detect jargon use. The word list from Rakedzon et al. (2017) consists of 2,949 words common in scientific journal abstracts and articles while rare in common usage. We augmented this list with the core Academic Vocabulary List (AVL, Gardner and Davies, 2013). The AVL is a list of the top 3,000 word lemmas based on 120 million words of academic texts from the Corpus of Contemporary American English (Davies, 2009). High JARGON means higher use of these specialized terms, which is negatively associated with the strategy (i.e., since the recommended strategy is to avoid specialized terms).
ACTIVE We identified active and passive clauses by counting the 'nsubj' and 'nsubj:pass' words from a parse of each article using Stanford NLP's dependency labels in the Stanford NLP Pipeline. We normalized all active clauses by the number of verbs in an article.
PRESENT For measuring present tense, we normalized the number of present tense verbs using Stanford NLP's Universal Features (similar to POS tags and part of the same Stanford NLP Pipeline; Manning et al., 2014) over all verbs in an article.
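As a rough illustration of the three rule-based measures, the sketch below computes them from an already-parsed article. It assumes each token is a dictionary with CoNLL-U style fields (`lemma`, `upos`, `deprel`, `feats`) such as a Universal Dependencies parser would produce; the function name, the matching of the jargon list on lemmas, and the choice to count only `VERB` tokens (not auxiliaries) as verbs are assumptions rather than details given in the text.

```python
def rule_based_rates(tokens, jargon_lemmas):
    """Compute JARGON, ACTIVE and PRESENT rates for one parsed article.

    tokens: list of dicts with keys "lemma", "upos", "deprel", "feats"
            (feats is a "|"-separated string such as "Tense=Pres|VerbForm=Fin").
    jargon_lemmas: set of lower-cased jargon words (hypothetical stand-in
            for the combined word lists described above).
    """
    words = [t for t in tokens if t["upos"] != "PUNCT"]
    # Assumption: only main verbs count as "verbs" for normalization.
    verbs = [t for t in words if t["upos"] == "VERB"]

    jargon_count = sum(t["lemma"].lower() in jargon_lemmas for t in words)
    # An 'nsubj' dependent marks an active clause ('nsubj:pass' marks a passive one).
    active_count = sum(t["deprel"] == "nsubj" for t in words)
    present_count = sum("Tense=Pres" in t["feats"].split("|") for t in verbs)

    n_words, n_verbs = len(words), len(verbs)
    return {
        "JARGON": jargon_count / n_words if n_words else 0.0,
        "ACTIVE": active_count / n_verbs if n_verbs else 0.0,
        "PRESENT": present_count / n_verbs if n_verbs else 0.0,
    }
```

Keeping the parser outside the function means the same rates can be computed from any tool that emits Universal Dependencies labels and features.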

Sentence Classifiers
For the remainder of our strategies, we trained classifiers based on the annotations collected to estimate the prevalence of each strategy in our corpus. Each classifier takes a single sentence as input and provides a binary label (present or absent) for a given strategy. Apart from pretraining, the classifiers were trained separately for each strategy. We base our classifiers on RoBERTa (Liu et al., 2019) as it is a high-performing contextual word representation learner that has achieved state-of-the-art results on multiple NLP benchmarks, and which comes pretrained. We use Huggingface's RoBERTa implementation. We start by continuing to pretrain RoBERTa on additional in-domain text to tailor the model more closely to our task. This additional pretraining followed two phases as in Gururangan et al. (2020): pretraining on 11.90M general news articles from REALNEWS (Zellers et al., 2019) for 12.5K steps (domain-adaptive pretraining), and then pretraining on a held-out subset of 100K documents from the unannotated portion of our own corpus for 10 epochs (task-adaptive pretraining). Appendix A.5 includes details for both pretraining steps.
Finally, we finetune our pretrained RoBERTa model on each sentence-level classification task separately, making 5 binary classifiers (LEDE, MAIN, IMPACT, ANALOGY, STORY). Our pretrained RoBERTa models were finetuned using an 80%, 10%, 10% train, validation, test setup using all annotated articles. We split randomly at the article level, so sentences from the same article never appear in more than one set. Appendix A.6 includes additional finetuning details.
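An article-level split of this kind might look like the following minimal sketch. The function name, the `article_id` field, and the fixed random seed are illustrative assumptions; the 80/10/10 proportions come from the text.

```python
import random


def split_by_article(sentences, train_frac=0.8, val_frac=0.1, seed=0):
    """Split annotated sentences into train/validation/test by article.

    sentences: list of dicts, each with an "article_id" key (hypothetical
    field name). All sentences of an article land in the same split.
    """
    article_ids = sorted({s["article_id"] for s in sentences})
    rng = random.Random(seed)
    rng.shuffle(article_ids)

    n_train = int(train_frac * len(article_ids))
    n_val = int(val_frac * len(article_ids))

    # Map each article to exactly one split; the remainder goes to test.
    split_of = {}
    for i, article_id in enumerate(article_ids):
        if i < n_train:
            split_of[article_id] = "train"
        elif i < n_train + n_val:
            split_of[article_id] = "validation"
        else:
            split_of[article_id] = "test"

    splits = {"train": [], "validation": [], "test": []}
    for sentence in sentences:
        splits[split_of[sentence["article_id"]]].append(sentence)
    return splits
```

Splitting on articles rather than sentences avoids leakage: neighboring sentences from one article often share vocabulary and topic, which would otherwise inflate test scores.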
Using classifiers optimized for individual classification can lead to biases when estimating category proportions (Hopkins and King, 2010). Past work has suggested that using a well-calibrated classifier leads to better proportion estimation in large unlabeled corpora (Card and Smith, 2018). Calibration refers to the long-run accuracy of predicted probabilities: a probabilistic classifier is well calibrated at the level µ if, among the instances for which it predicts class k with probability µ, the proportion that actually belong to k is also µ. Following Card and Smith (2018), we perform model selection based on calibration error on held-out data during hyperparameter tuning. We estimate calibration error using the adaptive binning procedure from Nguyen and O'Connor (2015). After picking our most well-calibrated classifiers, we measure the rate of each strategy across a collection of documents by averaging the classifiers' predicted posterior probabilities of a positive label. This is referred to as Probabilistic Classify and Count (PCC; Bella et al., 2010) and is a standard method for predicting label distributions in a corpus using a probabilistic classifier (Card and Smith, 2018).

Evaluation
We evaluate our classifiers in two ways: on individual examples (i.e., for reporting F1), and in aggregate on a held-out annotated test set.
Our goal for the classifiers is to estimate aggregated proportions in our corpus, not to achieve perfect performance. For this reason, we report on classifier F1 performance only to establish that the classifiers are reasonably able to detect strategies. Table 4 details the precision, recall, F1 scores, average calibration error, and accuracy of the trained classifiers, along with baseline accuracies for the most frequent class predictions for each strategy. Our classifiers achieved F1 scores between .40 and .55, which is comparable to other classifications of communicative goals, such as propaganda technique detection.
Because we are most interested in estimated proportions, we also compared the classifiers' predicted strategy rates in our held-out test set with the actual rates of the annotated strategies. Actual strategy rate is calculated as the number of sentences containing a strategy divided by the total number of sentences in the test set. Figure 1 illustrates this comparison. While we do see some discrepancies between actual rates and our predicted rates, these differences are small (< .05 rate difference, or less than 5% of sentences) and the trend of each strategy remains the same (e.g., STORY and MAIN are the most common, LEDE and ANALOGY are the least), suggesting that the classifiers estimate strategy rates with sufficiently high accuracy to begin comparing rates across strategies. Figure 6 in the appendix compares the predicted rates of strategies in the full dataset with the actual rates of strategies in the test set, broken down by site.
We additionally evaluated how accurate our automatic measures for JARGON, ACTIVE, and PRESENT were by randomly sampling 5 sentences from the top and bottom 10% of articles containing JARGON, ACTIVE, and PRESENT (as measured by our rule-based approaches) and manually inspecting them for correct word classifications. The rules for each measure are in line with our intuitions about JARGON, ACTIVE, and PRESENT, with a large majority of words (over 80% in the 30 sentences evaluated) being identified correctly as jargon, active voice, or present tense. Table 10 in the appendix provides examples of this evaluation.

Evaluating Strategy Applications
Evaluating our classifier output against gold-standard human annotations, as reported in Section 5.3, establishes the validity of our classifiers. We next turn to our hypotheses introduced in Section 4 to illustrate how we can use the strategies, classifiers, and corpus to explore the communicative goals of science writing. We introduce each hypothesis and report on its results separately.
H1: LEDE is used once or twice within an article, but consistently across our entire corpus. Figure 2a plots the estimated number of LEDE sentences per article across each site. Supporting H1, the majority of sites peak at either 0 or 1 LEDE sentences, with all sites tapering off quickly after that. theatlantic.com does have a higher number of predicted LEDE sentences (with 20% of articles containing more than 2 sentences). This might be due to theatlantic.com articles being longer (since they are full magazine articles) and therefore using more text to entice readers.
H2: Press releases use higher IMPACT than other venues. We find support for H2: press release sites like news.harvard.edu, rochester.edu, and news.stanford.edu have larger modes than other sites for IMPACT sentences in Figure 2c. For example, close to 15% of articles in rochester.edu have over 5 estimated IMPACT sentences, compared to 7 or 8% of articles in scientificamerican.com or theatlantic.com having that same number. This is especially striking because scientificamerican.com and theatlantic.com generally have much longer texts, since they are full magazines, compared to press releases.
H3: Magazines use lower JARGON, higher ACTIVE and PRESENT and higher STORY than other venues. Texts from theatlantic.com and scientificamerican.com, the two magazine sites, had the lowest and third lowest use of JARGON in the corpus with average rates below 0.2 (i.e., less than 20% of words), macro-averaged across articles (see Figure 6 in the appendix). Magazines also had the highest use of ACTIVE and some of the highest PRESENT. Additionally, theatlantic.com was the only site to have close to 5% of articles estimated to have more than 15 STORY sentences (Figure 2e).
H4: Blog sites use higher JARGON and MAIN, and lower IMPACT compared to other sites. We find mixed support for H4: blog sites do use higher JARGON and MAIN, but not lower IMPACT compared to other sites. The two blog sites, sciencedaily.com and phys.org, used the highest and third highest amount of JARGON (both above 20%), and showed high levels of MAIN compared to other sites, especially phys.org, which had one of the highest rates of MAIN (close to 0.20, or 20% of sentences, see Figure 6 in the appendix). However, we do not find that these blog sites used lower IMPACT; in fact, we see the opposite. Blog sites use almost the same level of IMPACT as press releases. This might be due to some blog posts focusing on breaking science news, similar to press releases (Ranger and Bultitude, 2016), or due to rehosting press releases. Figure 6 in the appendix plots the estimated proportion of sentences using each strategy for all sites (see also Figure 7 in the appendix).
Delineating venues. Based on our results for H1-H4, we see that strategies delineate different venues well. Blogs often focus on scientific terms and the main findings of a paper, press releases emphasize the impact of the findings, and magazines avoid complex scientific terms, instead telling stories and using active, engaging writing. We visualize these differences by representing each site as a vector of strategy rates (e.g., phys.org would be a vector of length eight) and calculating a single principal component from these vectors using principal component analysis. Figure 3 plots each site along this axis, showing that the four venues we explore (blogs, press releases, magazines, and science journal magazines) group together clearly. The one overlap is science journal magazines, which fall between magazines and press releases. This is especially interesting because journal magazines both advertise research published in their journals (e.g., nature.com reports Nature findings) and are closer in length to magazine articles, making them a mixture of press releases and magazines.
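The projection onto a single principal component can be sketched with a singular value decomposition. This is an illustration under assumptions: it uses NumPy, mean-centers the strategy rates without rescaling them (the text does not say whether features were standardized), and the site names and rates in the usage example are made up.

```python
import numpy as np


def first_principal_component(site_vectors):
    """Project each site's strategy-rate vector onto the first principal axis.

    site_vectors: array-like of shape (n_sites, n_strategies).
    Returns (scores, explained_ratio): one score per site, and the share of
    total variance captured by the first component.
    """
    X = np.asarray(site_vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ Vt[0]
    explained_ratio = singular_values[0] ** 2 / np.sum(singular_values ** 2)
    return scores, explained_ratio


# Hypothetical rates for three sites over eight strategies (illustrative only).
example_sites = {
    "blog.example": [0.02, 0.20, 0.10, 0.01, 0.05, 0.22, 0.60, 0.50],
    "press.example": [0.03, 0.15, 0.12, 0.01, 0.08, 0.20, 0.62, 0.52],
    "magazine.example": [0.04, 0.10, 0.06, 0.03, 0.18, 0.15, 0.70, 0.60],
}
scores, ratio = first_principal_component(list(example_sites.values()))
```

The sign of a principal component is arbitrary, so only the relative ordering and spacing of sites along the axis is meaningful.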
H5: Higher quality science news articles use more STORY and ANALOGY than lower quality articles. To evaluate this hypothesis, we use the corpus of New York Times science articles introduced by Louis and Nenkova (2013a). The corpus consists of three labels of article quality: TYPICAL, VERY GOOD, and GREAT. These labels were drawn from whether the article appeared in that year's "Best American Science Writing" anthology (GREAT), was written by an author whose work had appeared in the year's anthology (VERY GOOD), or was neither (TYPICAL). The articles were drawn from the New York Times annotated corpus (Sandhaus, 2008) and filtered for science-related keywords (e.g., biology, biologist).
For a clear differentiation of article quality, we apply our strategy classifiers to only the GREAT and TYPICAL articles in the dataset. We select science articles from the years 2001 to 2007 for a total of 55 GREAT articles (6,211 sentences) and 15,532 TYPICAL articles (1,079,768 sentences). To test for significance we perform χ² tests of independence and augment these with the φ coefficient, which is similar to Cohen's d as an effect size calculation for categorical variables (Fleiss, 1994).
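For a 2x2 comparison (strategy present or absent, by article group), the χ² statistic and φ coefficient can be computed directly. The sketch below is a generic illustration rather than the authors' analysis code: it applies no continuity correction, takes sentence counts as input, and the positive counts in the usage example are invented (only the group totals come from the text).

```python
import math


def chi2_and_phi(pos_a, total_a, pos_b, total_b):
    """Chi-squared test of independence and phi for a 2x2 table.

    pos_a / total_a: sentences with the strategy / all sentences in group A.
    pos_b / total_b: the same for group B.
    Returns (chi2, phi, p_value) with one degree of freedom.
    """
    a, b = pos_a, total_a - pos_a
    c, d = pos_b, total_b - pos_b
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    phi = math.sqrt(chi2 / n)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, phi, p_value


# Group sizes from the text; the positive counts are hypothetical.
chi2, phi, p_value = chi2_and_phi(pos_a=1_000, total_a=6_211,
                                  pos_b=130_000, total_b=1_079_768)
```

With more than a million sentences, even very small differences reach significance, which is why reporting φ alongside the p-value is informative.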
H6: Press releases that sensationalize the claims of the original journal paper use higher IMPACT and MAIN than press releases that do not. Sumner et al. (2014) introduced a dataset of 462 press releases annotated for the three categories of sensationalism and their associated journal articles from 20 prominent U.K. universities. We split press releases into 'sensationalized' and 'not sensationalized' for each area of sensationalism.
Results: H6 is not supported. The difference in IMPACT and MAIN usage is small and not significant for any type of sensationalism. Partly this is an encouraging result, as it suggests that using the strategies does not risk sensationalizing the science. Future work might explore strategies that help writers avoid sensationalism.

Related Work
We highlight additional areas of research relevant to our work beyond those already discussed.
Science communication. Over the past twenty years, science communication has shifted from improving scientific literacy to fostering participation in science (Hetland, 2014). A growing body of research shows that scientific literacy is only one of many factors that influence public decision making and cannot be divorced from cultural values (Nisbet and Scheufele, 2009; Bubela et al., 2009).
Scientific writing. There is a wealth of work exploring writing in scientific journals (i.e., when scientists communicate within their discipline). Because of the natural structure of scientific journal papers, much work has looked at ways of automatically identifying content in these papers (Guo et al., 2010; Liakata et al., 2012). Kröll et al. (2014) examined the use of guidelines for science journal papers. Our work instead focuses on general science writing.

Future Work
We annotated writing strategies at the sentence level, but some strategies, such as STORY and ANALOGY, might be better annotated at the fragment level to account for longer or shorter use of strategies. Future work can explore more fine-grained analysis of these strategies (e.g., with metaphor detectors; Gao et al., 2018). We also hope to build on these findings by exploring how effective the strategies are at engaging different readers.

Conclusion
In this paper we compile writing strategies from theory and practical advice, collect a large corpus, and annotate a subset of it to measure the strategies' use. We observe how strategies covary with intended audience. For example, blog sites, which target researchers, use more jargon and focus on the main findings of a paper, while magazine articles, which target a much broader audience of readers, tell more stories and use more active voice. Our findings also suggest that science newspaper articles judged by experts to have higher quality use more metaphorical language and tell more stories. We expect that our strategy formulations, classifiers, annotations and dataset will enable NLP-powered tools to support effective science communication for different audiences.

Table 5: Resources used to define the writing strategies (some rows were lost in extraction and are marked as such).
[resource not recoverable]: Online article
Communicating with the Public from AAAS (Gagnier and Fisher, 2017): Online article
Explaining Tech to Non-Techies (Bruzzese, 2018): Online article
Tips for Communicating Scientific Research to Non Experts (Scientifica): Online [type truncated]
[title not recoverable] (Ranger and Bultitude, 2016): Journal article
Science Journalism (Writing for a General Audience) (Crawford): Book chapter
A Field Guide for Science Writers (Blum et al., 2006): Book
The Science Writer's Handbook (Hayden et al., 2013): Book

A.1 Open-Coding Details
Using the selected style guides, two authors open-coded (Holton, 2007) the guidelines from each guide and grouped these guidelines into suggested writing strategies. Some resources had lists of guidelines (e.g., "12 ways to..."), for which the authors coded each listed guideline as a separate strategy. For resources in prose (e.g., books and academic articles), the authors highlighted all guidance on writing strategies for science communication (e.g., "Have an engaging, new, first sentence."). Because the eventual goal was to identify these strategies in a document, the authors focused on document-specific strategies rather than process-specific strategies (e.g., "make sure to have a friend read the draft before sending it in."). Table 5 lists all resources used.

A.2 Corpus Collection
We selected universities based on the Carnegie Classification of Institutions of Higher Education, choosing large 4-year universities in the US with doctoral programs and very high research activity (i.e., R1 institutions). We additionally filtered for STEM-dominant research institutions. We randomly sampled 10 university websites from this filtered set of universities; however, many universities either did not have a single unified press department (e.g., each school handled press separately), or the majority of their press was unrelated to research output. As Table 1 shows, a majority of articles came from blog sites, while few came from press releases. This is because press releases focus on research coming from that particular institution, substantially limiting the number of articles produced by these sites.

A.3 Cleaning Keywords
We selected the following keywords for filtering based on inspection of politicized articles from the sites we scraped between 2016 and 2019. All keywords are lowercased: 'trump', 'president', 'republican', 'refugee', 'congress', 'country', 'obama', 'senate', 'white house', 'democrat', 'political', 'epa', 'attorney', 'politics'. An article was considered political if the title contained any of the keywords and the body contained at least 4 of the keywords. We inspected all articles selected for annotation (337) and found none that were political.

Table 6 reports Krippendorff's α at the sentence level for each strategy. Because most strategies do not occur often (i.e., usually in fewer than 10% of sentences), the simple agreement rate skews high due to the majority of negative examples. α corrects for this skew by taking into account the agreement expected by random chance.
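To illustrate why simple agreement skews high for rare strategies, the following minimal sketch computes Krippendorff's α for the simplest case: two annotators, nominal labels, and no missing values. The function name and the toy labels are hypothetical and are not taken from our annotation data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal labels, no missing values."""
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must label the same units")
    # Coincidence counts: every unit contributes the ordered pairs (a, b) and (b, a).
    coincidences = Counter()
    for a, b in zip(coder_a, coder_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    # Marginal totals per label; they sum to twice the number of units.
    totals = Counter()
    for (label, _), count in coincidences.items():
        totals[label] += count
    n = sum(totals.values())
    observed = sum(c for (a, b), c in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a, b in permutations(totals, 2))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    # alpha = 1 - D_o / D_e, written with the shared normalizers cancelled.
    return 1.0 - (n - 1) * observed / expected


# Toy example (hypothetical): a strategy marked in 8 of 100 sentences by each coder,
# with the two coders overlapping on only 4 of them.
coder_a = [1 if i < 8 else 0 for i in range(100)]
coder_b = [1 if 4 <= i < 12 else 0 for i in range(100)]
raw_agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / 100
alpha = krippendorff_alpha_nominal(coder_a, coder_b)
```

In this toy case the raw agreement is 0.92 because both coders mostly agree on negatives, while α is about 0.46, reflecting that they agree on only half of the positive sentences.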
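The political-article rule from A.3 (any keyword in the title, at least 4 keywords in the body) can be sketched as follows. This is an illustration, not our exact filtering code: the function names are hypothetical, and two details are assumptions, namely that keywords are matched as whole words or phrases and that "at least 4" counts distinct keywords rather than total occurrences.

```python
import re

# Keywords as listed in A.3, all lowercased.
KEYWORDS = [
    "trump", "president", "republican", "refugee", "congress", "country",
    "obama", "senate", "white house", "democrat", "political", "epa",
    "attorney", "politics",
]


def keywords_found(text):
    """Return the set of keywords occurring in text.

    Assumption: whole-word matching, so 'epa' does not match 'separate'.
    """
    lowered = text.lower()
    return {
        kw for kw in KEYWORDS
        if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
    }


def is_political(title, body, min_body_keywords=4):
    """Apply the A.3 rule: a keyword in the title and >= 4 keywords in the body.

    Assumption: the body threshold counts distinct keywords.
    """
    title_hit = bool(keywords_found(title))
    body_hits = len(keywords_found(body))
    return title_hit and body_hits >= min_body_keywords
```

Requiring both a title match and several body matches keeps articles that only mention, for example, a "country" in passing from being discarded.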

A.5 Pretraining Details
We followed the pretraining recommendations of Gururangan et al. (2020) and pretrained RoBERTa in two additional steps: domain- and task-adaptive pretraining. Both steps tailor the model to domain- and task-specific language. Domain-adaptive pretraining was done on 11.90M articles from REALNEWS (Zellers et al., 2019) for 12.5k training steps, and task-adaptive pretraining was done on 100k articles from a held-out portion of our corpus for 10 epochs. Hyperparameters for pretraining are in Table 7.

A.6 Finetuning Details
Articles were broken down into sentences for classification. We employed random search for hyperparameter tuning with 5-fold cross-validation on the training set of the annotated articles. We ran a total of 10 search trials. [...] hyperparameters for our classifiers. Table 4 reports the precision, recall, accuracy, calibration error, and F1 scores of the finetuned classifiers on the held-out test set.

PRESENT | Use the present tense | "Life continues but I don't think Dominica will ever be the same again," John says.
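The tuning procedure in A.6 (random search, 5-fold cross-validation, 10 trials) can be sketched as below. This is a schematic under stated assumptions rather than our training code: `train_and_score` is a hypothetical callback standing in for finetuning a classifier on the training folds and scoring it on the held-out fold, and the search space in the usage example is invented for illustration.

```python
import random
from statistics import mean


def k_fold_indices(n_items, k, rng):
    """Yield (train, held_out) index lists for k shuffled folds."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def random_search(train_and_score, search_space, n_items,
                  n_trials=10, k=5, seed=0):
    """Sample n_trials configurations and keep the best mean k-fold score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter uniformly at random.
        config = {name: rng.choice(values)
                  for name, values in search_space.items()}
        fold_scores = [train_and_score(config, train, held_out)
                       for train, held_out in k_fold_indices(n_items, k, rng)]
        score = mean(fold_scores)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Usage with a toy scoring function (hypothetical search space and values).
def toy_score(config, train, held_out):
    return -abs(config["learning_rate"] - 2e-5)


space = {"learning_rate": [1e-5, 2e-5, 3e-5], "batch_size": [16, 32]}
best_config, best_score = random_search(toy_score, space, n_items=100)
```

Averaging over folds before comparing configurations reduces the chance that a configuration is selected because of one favorable split.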

High ACTIVE
The challenges associated with news writing, meanwhile, are...well, they're challenging.
All four regularly write about policy in popular news outlets - particularly prolific are Frakt and Carroll, who write for The New York Times.
For example, they are more likely to be immigrants.
According to a team of scientists led by Nenad Sestan at Yale School of Medicine, this process might play out over a much longer time frame, and perhaps isn't as inevitable or irreparable as commonly believed.
Observations from these scopes could reveal the planet's rotation rate, the composition and thickness of its atmosphere, and whether it has clouds.

Low ACTIVE
These secondary sediments were later eroded in the delta, exposing an inverted relief of the structure that is observed today.
According to the World Health Organization, most significant constituents of air pollution include particulate matter (PM), ozone, nitrogen dioxide, and sulfur dioxide.
Ke, working together with his graduate student Pengfei Wang, was instrumental in advancing the technology to its new version.
Being able to touch, explore the shape, feel the weight and even smell the replica of an artefact has the potential to transform cultural heritage experiences.
Some deployments might seem unusual.

High PRESENT
A stubborn myth persists that when policymakers manage recreational fishing they're managing a food source.
Professor Tanja Kallio and doctoral candidate Sami Tuomi consider the realisation of this goal entirely possible.
"However, scientifically we are in the dark about the consequences of rewilding, and we worry about the general lack of critical thinking surrounding these often very expensive attempts at conservation.
They also suggest that angler organizations should be more involved in promoting more responsible management processes and monitoring.
First, the carbon nanotube and a solvent are combined in one vessel, while a nitrogen-containing compound and a solvent go into another.

Low PRESENT
The archaeologists identified the remains of Captain Matthew Flinders by the lead plate placed on top of his coffin.
His team found a way to reengineer inhibitory interneurons to improve their function.
"We were very lucky that Captain Flinders had a breastplate made of lead, meaning it would not have corroded."
Near-infrared observations conducted with SPHERE allowed the astronomers to decompose the observed continuum emission into four components: young stellar population (about 120 million years old), hot dust (with a temperature of around 800 K), scattered light from the hidden Seyfert 1 nucleus and a very hot stellar background.
These chemicals are potentially found in a huge variety of everyday products, including disinfectants, pesticides and toiletries.

High JARGON
The researchers found that sustained and unprecedented rise in infant mortality in England from 2014 to 2017 was not experienced evenly across the population.
Exploiting this reduction of complexity and degree of control the team was able to monitor the microscopic processes in their quantum many body system and to identify ways to enhance and manipulate the magnetic order in their system.
Often patients have to stop taking medication because of adverse side effects and wait for their bodies to recover before they can begin again, Shimada said.
The next step for Fang and his research team is to develop computer stimulations to understand the effects of nanoparticle shapes sizes and surface modifiers.
Exposure to potentially harmful chemicals is a reality of life.

Low JARGON
We are looking for an alternative location outside of Amsterdam, the plan says.
These days unlicensed, recognizable portrayals of guns in games look from the outside the same as they did in the days of marketing deals: the guns look real and shoot well.
Dora Linda Nishihara was driving in San Antonio one dark evening in early December when she suddenly disappeared from sight.
The others didn't respond to requests for comment.
"We were told if we become the first couple to do this experiment we'll become famous and HBO already tried to reach me", Yevgenievna says.
She has been deaf since birth.