Finding Pragmatic Differences Between Disciplines

Scholarly documents vary widely in both content (semantics) and structure (pragmatics). Prior work in scholarly document understanding emphasizes semantics through document summarization and corpus topic modeling but tends to omit pragmatics such as document organization and flow. Using a corpus of scholarly documents across 19 disciplines and state-of-the-art language modeling techniques, we learn a fixed set of domain-agnostic descriptors for document sections and "retrofit" the corpus to these descriptors (also referred to as "normalization"). Then, we analyze the position and ordering of these descriptors across documents to understand the relationship between discipline and structure. We report within-discipline structural archetypes, variability, and between-discipline comparisons, supporting the hypothesis that scholarly communities, despite their size, diversity, and breadth, share similar avenues for expressing their work. Our findings lay the foundation for future work in assessing research quality, domain style transfer, and further pragmatic analysis.


Introduction
Disciplines such as art, physics, and political science contain a wide array of ideas, from specific hypotheses to wide-reaching theories. In scholarly research, authors face the challenge of clearly articulating a set of those ideas and relating them to each other, with the ultimate goal of expanding our collective knowledge. In order to understand this work, human readers situate meaning in context (Garten and Dehghani, 2019). Similarly, methods for scholarly document processing (SDP) have semantic and pragmatic orientations.
The semantic orientation seeks to understand and evaluate the ideas themselves through information extraction (Singh et al., 2016), summarization (Chandrasekaran et al., 2020), automatic fact-checking (Sathe et al., 2020), etc. The pragmatic orientation, on the other hand, seeks to understand the context around those ideas through rhetorical and style analysis (August et al., 2020), corpus topic modeling (Paul and Girju, 2009), quality prediction (Maillette de Buy Wenniger et al., 2020), etc. Although both orientations are essential for understanding, the pragmatics of disciplinary writing remain only weakly understood.
In this paper, we investigate the structures of disciplinary writing. We claim that a "structural archetype" (defined in Section 3) can succinctly capture how a community of authors chooses to organize its ideas for maximum comprehension and persuasion. Analogous to how syntactic analysis deepens our understanding of a given sentence and document structure analysis deepens our understanding of a given document, structural archetypes, we argue, deepen our understanding of domains themselves.
In order to perform this analysis, we classify sections according to their pragmatic intent. We contribute a data-driven method for deriving the types of pragmatic intent, called a "structural vocabulary", alongside a robust method for this classification. Then, we apply these methods to 19k scholarly documents and analyze the resulting structures.

Related Work
We draw from two areas of related work in SDP: interdisciplinary analysis and rhetorical structure prediction.
In interdisciplinary analysis, we are interested in comparing different disciplines, whether by topic modeling between select corpora/disciplines (Paul and Girju, 2009) or by domain-agnostic language modeling. These comparisons are more than simply interesting; they allow for models that can adapt to different disciplines, improving generalizability for downstream tasks like information extraction and summarization.
In rhetorical structure prediction, we are interested in the process of implicature, whether by describing textual patterns in an unsupervised way (Ó Séaghdha and Teufel, 2014) or by classifying text as having a particular strategy like "statistics" (Al-Khatib et al., 2017) or "analogy" (August et al., 2020). These works descend from argumentative zoning (Lawrence and Reed, 2020) and the closely related rhetorical structure theory (Mann and Thompson, 1988), which argue that many rhetorical strategies can be described in terms of units and their relations. These works are motivated by downstream applications such as predicting the popularity of a topic (Prabhakaran et al., 2016) and classifying the quality of a paper (Maillette de Buy Wenniger et al., 2020).
Most similar to our work is Arnold et al. (2019). Here, the authors provide a method of describing Wikipedia articles as a series of section-like topics (e.g. disease.symptom) by clustering section headings into topics and then labeling words and sentences with these topics. We build on this work by using domain-agnostic descriptors instead of domain-specific ones and by comparing structures across disciplines.

Methods
In this section, we define structural archetypes (3.1) and methods for classifying pragmatic intent through a structural vocabulary (3.2).

Structural Archetypes
We coin the term "structural archetype" to focus and operationalize our pragmatic analysis. Here, a "structure" is defined as a sequence of domain-agnostic indicators of pragmatic intent, while an "archetype" refers to a strong pattern across documents. In the following paragraphs, we discuss the components of this concept in depth.
Pragmatic Intent In contrast to verifiable propositions, "indicators of pragmatic intent" refer to instances of meta-discourse, comments on the document itself (Ifantidou, 2005). There are many examples, including background (comments on what the reader needs in order to understand the content), discussions (comments on how results should be interpreted), and summaries (comments on what is important). These indicators of pragmatic intent serve the critical role of helping readers "digest" material; without them, scholarly documents would only contain isolated facts.
We note that the boundary between pragmatic intent and argumentative zones (Lawrence and Reed, 2020) is not clear. Some argumentative zones are more suitable for the sentence- and paragraph-level (e.g. "own claim" vs. "background claim") while others are interpretative (e.g. "challenge"). This work does not attempt to draw this boundary, and the reader might find overlap between argumentative zoning work and our section types.
Sequences As a sequence, these indicators reflect how the author believes their ideas are best received in order to remain coherent. For example, many background indicators reflect a belief that the framing of the work is especially important.
Domain-agnostic archetypes Finally, the specification that indicators must be domain-agnostic and that the structures should be widely-held are included to allow for cross-disciplinary comparisons.
We found that the most straightforward way to implement structural archetypes is through classifying section headings according to their pragmatic intent. This brings a few challenges: (1) defining a set of domain-agnostic indicators, which we refer to as a "structural vocabulary"; (2) parsing a document to obtain its structure; and (3) finding archetypes from document-level structures. In the following section, we address (1) and (2), and in Section 4 we address (3).

Deriving a Structural Vocabulary
Although indicators of pragmatic intent can exist at the sentence level, we follow Arnold et al. (2019) and create a small set of types that are loosely related to common section headings (e.g. "Methods"). We call this set a "structural vocabulary" because it functions analogously to a vocabulary of words; any document can be described as a sequence of items taken from this vocabulary. The types should satisfy three properties: (A) domain independence: types should be used across different disciplines; (B) high coverage: unlabeled instances should be classifiable as a particular type; and (C) internal consistency: types should accurately reflect their instances.
Domain Independence As pointed out by Arnold et al. (2019), there exists a "vocabulary mismatch problem": different disciplines talk about their work in different ways. Indeed, 62% of the sampled headings appear only once and are poor choices for section types. The most frequent headings, on the other hand, are much better choices, especially those that appear in all domains.
After merging a few popular variations among the top 20 section headings (e.g. conclusion and summary, background and related work), we arrive at the following types: introduction (a section which introduces the reader to potentially new concepts; n = 10916), methods (a section which details how a hypothesis will be tested; n = 2116), results (a section which presents findings of the method; n = 3119), discussion (a section which interprets and summarizes the results; n = 3118), conclusion (a section which summarizes the entire paper; n = 7738), analysis (a section which adds additional depth and nuance to the results; n = 951), and background (a section which connects ongoing work to previous related work; n = 800). Figure 2 contains discipline-level counts.
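The merging step above can be sketched as a lookup table from raw headings to canonical types. The exact list of merged variants is an assumption for illustration; only the variants named in the text (summary into conclusion, related work into background) are taken from the paper.

```python
# Hypothetical mapping from frequent raw headings to the seven
# canonical section types; the full merge list is an assumption.
HEADING_TO_TYPE = {
    "introduction": "introduction",
    "methods": "methods",
    "results": "results",
    "discussion": "discussion",
    "conclusion": "conclusion",
    "summary": "conclusion",       # merged variant (from the text)
    "analysis": "analysis",
    "background": "background",
    "related work": "background",  # merged variant (from the text)
}

def normalize_heading(heading):
    """Return the canonical type for a raw heading, or None if unseen."""
    return HEADING_TO_TYPE.get(heading.strip().lower())
```

Headings outside the table fall through to None and are handled by the embedding-based classifier described next.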
High Coverage We can achieve high coverage by classifying any section as one of these section types through language modeling. Specifically, the hidden representation of a neural language model h(·) can act as an embedding of its input. We use the [CLS] token of SciBERT's hidden layer, selected for its robust representations of scientific literature (Beltagy et al., 2019). To classify, we define a distance score d(·) for a section s and a type T as the distance between h(s) and the average embedding across all instances of that type, i.e.

$d(s, T) = \left\lVert h(s) - \frac{1}{|T|} \sum_{t \in T} h(t) \right\rVert$
Note that since the embedding is a vector, addition and division are elementwise. Then, we compute the distance for all types in the vocabulary V and select the minimum, i.e.

$s_{\text{type}} = \arg\min_{T \in V} d(s, T)$
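The nearest-centroid rule above can be sketched as follows, assuming embeddings (e.g. SciBERT [CLS] vectors) have already been computed; the function names `type_centroids` and `classify` are ours, not the paper's.

```python
import numpy as np

def type_centroids(examples):
    """Average embedding per type. `examples` maps each type name to an
    array of shape (n_instances, dim) of h(.) embeddings."""
    return {T: vecs.mean(axis=0) for T, vecs in examples.items()}

def classify(h_s, centroids):
    """Assign a section embedding h_s to the nearest type centroid,
    using the Euclidean distance d(s, T) from the text."""
    dists = {T: np.linalg.norm(h_s - c) for T, c in centroids.items()}
    return min(dists, key=dists.get)
```

The elementwise mean in `type_centroids` corresponds to the (1/|T|) Σ h(t) term, and `np.linalg.norm` to the vector norm in d(s, T).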
Internal Consistency Some sections do not adequately fit any section type, so nearest-neighbor classification will result in very inconsistent clusters. We address this problem by imposing a threshold on the maximum distance for d(·). Further, since the types have unequal variance (that is, the ground truth for some types are more consistent than other types), we define a type-specific threshold as half of the distance from the center of T to the furthest member of T , i.e.

$\text{threshold}(T) = 0.5 \cdot \max_{t \in T} d(t, T)$
The weight of 0.5 was found to remove outliers appropriately and maximize retrofitting performance (Section 4.2).
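A minimal sketch of the thresholded classifier, continuing the toy setup above; the helper names are ours, and returning None for below-threshold sections is one plausible way to leave them unlabeled.

```python
import numpy as np

def type_threshold(vecs, weight=0.5):
    """Half the distance from a type's centroid to its furthest member."""
    centroid = vecs.mean(axis=0)
    return weight * max(np.linalg.norm(v - centroid) for v in vecs)

def classify_with_threshold(h_s, examples, weight=0.5):
    """Nearest-centroid label, or None when the distance exceeds the
    type-specific threshold (i.e. the section fits no type well)."""
    best_type, best_dist = None, float("inf")
    for T, vecs in examples.items():
        d = np.linalg.norm(h_s - vecs.mean(axis=0))
        if d < best_dist:
            best_type, best_dist = T, d
    if best_dist > type_threshold(examples[best_type], weight):
        return None
    return best_type
```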
We also note that some headings, especially brief ones, leave much room for interpretation and make retrofitting challenging. We address this problem by concatenating the tokens of each section's heading and body, up to 25 tokens, as input to the language model. This ensures that brief headings contain enough information to produce an accurate representation without including too many details from the body text.

Data
We use the Semantic Scholar Open Research Corpus (S2ORC) for all analysis (Lo et al., 2020). This corpus, which is freely available, contains approximately 7.5M PDF-parsed documents from 19 disciplines, including natural sciences, social sciences, arts, and humanities. For our experiments, we randomly sample 1k documents for each discipline, yielding a total of 19k documents.

Retrofitting Performance
Retrofitting (or normalizing) section headers refers to re-labeling sections with the structural vocabulary. We evaluate retrofitting performance by manually tagging 30 instances of each section type and comparing the true labels to the predicted labels. Our method yields an average F1 of 0.76. The breakdown per section type, shown in Table 1, reveals that conclusion, background, and analysis sections were the most difficult to predict. We attribute this to a lack of textual clues in the heading and body, and also a semantic overlap with introduction sections. Future work can improve the classifier with more nuanced signals, such as position, length, number of references, etc.
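The per-type F1 breakdown can be computed directly from parallel lists of gold and predicted labels; this small helper (ours, not the paper's) makes the precision/recall bookkeeping explicit.

```python
def f1_per_type(true, pred, types):
    """Per-type F1 from parallel lists of gold and predicted labels."""
    scores = {}
    for t in types:
        tp = sum(1 for g, p in zip(true, pred) if g == t and p == t)
        fp = sum(1 for g, p in zip(true, pred) if g != t and p == t)
        fn = sum(1 for g, p in zip(true, pred) if g == t and p != t)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[t] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores
```

Averaging the resulting dictionary's values gives the macro-averaged F1 reported above.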

Analyzing Position with Aggregate Frequency
A simple yet expressive way of showing the structural archetypes of a discipline is to consider the frequency of a particular type at any point in the article (normalized by length). This analysis reveals general trends throughout a discipline's documents, such as where a section type is most frequent or where there is homogeneity.
To illustrate the practicality of this analysis, consider the hypothesis that Physics articles are more empirically-motivated while Political Science articles are more conceptually-motivated, i.e. that they are on opposing ends of the concrete versus abstract spectrum. We operationalize this by claiming that Physics articles have more methods, results, and analysis sections than Political Science. Figure 1 shows the difference between Physics and Political Science at each point in the article. It reveals not only that Physics articles contain more methods and results sections, but also that Physics articles introduce methods earlier than Political Science, and that both contain the same number of analysis sections.
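The position analysis above amounts to binning each section's relative position and computing the frequency of a type per bin. A minimal sketch, assuming each document is reduced to its sequence of retrofitted section-type labels (the function name and bin count are ours):

```python
from collections import Counter

def position_frequencies(docs, section_type, n_bins=10):
    """Relative frequency of `section_type` at each normalized position.
    Each doc is a list of section-type labels in reading order."""
    hits, totals = Counter(), Counter()
    for doc in docs:
        for i, label in enumerate(doc):
            # Normalize position by document length, then bin it.
            b = min(int(i / len(doc) * n_bins), n_bins - 1)
            totals[b] += 1
            if label == section_type:
                hits[b] += 1
    return [hits[b] / totals[b] if totals[b] else 0.0
            for b in range(n_bins)]
```

Plotting these per-bin frequencies for two disciplines, and their difference, yields curves of the kind shown in Figure 1.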

Analyzing Ordering with State Transitions
A more structural analysis of a discipline is to look at the frequency of sequence fragments by computing transition probabilities. As a second example, suppose we have a more nuanced hypothesis: that Psychology papers tend to separate claims and evaluate them sequentially (methods, results, discussion, repeat) whereas Sociology papers tend to evaluate all claims at once. We can operationalize this hypothesis by calculating the transition probability between sections $s_{i-1}$ and $s_i$ conditioned on some discipline. In Table 2, we see evidence that methods sections are more likely to be preceded by results sections in Psychology than in Sociology, implying a new iteration of a cycle. We might conclude that Psychology papers are more likely to have cyclical experiments, but not that Sociology papers conduct multiple experiments in a linear fashion.
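Estimating these transition probabilities is a straightforward bigram count over the section-type sequences of a discipline's documents; a minimal sketch (the function name is ours):

```python
from collections import Counter, defaultdict

def transition_probabilities(docs):
    """Estimate P(s_i | s_{i-1}) from section-type sequences.
    Each doc is a list of section-type labels in reading order."""
    counts = defaultdict(Counter)
    for doc in docs:
        for prev, cur in zip(doc, doc[1:]):
            counts[prev][cur] += 1
    return {prev: {cur: n / sum(c.values()) for cur, n in c.items()}
            for prev, c in counts.items()}
```

Running this separately on each discipline's documents and comparing, say, the results-to-methods entry gives the kind of contrast reported in Table 2.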

Conclusion and Future Work
In this paper, we have shown a simple method for constructing and comparing structural archetypes across different disciplines. By classifying the pragmatic intent of section headings, we can visualize structural trends across disciplines. In addition to utilizing a more complex classifier, future directions for this work include (1) further distinguishing between subdisciplines (e.g. abnormal psychology vs. developmental psychology) and document type (e.g. technical report vs. article); (2) learning relationships between structures and measures of research quality, such as reproducibility; (3) learning how to convert one structure into another, with the ultimate goal of normalizing them for easier comprehension or better models; (4) deeper investigations into the selection of a structural vocabulary, such as including common argumentative zoning types or adjusting the scale to the sentence-level; and (5) drawing comparisons, such as by clustering, between different documents based strictly on their structure.
Appendix A: Section Counts Before and After Retrofitting