The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures

Materials science literature contains millions of materials synthesis procedures described in unstructured natural language text. Large-scale analysis of these synthesis procedures would facilitate deeper scientific understanding of materials synthesis and enable automated synthesis planning. Such analysis requires extracting structured representations of synthesis procedures from the raw text as a first step. To facilitate the training and evaluation of synthesis extraction models, we introduce a dataset of 230 synthesis procedures annotated by domain experts with labeled graphs that express the semantics of the synthesis sentences. The nodes in these graphs are synthesis operations and their typed arguments, and labeled edges specify relations between the nodes. We describe this new resource in detail and highlight some specific challenges to annotating scientific text with shallow semantic structure. We make the corpus available to the community to promote further research and development of scientific information extraction systems.


Introduction
Systematically reducing the time and effort required to synthesize novel materials remains one of the grand challenges in materials science. Massive knowledge bases that tabulate known chemical reactions in organic chemistry (Lawson et al., 2014) have accelerated data-driven synthesis planning and related analyses (Segler et al., 2018; Coley et al., 2017). Automated synthesis planning for organic molecules has recently achieved human-level planning performance using massive organic reaction knowledge bases as training data (Segler et al., 2018). There are, however, currently no comprehensive knowledge bases that systematically document the methods by which inorganic materials are synthesized (Kim et al., 2017a,b). Despite efforts to standardize the reporting of chemical and materials science data (Murray-Rust and Rzepa, 1999), inorganic materials synthesis procedures continue to reside as natural language descriptions in the text of journal articles. Figure 1 presents an example of such a synthesis procedure. To achieve the same success for inorganic synthesis as has been achieved for organic materials, we must develop new techniques for automatically extracting structured representations of materials synthesis procedures from the unstructured narrative in scientific papers (Kim et al., 2017b).
To facilitate the development and evaluation of machine learning models for the automatic extraction of materials syntheses from text, in this work we present a new dataset of synthesis procedures annotated with semantic structure by domain experts in materials science. We annotate each step in a synthesis with a structured frame-semantic representation, with all the steps in a synthesis forming a directed acyclic graph (DAG). The node types in the graph include synthesis operations (i.e., predicates) and the materials, conditions, apparatus, and other entities participating in each synthesis step. Labeled edges represent relationships between entities, e.g., Condition-Of or Next-Operation. Our dataset consists of 230 synthesis procedures annotated with these structures. An example sentence-level annotation is given in Fig. 2. We make the corpus available to the community to promote further research and development of scientific information extraction systems for procedural text.

Description of the Annotated Dataset
Here we present a description of the structures we annotate (§2.2), summarize statistics of the dataset (§2.3), highlight specific annotation decisions (§2.4), and present inter-annotator agreements (§2.5). All annotations were performed by three materials scientists using the BRAT annotation tool (Stenetorp et al., 2012).
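The paper does not specify the released file layout, but annotations produced with the BRAT tool are stored in its standoff format, with one tab-separated line per entity or relation. A minimal parser sketch is given below; the sample mention text and character offsets are invented for illustration, and BRAT features such as discontinuous spans are ignored.

```python
def parse_brat(ann_text):
    """Parse BRAT standoff annotation lines into entity and relation records."""
    entities, relations = {}, []
    for line in ann_text.strip().splitlines():
        fields = line.split("\t")
        if line.startswith("T"):      # entity line: "T1<TAB>Operation 9 15<TAB>heated"
            etype, start, end = fields[1].split(" ")
            entities[fields[0]] = {"type": etype,
                                   "span": (int(start), int(end)),
                                   "text": fields[2]}
        elif line.startswith("R"):    # relation line: "R1<TAB>Condition-Of Arg1:T2 Arg2:T1"
            rtype, arg1, arg2 = fields[1].split(" ")
            relations.append({"type": rtype,
                              "arg1": arg1.split(":")[1],
                              "arg2": arg2.split(":")[1]})
    return entities, relations

# Invented sample in BRAT standoff format (offsets are illustrative only):
sample = ("T1\tOperation 9 15\theated\n"
          "T2\tCondition-Unit 25 29\tdegC\n"
          "R1\tCondition-Of Arg1:T2 Arg2:T1")
ents, rels = parse_brat(sample)
```

Entities are keyed by their BRAT identifier so that relation lines, which refer to entities by id, can be resolved after parsing.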

Selecting Synthesis Procedures for Annotation
The 230 synthesis procedures annotated were selected from our database of 2.

Structures Annotated
An annotated graph consists of nodes denoting the participants of synthesis steps and edges denoting relationships between those participants. Operation nodes define the main structure of the graph, and the arguments of each operation include materials, conditions, and apparatus. To annotate the text describing a synthesis, we define a set of span-level labels that identify the operations and typed arguments in the text, i.e., the nodes of the graph. We also define a set of relationships between text spans, which label the edges of the synthesis graph. We detail these two kinds of labels below.
Span-level Labels: Each span is a sequence of tokens or characters forming one entity mention (e.g., "quartz tube furnace"). Entity mentions are associated with entity types, which specify a category for the mention. The 10 most frequent entity types defined for our dataset are listed in Table 1a. We describe a notable subset of the entity types in more detail below:
Material: Materials that are used in the synthesis of the target. For ex: Cr2O3, Strontium carbonate, BaTiO3, Li2CO3, Water, Ethanol.
Non-recipe Material: Chemically specified materials that are not used in the synthesis of the synthesis target. For ex: "BaTiO3 powder (Ba/Ti=0.999)", where the underlined text is the span to be annotated.
Operation: Discrete actions physically performed by the researcher, or discrete process steps taken to synthesize the target.
Material-Descriptor: Describes a material's structure, shape, form, type, role, etc., and must be directly or nearly adjacent to the material it describes. Does not include amounts, concentrations, or purities of materials. For ex: CaCu3Ti4O12 compound, Copper ion, GaAs nanowires, Anatase TiO2, Deionized water.
Meta: A canonical name specifying the particular overall synthesis method used. For ex: "Graphite oxide was prepared by oxidation of graphite powder according to the modified Hummers' method".
Amount-Unit: Describes absolute amounts, concentrations, purities, ratios, flow rates, and so on. For ex: mg, mL, M, %.
Condition-Unit: Units of measurement for intangible conditions under which operations are performed. For ex: °C, K, sec, RPM, mW.
Condition-Misc: Qualitative descriptions of conditions. For ex: Room temperature, Dropwise, Naturally, Vacuum.
Synthesis-Apparatus: Equipment used to perform an operation involved in the synthesis.
Characterization-Apparatus: Equipment used to characterize a material's properties.
Relation Labels: We define a set of relationships between entity mentions, which label the edges of the synthesis graph. A subset of these relations describe direct relationships between operations and their arguments; others describe relationships between argument mentions; and the Next-Operation relation links operations to one another, as a step towards annotating full recipe graphs. The relation labels we define are tabulated in Table 1b; a subset of these are defined below:
Recipe-target: Indicates a material, attached to an operation, which is the target of the synthesis procedure.
Participant-material: A material that is part of a particular synthesis step.
Recipe-precursor: Indicates a material which is the source of an element of the target material and is used in a specific synthesis operation.
Apparatus-of: Denotes an apparatus used in a specific synthesis operation.
We refer readers to our annotation guidelines for definitions of the complete set of entity type and relation labels in the dataset.
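An annotation of this kind maps naturally onto typed nodes and labeled edges. The sketch below encodes one step of a sentence like the Fig. 2 example by hand; the character offsets and the exact container layout are hypothetical choices made for illustration, not the corpus's released schema.

```python
from dataclasses import dataclass, field

@dataclass
class Mention:
    text: str
    label: str    # entity type, e.g. "Material", "Condition-Unit"
    span: tuple   # (start, end) character offsets, illustrative here

@dataclass
class Operation:
    mention: Mention
    # Arguments are (relation label, Mention) pairs attached to this operation.
    args: list = field(default_factory=list)

# One synthesis step, encoded by hand with labels from Table 1:
cu = Mention("Cu", "Material", (0, 2))
heat = Operation(Mention("heated", "Operation", (20, 26)))
heat.args.append(("Participant-Material", cu))
heat.args.append(("Condition-Of", Mention("degC", "Condition-Unit", (35, 39))))

# The synthesis-level structure is a DAG of operations chained by
# Next-Operation edges; a single step gives a one-node graph.
recipe = [heat]
```

Keeping arguments on the operation node mirrors the paper's frame-semantic view, in which operations are the predicates and everything else attaches to them.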

Dataset Statistics
Some key statistics of the dataset, such as the number of documents, tokens, entities, and unique operations, are listed in Table 2. The dataset defines 14 relation labels (Table 1b); on average, each sentence contains 26 tokens (Fig. 3b) and each synthesis procedure contains 9 sentences (Fig. 3a).
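The unique-operation count reported in Table 2 relies on lemmatizing operation mentions (the paper uses the WordNet lemmatizer). The sketch below shows the idea with a tiny hand-rolled lemma table standing in for the real lemmatizer, so that inflected variants of the same verb are counted once.

```python
# Toy stand-in for the WordNet lemmatizer: maps inflected operation
# mentions to a base form so "heated" and "heating" count as one operation.
LEMMA = {"heated": "heat", "heating": "heat",
         "stirred": "stir", "stirring": "stir",
         "dried": "dry"}

def unique_operations(mentions):
    """Return the set of base forms of the given operation mentions."""
    return {LEMMA.get(m.lower(), m.lower()) for m in mentions}

ops = ["heated", "heating", "stirred", "stirring", "dried", "mixed"]
uniq = unique_operations(ops)   # 6 mentions collapse to 4 unique operations
```

With a full lemmatizer in place of the toy table, the same one-line set comprehension yields the Table 2 statistic.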

Annotation Decisions
Next we highlight specific points of contention in creating the current set of annotations.
What constitutes an operation? While our definition of the Operation entity type specifies that only actions performed by a lab researcher are valid operations, there are cases where our annotations diverge from this definition. This happens in the following cases:
• Cases where an operation isn't explicitly performed by the researcher. For ex: "After this, the autoclave was cooled to room temperature naturally".
• Cases with nested verb structures. For ex: "white precipitate which was harvested by centrifugation . . .".
In the current set of annotations, we allow experts to decide when a particular candidate operation should be considered valid and when it can be omitted. As our inter-annotator agreements demonstrate, experts tend to agree on what should be considered an operation. The question of what constitutes an operation is analogous to the question of what constitutes an "event" in the broader NLP literature, as highlighted by Mostafazadeh et al. (2016).
Argument state and argument re-use: Annotations of semantic structures often allow argument spans to have multiple parents (Surdeanu et al., 2008; Banarescu et al., 2013; Oepen et al., 2015). For example, in Figure 2, the material "Cu" could be considered an argument of both the operations "placed" and "heated". Allowing arguments to have multiple parents, however, runs into complications when an operation causes the state of a material to change (incidentally, this is not the case in the example above). When a material's state changes due to a specific operation, considering the same text span to be an argument of a different operation would not be chemically valid. Therefore, the current set of annotations does not allow arguments to have multiple parents. Tracking state itself is also complicated by the difficulty of writing down precise states at a meaningful level of granularity for all possible materials, and is further complicated by the ambiguity of under-specified materials in synthesis text, for ex: "clear solution" and "black slurry" in Fig. 1.
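The single-parent constraint described above is easy to check mechanically. The sketch below assumes relations are stored as (operation, label, argument) triples, a representation chosen here purely for illustration, and flags any argument attached to more than one operation.

```python
def check_single_parent(relations):
    """Return the set of argument ids attached to more than one operation.

    `relations` is a list of (operation_id, relation_label, argument_id)
    triples; an empty result means the single-parent constraint holds.
    """
    seen, violations = set(), set()
    for _op, _label, arg in relations:
        if arg in seen:
            violations.add(arg)
        seen.add(arg)
    return violations

# Hypothetical example: the span "Cu" reused by two operations is flagged.
rels = [("placed", "Participant-Material", "Cu"),
        ("heated", "Participant-Material", "Cu")]
violations = check_single_parent(rels)
```

A check like this could run as a validation pass during annotation, surfacing spans that annotators would instead need to re-mention or leave attached to a single operation.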
Relations across sentences: Often, a given synthesis step is described across multiple sentences. In these cases it would be meaningful to allow relationships between operation and argument entities that are in different sentences. For the sake of simplicity, and to stay closer to a sentence-level shallow semantic annotation, the current iteration of the annotations avoids such links; however, a very small number of cross-sentence relations do exist (< 1% of all relations in the dataset). An example of this type is as follows: "First, sulfuric acid and nitric acid were mixed well by stirring 15 min in an ice bath, and then graphite powder was dispersed into the solution. After 15 min, potassium chlorate was added into the system very slowly to prevent strong reaction during the oxidation process." Here "min" and "dispersed" are related by a Condition-Of relation. Synthesis procedures that would have required annotation primarily of cross-sentence relations were excluded.

Inter-annotator Agreement
Next we report inter-annotator agreements for the different levels of semantic annotation in our dataset. The agreements we report are based on a collection of 5 documents which were annotated separately by all three expert annotators. All the numbers we report are averages over agreements between all pairs of the 3 expert annotators.
Span-level Labels: Agreement on span-level labels corresponds to agreement on the entity type labels assigned to individual tokens. We observe an overall percent agreement on token-level labels of 90.1%. A breakdown of this agreement by entity type is presented in Table 3a. As it indicates, agreement is high on labels with clear definitions, e.g., Number and Amount-Unit. Labels which are by definition more ambiguous, however, see lower agreement, e.g., Material-Descriptor and Condition-Misc.
Relation Labels: Agreement on relation labels was computed for the set of cases where a pair of annotators agreed on the token-level annotations. For a pair of entities, the annotators are considered to be in agreement if both indicate the same relation type. For relation labels we observe a percent agreement of 86.9%.
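The pairwise averaging used for the token-level numbers above can be sketched as follows; the label sequences here are toy data, not the actual annotations.

```python
from itertools import combinations

def pairwise_percent_agreement(annotations):
    """Average token-label agreement over all pairs of annotators.

    `annotations` is a list of equal-length label sequences, one per annotator.
    """
    pair_scores = []
    for a, b in combinations(annotations, 2):
        matches = sum(x == y for x, y in zip(a, b))
        pair_scores.append(matches / len(a))
    return sum(pair_scores) / len(pair_scores)

# Three annotators labeling the same three tokens (toy example):
labels = [["Material", "O", "Operation"],
          ["Material", "O", "Operation"],
          ["Material", "Number", "Operation"]]
score = pairwise_percent_agreement(labels)   # averages the 3 pairwise scores
```

Restricting the relation-label computation to token spans the annotators already agree on, as described above, amounts to filtering the inputs before applying the same pairwise average.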

Related Work
Shallow semantic parsing in NLP: Prior work in the NLP community has defined and annotated semantic structures for text. These structured representations often seek to generalize about sentence-level predicate-argument structure, abstracting away from the surface nuances of natural language while representing its semantics (Abend and Rappoport, 2017). A large body of work has created such resources for non-scientific text, as in PropBank (Palmer et al., 2005; Surdeanu et al., 2008), FrameNet (Fillmore and Baker, 2010), AMR (Banarescu et al., 2013), semantic dependencies (Oepen et al., 2015), and ACE event schemas (Doddington et al., 2004). The GENIA project has defined event structures for biomedical data (Kim et al., 2003), while Garg et al. (2016) extended the AMR framework to biomedical text. Closer still to the work presented here, Mori et al. (2014) annotated cooking recipes with sentence- and discourse-level semantic relations. There has also been interest in labeling scientific wet-lab protocol text with semantic structures to facilitate training supervised models for the extraction of these structures (Kulkarni et al., 2018). Kulkarni et al. make use of an altered version of the EXACT2 ontology, created for the annotation of biomedical procedural text (Soldatova et al., 2014). The dataset presented here fits within this theme of sentence-level semantics for procedural text, specifically tailored to materials science synthesis.
Materials Science & Chemistry: Prior work in the materials science community has shown that manual extraction and subsequent text mining can be an effective approach to analyzing synthesis routes for specific materials (Raccuglia et al., 2016; Ghadbeigi et al., 2015); these approaches, however, have been limited in scale. There is also a consensus that comprehensively extracting the knowledge contained in written inorganic materials syntheses is a key step towards reducing the overall discovery and development time for novel materials (Butler et al., 2018). The focus of existing resources in the materials science community, however, has been on knowledge bases of materials structures and properties (Jain et al., 2013), rather than on reactions and synthesis. In pursuit of more scalable methods for materials synthesis data extraction, Young et al. (2018) have used automated methods for extracting specific categories of materials synthesis parameters, while Mysore et al. (2017) and Kim et al. (2017a) have both presented methods for automated text extraction from materials science literature. However, these lines of work have not provided general-purpose annotated data with which to train information extraction models. We believe this work fills that gap.

Conclusion and Future Directions
In this work we present a shallow semantic parsing dataset consisting of 230 synthesis procedures.The dataset was annotated by domain experts in materials science.We also highlight specific difficulties in the annotation process and present agreement metrics on the different levels of our annotation.We believe the dataset will enable the development of robust supervised entity tagging models and is suitable for evaluating models trained to extract shallow semantic structures.
Future work in the development of this dataset could involve methods for scaling up the annotation process, perhaps by adapting the guidelines to enable annotation by non-experts at some stages of the process. We also plan to add additional layers of annotation, including: coreference relations between synthesis steps, states of argument entities, and links from annotated entities to entries in materials science knowledge bases such as The Materials Project.

Figure 1: Example synthesis procedure text from a materials journal article (Dong et al., 2009). Bold red indicates the operations (predicates) involved in the synthesis; bold black indicates arguments; underlines demarcate entity boundaries.

Figure 2: An example annotated sentence. Shallow semantic structures generally consist of verbal predicates and arguments of these predicates as nodes, with labeled edges between predicate and argument nodes, e.g., Heated(Condition-Of: degC, Atmospheric-Material: H2, Condition-Of: mTorr). We also label relations between argument entities and non-predicate entities, e.g., Descriptor-Of(Cu, foils), and relations between predicates, e.g., Next-Operation(placed, heated).
(a) Sentences per synthesis document. (b) Tokens per sentence.

Figure 3: Sentence count statistics of the corpus. On average a synthesis procedure contains 9 sentences, each of which contains 26 tokens on average.

Table 1: Entity types and relation labels annotated in our dataset. Table (a) depicts the 10 most frequent of the 21 entity types defined in our dataset, and table (b) lists the 14 relation labels among entities.

Table 2: Various dataset statistics. Additional details are provided in the referenced figures. To determine unique operations, mentions are lemmatized with the WordNet lemmatizer.

Table 3: Annotator agreements in our dataset. Table (a) depicts percent agreements on the 10 most frequent of the 21 entity types defined in our dataset, and table (b) gives overall agreements on the different annotation layers in our dataset.