The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain

This paper presents a new challenging information extraction task in the domain of materials science. We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications, such as involved materials and measurement conditions. With this paper, we publish our annotation guidelines, as well as our SOFC-Exp corpus consisting of 45 open-access scholarly articles annotated by domain experts. A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition and slot filling tasks as well as high annotation quality. We also present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set. On all tasks, using BERT embeddings leads to large performance gains, but with increasing task complexity, adding a recurrent neural network on top seems beneficial. Our models will serve as competitive baselines in future work, and analysis of their performance highlights difficult cases when modeling the data and suggests promising research directions.


Introduction
The design of new experiments in scientific domains heavily depends on domain knowledge as well as on previous studies and their findings. However, the amount of publications available is typically very large, making it hard or even impossible to keep track of all experiments conducted for a particular research question. Since scientific experiments are often time-consuming and expensive, effective knowledge base population methods for finding promising settings based on the published research would be of great value (e.g., Auer et al., 2018; Manica et al., 2019; Strötgen et al., 2019; Mrdjenovich et al., 2020). While such real-life information extraction tasks have received considerable attention in the biomedical domain (e.g., Cohen et al., 2017; Demner-Fushman et al., 2018, 2019), there has been little work in other domains (Nastase et al., 2019), including materials science (with the notable exception of the work by Mysore et al., 2017, 2019).
In this paper, we introduce a new information extraction use case from the materials science domain and propose a series of new challenging information extraction tasks. We target publications about solid oxide fuel cells (SOFCs) in which the interdependence between chosen materials, measurement conditions and performance is complex (see Figure 1). For making progress within natural language processing (NLP), the genre-domain combination presents interesting challenges and characteristics, e.g., domain-specific tokens such as material names and chemical formulas.
We provide a new corpus of open-access scientific publications annotated with semantic frame information on experiments mentioned in the text. The annotation scheme has been developed jointly with materials science domain experts, who subsequently carried out the high-quality annotation. We define an "Experiment" frame and annotate sentences that evoke this frame with a set of 16 possible slots, including among others AnodeMaterial, FuelUsed and WorkingTemperature, reflecting the role the referent of a mention plays in an experiment. Frame information is annotated on top of the text as graphs rooted in the experiment-evoking element (see Figure 1). In addition, slot-filling phrases are assigned one of the types MATERIAL, VALUE, and DEVICE.
The task of finding experiment-specific information can be modeled as a retrieval task (i.e., finding relevant information in documents) and at the same time as a semantic-role-labeling task (i.e., identifying the slot fillers). We identify three sub-tasks: (1) identifying sentences describing relevant experiments, (2) identifying mentions of materials, values, and devices, and (3) recognizing mentions of slots and their values related to these experiments. We propose and compare several machine learning methods for the different sub-tasks, including bidirectional long short-term memory (BiLSTM) networks and BERT-based models. In our results, BERT-based models show superior performance. However, with increasing complexity of the task, it is beneficial to combine the two approaches.
With the aim of fostering research on challenging information extraction tasks in the scientific domain, we target the domain of SOFC-related experiments as a starting point. Our findings based on this sample use case are transferable to similar experimental domains, which we illustrate by applying our best model configurations to a previously existing related corpus (Mysore et al., 2019), achieving state-of-the-art results.
We sum up our contributions as follows: • We develop an annotation scheme for marking information on materials-science experiments in scientific publications (Section 3).
• We provide a new corpus of 45 materials-science publications in the research area of SOFCs, manually annotated by domain experts for information on experimental settings and results (Section 4). Our corpus is publicly available. Our inter-annotator agreement study provides evidence for high annotation quality (Section 5).
• We identify three sub-tasks of extracting experiment information and provide competitive baselines with state-of-the-art neural network approaches for them (Sections 4, 6, 7).
• We show the applicability of our findings to modeling the annotations of another materials-science corpus (Mysore et al., 2019, Section 7).

Related work
Information extraction for scientific publications. Recently, several studies addressed information extraction and knowledge base construction in the scientific domain (Augenstein et al., 2017; Luan et al., 2018; Jiang et al., 2019; Buscaldi et al., 2019). We also aim at knowledge base construction but target publications about materials science experiments, a domain understudied in NLP to date.
Information extraction for materials science. The work closest to ours is that of Mysore et al. (2019), who annotate a corpus of 230 paragraphs describing synthesis procedures with operations and their arguments, e.g., "The resulting [solid products]Material were ...". Operation-evoking elements ("dried") are connected to their arguments via links, and to each other to indicate temporal sequence, thus resulting in graph structures similar to ours. Their annotation scheme comprises 21 entity types and 14 relation types such as Participant-material, Apparatus-of and Descriptor-of. Kononova et al. (2019) also retrieve synthesis procedures and extract recipes, though with a coarser-grained label set, focusing on different synthesis operation types. Weston et al. (2019) create a dataset for named entity recognition on abstracts of materials science publications. In contrast to our work, their label set (e.g., Material, Application, Property) is targeted to document indexing rather than information extraction. A notable difference to our work is that we perform full-text annotation, while the aforementioned approaches annotate a pre-selected set of paragraphs (see also Kim et al., 2017). Mysore et al. (2017) apply the generative model of Kiddon et al. (2015) to induce action graphs for synthesis procedures of materials from text. In Section 7.1, we implement a similar entity extraction system and also apply our algorithms to the dataset of Mysore et al. (2019). Tshitoyan et al. (2019) train word2vec (Mikolov et al., 2013) embeddings on materials science publications and show that they can be used for recommending materials for functional applications. Other works adapt the BERT model to the clinical and biomedical domains (Alsentzer et al., 2019; Sun and Yang, 2019), or generally to scientific text (Beltagy et al., 2019).

Annotation Scheme
In this section, we describe our annotation scheme and guidelines for marking information on SOFC-related experiments in scientific publications.

Experiment-Describing Sentences
We treat the annotation task as identifying instances of a semantic frame (Fillmore, 1976) that represents SOFC-related experiments. We include (1) cases that introduce novel content; (2) descriptions of specific previous work; (3) general knowledge that one could find in a textbook or survey; and also (4) suggestions for future work.
We assume that a frame is introduced to the discourse by words that evoke the frame. While we allow any part-of-speech for such frame-evoking elements, in practice, our annotators marked almost only verbs, such as "test," "perform," and "report," with the type EXPERIMENT. In the remainder of this paper, we treat all sentences containing at least one such annotation as experiment-describing.

Entity Mention Types
In a second annotation layer, annotators mark spans with one of the following entity types. The annotations are marked only on experiment-describing sentences and on several additional sentences selected by the annotator.
MATERIAL. We use the type MATERIAL to annotate text spans referring to materials or elements. They may be specified by a particular composition formula (e.g., "La0.75Sr0.25Cr0.5Mn0.5O3") or just by a mention of the general class of materials, such as "oxides" or "hydrocarbons."
VALUE. We annotate numerical values and their respective units with the type VALUE.
DEVICE. This label is used to mark mentions of the type of device used in the fuel cell experiment (e.g., "IT-SOFC").

Experiment Slot Types
The above two steps of recognizing relevant sentences and marking coarse-grained entity types are in general applicable to a wide range of experiment types within the materials science domain. We now define a set of slot types particular to experiments on SOFCs. During annotation, we mark these slot types as links between the experiment-evoking phrase and the respective slot filler (entity mention), see Figure 1. As a result, experiment frames are represented by graphs rooted in the node corresponding to the frame-evoking element.
Our annotation scheme comprises 16 slot types relevant for SOFC experiments. Here we explain a few of these types for illustration. A full list of the slot types can be found in Supplementary Material Table 11; detailed explanations are given in the annotation guidelines published along with our corpus.
AnodeMaterial, CathodeMaterial: These slots are used to mark the fuel cell's anode and cathode, respectively. Both are entity mentions of type MATERIAL. In some cases, simple surface information indicates that a material fulfills such a role. Other cases require specific domain knowledge and close attention to the context.
FuelUsed: This slot type indicates the chemical composition or the class of a fuel or the oxidant species (indicated as a MATERIAL).
PowerDensity, Resistance, WorkingTemperature: These slots are generally filled by mentions of type VALUE, i.e., a numerical value plus a unit. Our annotation guidelines give examples of relevant units and describe special cases. This enables any materials scientist, even one who is not an expert on SOFCs, to easily understand and apply our annotation guidelines.
Difficult cases. We also found sentences that include enumerations of experimental settings, such as in the following example: "It can be seen that the electrode polarization resistances in air are 0.027 Ωcm², 0.11 Ωcm², and 0.88 Ωcm² at 800 °C, 700 °C and 600 °C, respectively." We decided to simply link all slot fillers (the various resistance and temperature values) to the same frame-evoking element, leaving the disentangling and grouping of this set of parameters to future work.

Links between Experiments
We instruct our annotators to always link slot fillers to the syntactically closest EXPERIMENT mention.
If the description of an experiment spans more than one clause, we link the two relevant EXPERIMENTs using the relation same exp. We use exp variation to link experiments done on the same cell, but with slightly different operating conditions. The link type exp variation can also relate two frame-evoking elements that refer to two measurements performed on different materials/cells, but under the same experimental conditions. In this case, the frame-evoking elements usually convey an idea of comparison, e.g., "increase" or "reach from ... to."

Corpus Statistics and Task Definitions
In this section, we describe our new corpus and propose a set of information extraction tasks that can be trained and evaluated using this dataset.
SOFC-Exp Corpus. Our corpus consists of 45 open-access scientific publications about SOFCs and related research, annotated by domain experts.
For manual annotation, we use the INCEpTION annotation tool (Klie et al., 2018). Table 1 shows the key statistics for our corpus. Sentence segmentation was performed automatically. As a preparation for experimenting with the data, we manually remove all sentences belonging to the Acknowledgment and References sections. We propose the experimental setting of using the training data in a 5-fold cross-validation setting for development and tuning, and finally applying the model(s) to the independent test set.
Task definitions. Our rich graph-based annotation scheme allows for a number of information extraction tasks. In the scope of this paper, we address the following steps: (1) identifying sentences that describe SOFC-related experiments, (2) recognizing and typing relevant named entities, and (3) extracting slot fillers from these sentences. The originally annotated graph structures would also allow for modeling as relations or dependency structures; we leave this to future work. The setup of our tasks is based on the assumption that in most cases, one sentence describes a single experiment. The validity of this assumption is supported by the observation that in almost all sentences containing more than one EXPERIMENT, the experiment-evoking verbs actually describe variations of the same experiment. (For details on our analysis of links between experiments, see Supplementary Material Section B.) In our automatic modeling, we treat slot types as entity-types-in-context, which is a valid approximation for information extraction purposes. We leave the tasks of deciding whether two experiments are the same (same exp) or whether they constitute a variation (exp variation) to future work. While our dataset provides a good starting point, tackling these tasks will likely require collecting additional data.

Inter-annotator Agreement Study
We here present the results of our inter-annotator agreement study, which we perform in order to estimate the degree of reproducibility of our corpus and to put automatic modeling performance into perspective. Six documents (973 sentences) have been annotated independently both by our primary annotator, a graduate student of materials science, and by a second annotator, who holds a Ph.D. in physics and is active in the field of materials science. The label distribution in this subset is similar to that of our overall corpus, with each annotator choosing EXPERIMENT about 11.8% of the time.
Identification of experiment-describing sentences. Agreement on our first task, judging whether a sentence contains relevant experimental information, is 0.75 in terms of Cohen's κ (Cohen, 1968), indicating substantial agreement according to Landis and Koch (1977). The observed agreement, corresponding to accuracy, is 94.9%; expected agreement amounts to 79.2%. Table 2 shows precision, recall and F1 for the doubly-annotated subset, treating one annotator as the gold standard and the other one's labels as predicted. Our primary annotator identifies 119 out of 973 sentences as experiment-describing, our secondary annotator 111 sentences, with an overlap of 90 sentences. These statistics help to gain further intuition of how well a human can reproduce another annotator's labels and can also be considered an upper bound for system performance.
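The agreement figures above follow directly from the reported counts. As an illustration, a minimal sketch that reproduces them (the observed/expected agreement and the sentence counts are the numbers reported in this section):

```python
def cohens_kappa(p_observed, p_expected):
    """Cohen's kappa: chance-corrected agreement,
    kappa = (p_o - p_e) / (1 - p_e)."""
    return (p_observed - p_expected) / (1.0 - p_expected)

def precision_recall_f1(gold_count, pred_count, overlap):
    """P/R/F1 treating one annotator as gold, the other as predicted."""
    precision = overlap / pred_count
    recall = overlap / gold_count
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Observed agreement 94.9%, expected agreement 79.2% (from the study)
kappa = cohens_kappa(0.949, 0.792)
# Primary annotator: 119 sentences, secondary: 111, overlap: 90
p, r, f1 = precision_recall_f1(gold_count=119, pred_count=111, overlap=90)
print(round(kappa, 2))  # 0.75
print(round(f1, 3))     # 0.783
```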
Entity mention detection and type assignment.
As mentioned above, relevant entity mentions and their types are only annotated for sentences containing experiment information and neighboring sentences. Therefore, we here compute agreement on the detection of entity mentions and type assignment on the subset of 90 sentences that both annotators considered as containing experimental information. We again look at precision and recall of the annotators versus each other, see Table 3.
The high precision indicates that our secondary annotator marks essentially the same mentions as our primary annotator, but recall suggests a few missing cases. The difference in marking EXPERIMENT can be explained by the fact that the primary annotator sometimes marks several verbs per sentence as experiment-evoking elements, connecting them with same exp or exp variation, while the secondary annotator links the mentions of relevant slots to the first experiment-evoking element (see also Supplementary Material Section B). Overall, the high agreement between domain expert annotators indicates high data quality.
Identifying experiment slot fillers. We compute agreement on the task of identifying the slots of an experiment frame filled by the mentions in a sentence on the subset of sentences that both annotators marked as experiment-describing. Slot fillers are the dependents of the respective edges starting at the experiment-evoking element. Table 4 shows F1 scores for the most frequent of these categories; see Supplementary Material Section C for all slot types. Overall, our agreement study provides support for the high quality of our annotation scheme and validates the annotated dataset.

Modeling
In this section, we describe a set of neural-network based model architectures for tackling the various information extraction tasks described in Section 4.
Experiment detection. The task of experiment detection can be modeled as a binary sentence classification problem. It can also be conceived as a retrieval task, selecting sentences as candidates for experiment frame extraction. We implement a bidirectional long short-term memory (BiLSTM) model with attention for the task of experiment sentence detection. Each input token is represented by a concatenation of several pretrained word embeddings, each of which is fine-tuned during training. We use the Google News word2vec embeddings (Mikolov et al., 2013), domain-specific word2vec embeddings (mat2vec, Tshitoyan et al., 2019, see also Section 2), subword embeddings based on byte-pair encoding (bpe, Heinzerling and Strube, 2018), BERT (Devlin et al., 2019), and SciBERT (Beltagy et al., 2019) embeddings. For BERT and SciBERT, we take the embedding of the first word piece as the token representation. The embeddings are fed into a BiLSTM model followed by an attention layer that computes a vector for the whole sentence. Finally, a softmax layer decides whether the sentence contains an experiment.
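The attention layer's pooling step can be sketched in plain Python: each token vector is scored against a learned vector, the scores are normalized with a softmax, and the sentence vector is the resulting weighted sum. The dimensions, token vectors and scoring vector `w` below are illustrative toy values, not the model's actual parameters.

```python
import math

def attention_pool(token_vecs, w):
    """Softmax-weighted sum of token vectors into one sentence vector.
    Each token is scored by a dot product with a learned vector w."""
    scores = [sum(wi * ti for wi, ti in zip(w, t)) for t in token_vecs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]          # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(token_vecs[0])
    return [sum(weights[i] * token_vecs[i][d] for i in range(len(token_vecs)))
            for d in range(dim)]

# Toy example: three 2-dimensional "token embeddings"
sent_vec = attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w=[1.0, 0.0])
```

The resulting sentence vector is a convex combination of the token vectors and is then passed to the final classification layer.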
In addition, we fine-tune the original (uncased) BERT (Devlin et al., 2019) as well as SciBERT (Beltagy et al., 2019) models on our dataset. SciBERT was trained on a large corpus of scientific text. We use the implementation of the BERT sentence classifier by Wolf et al. (2019), which uses the CLS token of BERT as input to the classification layer. Finally, we compare the neural network models with traditional classification models, namely a support vector machine (SVM) and a logistic regression classifier. For both models, we use the following set of input features: bag-of-words vectors indicating which 1- to 4-grams and part-of-speech tags occur in the sentence.
Entity mention extraction. For entity and concept extraction, we use a sequence-tagging approach similar to those of Huang et al. (2015) and Lample et al. (2016), namely a BiLSTM model. We use the same input representation (stacked embeddings) as above, which is fed into a BiLSTM. The subsequent conditional random field (CRF, Lafferty et al., 2001) output layer extracts the most probable label sequence. To cope with multi-token entities, we convert the labels into BIO format.
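The BIO conversion mentioned above marks the first token of a mention with `B-` and its continuation tokens with `I-`, so that adjacent mentions and multi-token mentions stay distinguishable. A minimal sketch (the example sentence and spans are invented for illustration):

```python
def to_bio(tokens, spans):
    """Convert (start, end, type) token-index spans to BIO tags.
    `end` is exclusive; spans are assumed not to overlap."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

tokens = ["tested", "at", "800", "°C", "in", "humidified", "H2"]
spans = [(2, 4, "VALUE"), (6, 7, "MATERIAL")]
print(to_bio(tokens, spans))
# ['O', 'O', 'B-VALUE', 'I-VALUE', 'O', 'O', 'B-MATERIAL']
```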
We also fine-tune the original BERT and SciBERT sequence tagging models on this task. Since we use BIO labels, we extend these models with a CRF output layer to enable them to correctly label multi-token mentions and to learn transition scores between labels. As a non-neural baseline, we train a CRF model using the token, its lemma, part-of-speech tag and mat2vec embedding as features.
Slot filling. As described in Section 4, we approach the slot filler extraction task as fine-grained entity-typing-in-context, assuming that each sentence represents a single experiment frame. We use the same sequence tagging architectures as above for tagging the tokens of each experiment-describing sentence with the set of slot types (see Table 11). Future work may contrast this sequence tagging baseline with graph-induction based frame extraction.
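For span-level evaluation of either task, a predicted BIO sequence must be decoded back into typed spans that can be compared against the gold spans. A sketch of such a decoder (this is an illustrative implementation, not the paper's evaluation code):

```python
def bio_to_spans(tags):
    """Recover (start, end, type) spans from a BIO tag sequence.
    `end` is exclusive. A stray I- tag after O or after a different
    type is treated as the start of a new span."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if start is not None:
                spans.append((start, i, etype))   # close the previous span
            start, etype = i, tag[2:]
        elif tag == "O":
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:                          # span running to the end
        spans.append((start, len(tags), etype))
    return spans

print(bio_to_spans(["O", "B-VALUE", "I-VALUE", "O", "B-MATERIAL"]))
# [(1, 3, 'VALUE'), (4, 5, 'MATERIAL')]
```

A predicted span then counts as correct only if both its boundaries and its type match a gold span exactly.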

Experiments
In this section, we present the experimental results for detecting experiment-describing sentences, entity mention extraction and experiment slot identification. For tokenization, we employ ChemDataExtractor, which is optimized for dealing with chemical formulas and unit mentions.
We tune our models in a 5-fold cross-validation setting. We also report the mean and standard deviation across those folds as development results. For the test set, we report the macro-average of the scores obtained when applying each of the five models to the test set. To put model performance in relation to human agreement, we report the corresponding statistics obtained from our inter-annotator agreement study (Section 5). Note that these numbers are based on a subset of the data and are hence not directly comparable.
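The per-fold development summary described above amounts to a mean and standard deviation over the five fold scores. A trivial sketch (the fold F1 scores below are hypothetical, not results from the paper):

```python
import statistics

def fold_summary(scores):
    """Mean and sample standard deviation of per-fold dev scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical F1 scores from five cross-validation folds
mean_f1, std_f1 = fold_summary([67.1, 68.4, 66.2, 69.0, 67.8])
```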
Hyperparameters and training. The BiLSTM models are trained with the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-3. For fine-tuning the original BERT models, we follow the configuration published by Wolf et al. (2019) and use AdamW (Loshchilov and Hutter, 2019) as optimizer and a learning rate of 4e-7 for sentence classification and 1e-5 for sequence tagging. When adding BERT tokens to the BiLSTM, we also use the AdamW optimizer for the whole model and learning rates of 4e-7 or 1e-5 for the BERT part and 1e-3 for the remainder. For regularization, we employ early stopping on the development set. We use a stacked BiLSTM with two hidden layers and 500 hidden units for all tasks with the exception of the experiment sentence detection task.
Experiment sentence detection. Table 5 shows our results on the detection of experiment-describing sentences. The neural models with byte-pair encoding embeddings or BERT clearly outperform the SVM and logistic regression models. Among the neural models, BERT and SciBERT add the most value, both when using their embeddings as another input to the BiLSTM and when fine-tuning the original BERT models. Note that even the general-domain BERT is strong enough to cope with non-standard domains. Nevertheless, models based on SciBERT outperform BERT-based models, indicating that in-domain information is indeed beneficial. For performance reasons, we use BERT-base in our experiments, but for the sake of completeness, we also run BERT-large for the task of detecting experiment sentences. Because it did not outperform BERT-base in our cross-validation based development setting, we did not further experiment with BERT-large. However, we found that it resulted in the best F1-score achieved on our test set. In general, SciBERT-based models provide very good performance and seem most robust across dev and test sets. Overall, achieving F1-scores around 67.0-68.6, such a retrieval model may already be useful in production. However, there certainly is room for improvement.

Entity mention extraction.
Table 6 provides our results on entity mention detection and typing. Models are trained and results are reported on the subset of sentences marked as experiment-describing in the gold standard, amounting to 4,590 entity mentions in total. The CRF baseline achieves comparable or better results than the BiLSTM with word2vec and/or mat2vec embeddings. However, adding subword-based embeddings (bpe and/or BERT) significantly increases the performance of the BiLSTM, indicating that there are many rare words. Again, the best results are obtained when using BERT or SciBERT embeddings or when using the original SciBERT model. It is relatively easy for all model variants to recognize VALUE, as these mentions usually consist of a number and a unit, which the model can easily memorize. Recognizing the types MATERIAL and DEVICE, in contrast, is harder and may profit from gazetteer-based extensions.
Experiment slot filling. Table 7 shows the macro-average F1 scores for our different models on the slot identification task. As for entity typing, we train and evaluate our models on the subset of sentences marked as experiment-describing, which contains 4,263 slot instances. Again, the CRF baseline outperforms the BiLSTM when using only mat2vec and/or word2vec embeddings. The addition of BERT or SciBERT embeddings improves performance. However, on this task, the BiLSTM model with (Sci)BERT embeddings outperforms the fine-tuned original (Sci)BERT model. Compared to the other two tasks, this task requires more complex reasoning and has a larger number of possible output classes. We assume that in such a setting, adding more abstraction power to the model (in the form of a BiLSTM) leads to better results. For a more detailed analysis, Table 8 shows the slot-wise results for the non-neural CRF baseline and the model that performs best on the development set: the BiLSTM with SciBERT embeddings. As in the case of entity mention detection, the models do well for the categories that consist of numeric mentions plus particular units. In general, model performance is also tied to the frequency of the slot types in the dataset. Recognizing the role a material plays in an experiment (e.g., AnodeMaterial vs. CathodeMaterial) remains challenging, possibly requiring background domain knowledge. This type of information is often not stated explicitly in the sentence, but introduced earlier in the discourse and would hence require document-level modeling.

Entity Extraction Evaluation on the Synthesis Procedures Dataset
As described in Section 2, the data set curated by Mysore et al. (2019) contains synthesis procedures annotated with operations and their arguments. To our knowledge, there have not yet been any publications on the automatic modeling of this data set. We hence compare to the previous work of Mysore et al. (2017), who perform action graph induction on a similar data set. Our implementation of BiLSTM-CRF mat2vec+word2vec roughly corresponds to their BiLSTM-CRF system. Table 9 shows the performance of our models when trained and evaluated on the synthesis procedures dataset. Detailed scores by entity type can be found in the Supplementary Material. We chose to use the data split suggested by the authors for the NER task, using 200 documents for training, and 15 documents each for the dev and test sets. Among the non-BERT-based systems, the BiLSTM variant using both mat2vec and word2vec performs best, indicating that the two pre-trained embeddings contain complementary information with regard to this task. The best performance is reached by the BiLSTM model including word2vec, mat2vec, bpe and SciBERT embeddings, with 92.2 micro-average F1 providing a strong baseline for future work.

Conclusion
We have presented a new dataset for information extraction in the materials science domain consisting of 45 open-access scientific articles related to solid oxide fuel cells. Our detailed corpus and inter-annotator agreement studies highlight the complexity of the task and verify the high annotation quality. Based on the annotated structures, we suggest three information extraction tasks: the detection of experiment-describing sentences, entity mention recognition and typing, and experiment slot filling. We have presented various strong baselines for them, generally finding that BERT-based models outperform other model variants. While some categories remain challenging, overall, our models show solid performance and thus prove that this type of data modeling is feasible and can lead to systems that are applicable in production settings. Along with this paper, we make the annotation guidelines and the annotated data freely available.
Outlook. In Section 7.1, we have shown that our findings generalize well by applying model architectures developed on our corpus to another dataset. A natural next step is to combine the datasets in a multi-task setting to investigate to what extent models can profit from combining the information annotated in the respective datasets. Further research will investigate the joint modeling of entity extraction, typing and experiment frame recognition. In addition, there are further natural language processing tasks that can be researched using our dataset. They include the detection of events and sub-events when regarding the experiment descriptions as events, and a more linguistically motivated evaluation of the frame-semantic approach to experiment descriptions in text, e.g., moving away from the one-experiment-per-sentence and one-sentence-per-experiment assumptions and modeling the graph-based structures as annotated.

A Solid Oxide Fuel Cells

Fuel cells that use a solid oxide as electrolyte (Solid Oxide Fuel Cells or SOFCs) are very efficient and cost-effective, but can only operate at high temperatures (500-1000 °C), which can cause long start-up times and fast degradation. SOFCs can be used as stationary stand-alone devices, to produce clean power for residential or industrial purposes, or integrated with other power generation systems to increase the overall efficiency.

B Data Analysis: Between-Experiment Links
As stated in Section 3, we instructed annotators to mark the closest experiment-evoking word as EXPERIMENT and link the respective slot arguments to this mention. In addition, the EXPERIMENT annotations could then be linked either by same exp or exp variation links. Table 10 shows some statistics on the number of EXPERIMENT annotations per sentence and how often the primary annotator actually made use of the possibility to link experiments. In the training data, out of 703 sentences describing experiments, 135 contain more than one experiment-evoking word, with 114 sentences containing two, 18 sentences containing three, and 3 sentences containing four EXPERIMENT annotations (see Table 10). Of the 114 sentences containing two experiment annotations, in only 2 were the EXPERIMENTs not linked to each other. Upon being shown these cases, our primary annotator judged that one of them should actually have been linked.
Next, we analyze the number of cross-sentence links. In the training data, there are 256 same exp and 93 exp variation links, of which 138 and 57, respectively, cross sentence boundaries. Cross-sentence links between experiment-evoking words and slot fillers rarely occur in our dataset (only 13 out of 2,540 times).

C Inter-annotator Agreement Study: Further Statistics
Table 11 shows the full set of statistics for the experiment slot agreement.

D Additional Experimental Results
In the following tables, we give detailed statistics for the experiments described in the main paper.
Table 12 reports full statistics for the task of identifying experiment-describing sentences, including precision and recall in the dev setting.
Table 13 reports F1 per entity type for the dev setting including standard deviations.With the exception of SVM, we downsample the non-experiment-describing sentences by 0.3.

Figure 1 :
Figure 1: Sentence describing a fuel-cell related experiment, annotated with Experiment frame information.

Table 2 :
Inter-annotator agreement study. Precision, recall and F1 for the subset of doubly-annotated documents. count refers to the number of mentions labeled with the respective type by our primary annotator.

Table 3 :
Inter-annotator agreement study. Precision, recall and F1 for labeling entity types. count refers to the number of mentions labeled with the respective type by our primary annotator.

Table 4 :
Inter-annotator agreement study. F1 was computed for the two annotators vs. each other on the set of experiment slots; IAA count refers to the number of mentions labeled with the respective type by our primary annotator in the inter-annotator agreement study (IAA).

Table 6 :
Experiments: entity mention detection and typing. Results on the test set (experiment-describing sentences only) in terms of F1; the rightmost column shows the macro-average.

Table 7 :
Experiments: slot identification. Model comparison in terms of macro F1.

Table 8 :
Experiments: slot identification. Results in terms of F1 on the test set; BiLSTM results averaged across 5 models.

Table 10 :
Data analysis. Number of EXPERIMENT annotations per sentence, and counts of links between them (within sentence). Training set: 703 experiment-describing sentences.

Table 11 :
Inter-annotator agreement study. Precision, recall and F1 scores of the two annotators vs. each other on the set of slots. IAA count refers to the number of mentions labeled with the respective type by our primary annotator in the 6 documents of the inter-annotator agreement study. train count refers to the number of instances in the training set. (Conductivity has been added to the set of slots only after conducting the inter-annotator agreement study.)

Table 12 :
Experiments: identifying experiment sentences. P, R and F1 for experiment-describing sentences.

Table 13 :
Experiments: entity mention extraction and labeling. Results on 5-fold cross validation for dev and test set (experiment-describing sentences only) in terms of F1.

Table 14 :
Experiments: tagging mention types in synthesis procedure data, most frequent entity types. Results in terms of F1. Results from Mysore et al. (2017) are not directly comparable. *Type called Descriptor in their paper.