Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

We introduce a multi-task setup of identifying entities, relations, and coreference clusters in scientific articles. We create SciERC, a dataset that includes annotations for all three tasks and develop a unified framework called SciIE with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.


Introduction
As scientific communities grow and evolve, new tasks, methods, and datasets are introduced and different methods are compared with each other. Despite advances in search engines, it is still hard to identify new technologies and their relationships with what existed before. To help researchers more quickly identify opportunities for new combinations of tasks, methods and data, it is important to design intelligent algorithms that can extract and organize scientific information from a large collection of documents.
Organizing scientific information into structured knowledge bases requires information extraction (IE) about scientific entities and their relationships. However, the challenges associated with scientific IE are greater than for a general domain. First, annotation of scientific text requires domain expertise, which makes annotation costly and limits resources. In addition, most relation extraction systems are designed for within-sentence relations. However, extracting information from scientific articles requires extracting relations across sentences. Figure 1 illustrates this problem. The cross-sentence relations between some entities can only be connected by entities that refer to the same scientific concept, including generic terms (such as the pronoun it, or phrases like our method) that are not informative by themselves. With coreference, context-free grammar can be connected to MORPA through the intermediate coreferent pronoun it. Applying existing IE systems to this data, without coreference, will result in much lower relation coverage (and a sparse knowledge base).
In this paper, we develop a unified learning model for extracting scientific entities, relations, and coreference clusters. This is different from previous work (Luan et al., 2017b; Gupta and Manning, 2011; Tsai et al., 2013; Gábor et al., 2018), which often addresses these tasks as independent components of a pipeline. Our unified model is a multi-task setup that shares parameters across low-level tasks, making predictions by leveraging context across the document through coreference links. Specifically, we extend prior work for learning span representations and coreference resolution (Lee et al., 2017; He et al., 2018). Different from a standard tagging system, our system enumerates all possible spans during decoding and can effectively detect overlapping spans. It avoids cascading errors between tasks by jointly modeling all spans and span-span relations.
arXiv:1808.09602v1 [cs.CL] 29 Aug 2018
To explore this problem, we create a dataset, SCIERC, for scientific information extraction, which includes annotations of scientific terms, relation categories and coreference links. Our experiments show that the unified model is better at predicting span boundaries, and it outperforms previous state-of-the-art scientific IE systems on entity and relation extraction (Luan et al., 2017b; Augenstein et al., 2017). In addition, we build a scientific knowledge graph integrating terms and relations extracted from each article. Human evaluation shows that propagating coreference can significantly improve the quality of the automatically constructed knowledge graph.
In summary, we make the following contributions. We create a dataset for scientific information extraction by jointly annotating scientific entities, relations, and coreference links. Extending a previous end-to-end coreference resolution system, we develop a multi-task learning framework that can detect scientific entities, relations, and coreference clusters without hand-engineered features. We use our unified framework to build a scientific knowledge graph from a large collection of documents and analyze information in scientific literature.

Related Work
More recently, two datasets in SemEval 2017 and 2018 have been introduced, which facilitate research on supervised and semi-supervised learning for scientific information extraction. SemEval 17 (Augenstein et al., 2017) includes 500 paragraphs from articles in the domains of computer science, physics, and material science. It includes three types of entities (called keyphrases): Tasks, Methods, and Materials, and two relation types: hyponym-of and synonym-of. SemEval 18 (Gábor et al., 2018) is focused on predicting relations between entities within a sentence. It consists of six relation types. Using these datasets, neural models (Ammar et al., 2017, 2018; Luan et al., 2017b; Augenstein and Søgaard, 2017) are introduced for extracting scientific information. We extend these datasets by increasing relation coverage, adding cross-sentence coreference linking, and removing some annotation constraints. Different from most previous IE systems for scientific literature and general domains (Miwa and Bansal, 2016; Xu et al., 2016; Peng et al., 2017; Quirk and Poon, 2017; Luan et al., 2018; Adel and Schütze, 2017), which use preprocessed syntactic, discourse or coreference features as input, our unified framework does not rely on any pipeline processing and is able to model overlapping spans.
While Singh et al. (2013) show improvements by jointly modeling entities, relations, and coreference links, most recent neural models for these tasks focus on single tasks (Clark and Manning, 2016; Wiseman et al., 2016; Lee et al., 2017; Lample et al., 2016; Peng et al., 2017) or joint entity and relation extraction (Katiyar and Cardie, 2017; Zhang et al., 2017; Adel and Schütze, 2017; Zheng et al., 2017). Among those studies, many papers assume the entity boundaries are given, such as Clark and Manning (2016), Adel and Schütze (2017), and Peng et al. (2017). Our work relaxes this constraint and predicts entity boundaries by optimizing over all possible spans. Our model draws from recent end-to-end span-based models for coreference resolution (Lee et al., 2017, 2018) and semantic role labeling (He et al., 2018) and extends them to a multi-task framework covering three tasks: entity, relation, and coreference identification.
Neural multi-task learning has been applied to a range of NLP tasks. Most of these models share word-level representations (Collobert and Weston, 2008; Klerke et al., 2016; Luan et al., 2016, 2017a; Rei, 2017), while Peng et al. (2017) use high-order cross-task factors. Our model instead propagates cross-task information via span representations, which is related to Swayamdipta et al. (2017).

Dataset
Our dataset (called SCIERC) includes annotations for scientific entities, their relations, and coreference clusters for 500 scientific abstracts. These abstracts are taken from 12 AI conference/workshop proceedings in four AI communities from the Semantic Scholar Corpus. SCIERC extends previous datasets in scientific articles, SemEval 2017 Task 10 (SemEval 17) (Augenstein et al., 2017) and SemEval 2018 Task 7 (SemEval 18) (Gábor et al., 2018), by extending entity types, relation types, and relation coverage, and by adding cross-sentence relations using coreference links. Our dataset is publicly available at: http://nlp.cs.washington.edu/sciIE/. Table 1 shows the statistics of SCIERC.
Annotation Scheme We define six types for annotating scientific entities (Task, Method, Metric, Material, Other-ScientificTerm and Generic) and seven relation types (Compare, Part-of, Conjunction, Evaluate-for, Feature-of, Used-for, Hyponym-Of). Directionality is taken into account except for the two symmetric relation types (Conjunction and Compare). Coreference links are annotated between identical scientific entities. A Generic entity is annotated only when the entity is involved in a relation or is coreferred with another entity. Annotation guidelines can be found in Appendix A. Figure 1 shows an annotated example.
Following annotation guidelines from QasemiZadeh and Schumann (2016) and using the BRAT interface (Stenetorp et al., 2012), our annotators perform a greedy annotation for spans and always prefer the longer span whenever ambiguity occurs. Nested spans are allowed when a subspan has a relation/coreference link with another term outside the span.
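As a rough illustration of the annotation scheme above, the label inventory and the treatment of symmetric relations could be encoded as follows. This is a minimal sketch, not the released data format: the `Span` encoding, the `Relation` tuple, and the `canonicalize` helper are hypothetical names introduced here.

```python
from typing import NamedTuple, Tuple

# Label inventories as listed in the annotation scheme.
ENTITY_TYPES = ("Task", "Method", "Metric", "Material",
                "Other-ScientificTerm", "Generic")
RELATION_TYPES = ("Compare", "Part-of", "Conjunction", "Evaluate-for",
                  "Feature-of", "Used-for", "Hyponym-Of")
# Directionality is ignored only for these two types.
SYMMETRIC_RELATIONS = frozenset({"Conjunction", "Compare"})

# Assumption: a span is an inclusive (start, end) pair of token offsets.
Span = Tuple[int, int]


class Relation(NamedTuple):
    head: Span
    tail: Span
    label: str


def canonicalize(rel: Relation) -> Relation:
    """Order the arguments of symmetric relations so that (a, b) and (b, a)
    compare equal; directed relations are returned unchanged."""
    if rel.label not in RELATION_TYPES:
        raise ValueError(f"unknown relation type: {rel.label}")
    if rel.label in SYMMETRIC_RELATIONS and rel.tail < rel.head:
        return Relation(rel.tail, rel.head, rel.label)
    return rel
```

Canonicalizing symmetric relations this way is one option for making evaluation insensitive to argument order for Conjunction and Compare while keeping direction for the other five types.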

Model
We develop a unified framework (called SCIIE) to identify and classify scientific entities, relations, and coreference clusters across sentences. SCIIE is a multi-task learning setup that extends previous span-based models for coreference resolution (Lee et al., 2017) and semantic role labeling (He et al., 2018). All three tasks of entity recognition, relation extraction, and coreference resolution are treated as multinomial classification problems with shared span representations. SCIIE benefits from expressive contextualized span representations as classifier features. By sharing span representations, sentence-level tasks can benefit from information propagated from coreference resolution across sentences, without increasing the complexity of inference. Figure 2 shows a high-level overview of the SCIIE multi-task framework.

Problem Definition
The input is a document represented as a sequence of words $D = \{w_1, \dots, w_n\}$, from which we derive $S = \{s_1, \dots, s_N\}$, the set of all possible spans. Coreference resolution is to predict the best antecedent (including a special null antecedent $\epsilon$) given a span, which is the same mention-ranking model used in Lee et al. (2017). The output structure $C$ is a set of random variables defined as: $c_i \in \{1, \dots, i-1, \epsilon\}$ for $i = 1, \dots, N$.
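Deriving the candidate set S from the words of D can be sketched as below. The sketch assumes that spans do not cross sentence boundaries and that a maximum width is imposed (the pruning section introduces such a limit W); the function name and the inclusive `(start, end)` encoding are illustrative choices, not taken from the paper.

```python
def enumerate_spans(sentences, max_width):
    """Return all candidate spans as inclusive (start, end) document-level
    token offsets, restricted to single sentences and to max_width tokens."""
    spans = []
    offset = 0  # index of the first token of the current sentence
    for sentence in sentences:
        length = len(sentence)
        for start in range(length):
            # end is inclusive, so a span covers end - start + 1 tokens
            for end in range(start, min(start + max_width, length)):
                spans.append((offset + start, offset + end))
        offset += length
    return spans


doc = [["MORPA", "is", "a", "parser"], ["It", "uses", "grammars"]]
candidates = enumerate_spans(doc, max_width=2)
```

With a fixed maximum width the number of candidates grows linearly in the number of words, whereas unrestricted enumeration is quadratic.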

Model Definition
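We formulate the multi-task learning setup as learning the conditional probability distribution $P(E, R, C \mid D)$. For efficient training and inference, we decompose $P(E, R, C \mid D)$ assuming spans are conditionally independent given $D$:
\[ P(E, R, C \mid D) = \prod_{i=1}^{N} P(e_i \mid D)\, P(c_i \mid D) \prod_{j=1}^{N} P(r_{ij} \mid D), \tag{1} \]
where the conditional probabilities of each random variable are independently normalized:
\[ \begin{aligned} P(e_i = e \mid D) &= \frac{\exp(\Phi_E(e, s_i))}{\sum_{e' \in L_E} \exp(\Phi_E(e', s_i))} \\ P(r_{ij} = r \mid D) &= \frac{\exp(\Phi_R(r, s_i, s_j))}{\sum_{r' \in L_R} \exp(\Phi_R(r', s_i, s_j))} \\ P(c_i = j \mid D) &= \frac{\exp(\Phi_C(s_i, s_j))}{\sum_{j' \in \{1, \dots, i-1, \epsilon\}} \exp(\Phi_C(s_i, s_{j'}))}, \end{aligned} \tag{2} \]
where $\Phi_E$ denotes the unnormalized model score for an entity type $e$ and a span $s_i$, $\Phi_R$ denotes the score for a relation type $r$ and a span pair $s_i, s_j$, and $\Phi_C$ denotes the score for a binary coreference link between $s_i$ and $s_j$. These $\Phi$ scores are further decomposed into span and pairwise span scores computed from feed-forward networks, as will be explained in Section 4.3.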
For simplicity, we omit D from the Φ functions and S from the observation.
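The independent normalization can be illustrated with a small softmax over unnormalized scores. This is a sketch under the assumption, stated later in the scoring section, that the null label always has score 0; `NULL` and `label_distribution` are hypothetical names.

```python
import math

NULL = "<null>"  # hypothetical name for the null label


def label_distribution(scores):
    """Turn unnormalized scores for the non-null labels of one span (or one
    span pair) into a probability distribution that includes the null label,
    whose score is fixed to 0."""
    all_scores = {NULL: 0.0}
    all_scores.update(scores)
    top = max(all_scores.values())  # subtract the max for numerical stability
    exp_scores = {label: math.exp(s - top) for label, s in all_scores.items()}
    total = sum(exp_scores.values())
    return {label: value / total for label, value in exp_scores.items()}


probs = label_distribution({"Task": 1.2, "Method": -0.3})
```

Because the null score is pinned at 0, a span is assigned a non-null label only when at least one label score is positive.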
Objective Given a set of all documents $\mathcal{D}$, the model loss function is defined as a weighted sum of the negative log-likelihood loss of all three tasks:
\[ -\sum_{(D, R^*, E^*, C^*) \in \mathcal{D}} \Big\{ \lambda_E \log P(E^* \mid D) + \lambda_R \log P(R^* \mid D) + \lambda_C \log P(C^* \mid D) \Big\}, \tag{3} \]
where $E^*$, $R^*$, and $C^*$ are gold structures of the entity types, relations, and coreference, respectively. The task weights $\lambda_E$, $\lambda_R$, and $\lambda_C$ are introduced as hyper-parameters to control the importance of each task.
For entity recognition and relation extraction, $P(E^* \mid D)$ and $P(R^* \mid D)$ are computed with the definition in Equation (2). For coreference resolution, we use the marginalized loss following Lee et al. (2017) since each mention can have multiple correct antecedents. Let $C_i^*$ be the set of all correct antecedents for span $i$; we have:
\[ \log P(C^* \mid D) = \sum_{i=1}^{N} \log \sum_{c \in C_i^*} P(c_i = c \mid D). \]
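The marginalized coreference loss for a single span might be computed as in the following sketch. It assumes the convention of Lee et al. (2017) that a span with no correct antecedent among the candidates has the null antecedent as its only gold answer, and that the null score is 0; the function and variable names are illustrative.

```python
import math

NULL = "<null>"  # hypothetical name for the null antecedent


def _logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def span_coref_nll(antecedent_scores, gold_antecedents):
    """Negative log of the total probability assigned to all correct
    antecedents of one span.

    antecedent_scores: dict mapping candidate antecedent index -> score.
    gold_antecedents: collection of correct antecedent indices (may be empty).
    """
    scores = dict(antecedent_scores)
    scores[NULL] = 0.0  # score of the null antecedent is fixed to 0
    # Assumption: with no reachable gold antecedent, the null one is correct.
    gold = [a for a in gold_antecedents if a in antecedent_scores] or [NULL]
    log_partition = _logsumexp(list(scores.values()))
    log_gold_mass = _logsumexp([scores[a] for a in gold])
    return log_partition - log_gold_mass
```

Summing this quantity over all spans of a document gives the coreference term of the objective, before weighting by the task weight.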

Scoring Architecture
We use feedforward neural networks (FFNNs) over shared span representations $g$ to compute a set of span and pairwise span scores. For the span scores, $\phi_e(s_i)$ measures how likely a span $s_i$ is to have entity type $e$, and $\phi_{mr}(s_i)$ and $\phi_{mc}(s_i)$ measure how likely a span $s_i$ is to be a mention in a relation or a coreference link, respectively. The pairwise scores $\phi_r(s_i, s_j)$ and $\phi_c(s_i, s_j)$ measure how likely two spans are to be associated in a relation $r$ or a coreference link, respectively. Let $g_i$ be the fixed-length vector representation for span $s_i$. For different tasks, the span scores $\phi_x(s_i)$ for $x \in \{e, mc, mr\}$ and pairwise span scores $\phi_y(s_i, s_j)$ for $y \in \{r, c\}$ are computed as follows:
\[ \phi_x(s_i) = w_x \cdot \mathrm{FFNN}_x(g_i), \qquad \phi_y(s_i, s_j) = w_y \cdot \mathrm{FFNN}_y([g_i, g_j, g_i \odot g_j]), \]
where $\odot$ is element-wise multiplication, and $\{w_x, w_y\}$ are neural network parameters to be learned.
We use these scores to compute the different $\Phi$:
\[ \begin{aligned} \Phi_E(e, s_i) &= \phi_e(s_i) \\ \Phi_R(r, s_i, s_j) &= \phi_{mr}(s_i) + \phi_{mr}(s_j) + \phi_r(s_i, s_j) \\ \Phi_C(s_i, s_j) &= \phi_{mc}(s_i) + \phi_{mc}(s_j) + \phi_c(s_i, s_j) \end{aligned} \tag{4} \]
The scores in Equation (4) are defined for entity types, relations, and antecedents that are not the null-type $\epsilon$. Scores involving the null label are set to a constant 0:
\[ \Phi_E(\epsilon, s_i) = \Phi_R(\epsilon, s_i, s_j) = \Phi_C(s_i, \epsilon) = 0. \]
We use the same span representations $g$ from Lee et al. (2017) and share them across the three tasks. We start by building bi-directional LSTMs (Hochreiter and Schmidhuber, 1997) from word, character and ELMo (Peters et al., 2018) embeddings.
For a span $s_i$, its vector representation $g_i$ is constructed by concatenating $s_i$'s left and right end points from the BiLSTM outputs, an attention-based soft "headword," and embedded span width features. Hyperparameters and other implementation details are described in Section 6.
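The construction of a span representation from its four parts can be sketched with plain Python lists. This is a simplified illustration: the per-token head scores are taken as given (in the model they would come from a learned scorer), the soft head is computed over the same token vectors as the end points, and all names are hypothetical.

```python
import math


def span_representation(token_vectors, head_scores, start, end, width_embeddings):
    """Concatenate [left end point; right end point; soft head; width feature]
    for the inclusive token span (start, end)."""
    indices = range(start, end + 1)
    # Softmax over the head scores of the tokens inside the span.
    top = max(head_scores[t] for t in indices)
    weights = [math.exp(head_scores[t] - top) for t in indices]
    total = sum(weights)
    dim = len(token_vectors[start])
    # Attention-weighted average of the token vectors: the soft "headword".
    soft_head = [
        sum(w / total * token_vectors[t][d] for w, t in zip(weights, indices))
        for d in range(dim)
    ]
    # List concatenation plays the role of vector concatenation here.
    return (token_vectors[start] + token_vectors[end]
            + soft_head + width_embeddings[end - start])


tokens = [[1.0, 0.0], [0.0, 1.0]]
widths = {0: [9.0], 1: [7.0]}  # hypothetical width embeddings, keyed by end - start
g = span_representation(tokens, [0.0, 0.0], 0, 1, widths)
```

Because every task scores the same vector, anything the coreference objective teaches the representation is available to the entity and relation scorers as well.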

Inference and Pruning
Following previous work, we use beam pruning to reduce the number of pairwise span factors from $O(n^4)$ to $O(n^2)$ at both training and test time, where $n$ is the number of words in the document. We define two separate beams: $B_C$ to prune spans for the coreference resolution task, and $B_R$ for relation extraction. The spans in the beams are sorted by their span scores $\phi_{mc}$ and $\phi_{mr}$ respectively, and the sizes of the beams are limited by $\lambda_C n$ and $\lambda_R n$. We also limit the maximum width of spans to a fixed number $W$, which further reduces the number of span factors to $O(n)$.
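A beam of this kind can be sketched as a top-k selection over span scores. The sketch assumes the beam size is the floor of the ratio times the number of words and that kept spans are returned in document order; these details, and the function name, are illustrative rather than taken from the paper.

```python
def prune_spans(spans, mention_scores, num_words, ratio, max_width):
    """Keep the highest-scoring spans for one task.

    spans: list of inclusive (start, end) token offsets.
    mention_scores: one mention score per span, in the same order.
    ratio: beam ratio, so that at most ratio * num_words spans survive.
    max_width: spans covering more tokens than this are discarded first.
    """
    beam_size = int(ratio * num_words)  # assumption: round down
    candidates = [
        (score, span)
        for span, score in zip(spans, mention_scores)
        if span[1] - span[0] + 1 <= max_width
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    kept = [span for _, span in candidates[:beam_size]]
    return sorted(kept)  # restore document order


beam = prune_spans(
    spans=[(0, 0), (0, 1), (1, 1), (0, 2)],
    mention_scores=[0.1, 0.9, 0.5, 2.0],
    num_words=5, ratio=0.4, max_width=2,
)
```

Pairwise scores then only need to be computed among the surviving spans, which is what brings the number of pairwise factors down to quadratic in the document length.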

Knowledge Graph Construction
We construct a scientific knowledge graph from a large corpus of scientific articles. The corpus includes all abstracts (110k in total) from 12 AI conference proceedings from the Semantic Scholar Corpus. Nodes in the knowledge graph correspond to scientific entities. Edges correspond to scientific relations between pairs of entities. The edges are typed according to the relation types defined in Section 3. Figure 4 shows a part of a knowledge graph created by our method. For example, Statistical Machine Translation (SMT) and grammatical error correction are nodes in the graph, and they are connected through a Used-for relation type. In order to construct the knowledge graph for the whole corpus, we first apply the SCIIE model over single documents and then integrate the entities and relations across multiple documents (Figure 3).

Extracting nodes (entities) The SCIIE model extracts entities, their relations, and coreference clusters within one document. Phrases are heuristically normalized (described in Section 6) using entities and coreference links. In particular, we link all entities that belong to the same coreference cluster to replace generic terms with any other non-generic term in the cluster. Moreover, we replace all the entities in the cluster with the entity that has the longest string. Our qualitative analysis shows that there are fewer ambiguous phrases using coreference links (Figure 5). We calculate the frequency counts of all entities that appear in the whole corpus. We assign nodes in the knowledge graph by selecting the most frequent entities (with counts > k) in the corpus, and merge in any remaining entities for which a frequent entity is a substring.
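The node-extraction steps described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: mentions are assumed to arrive as `(text, entity_type)` pairs, and ties between several frequent substrings are broken by taking the longest, which the text does not specify.

```python
from collections import Counter


def cluster_representative(cluster):
    """Pick one string for a coreference cluster: the longest mention,
    preferring non-generic mentions over generic ones."""
    non_generic = [text for text, etype in cluster if etype != "Generic"]
    pool = non_generic or [text for text, _ in cluster]
    return max(pool, key=len)


def select_nodes(entity_strings, k):
    """Return (frequent, merged): the entities with count > k, and a mapping
    from every kept entity string to the node that represents it."""
    counts = Counter(entity_strings)
    frequent = {entity for entity, count in counts.items() if count > k}
    merged = {}
    for entity in counts:
        if entity in frequent:
            merged[entity] = entity
            continue
        # Merge a rare entity into a frequent entity that it contains.
        hosts = sorted(f for f in frequent if f in entity)
        if hosts:
            merged[entity] = max(hosts, key=len)  # assumption: longest match
    return frequent, merged


cluster = [("it", "Generic"), ("MORPA", "Method")]
mentions = ["machine translation"] * 3 + ["statistical machine translation", "parsing"]
nodes, mapping = select_nodes(mentions, k=2)
```

In this toy corpus only "machine translation" passes the frequency threshold, the rarer "statistical machine translation" is folded into it, and "parsing" is dropped.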
Assigning edges (relations) A pair of entities may appear in different contexts, resulting in different relation types between those entities (Figure 6). For every pair of entities in the graph, we calculate the frequency of different relation types across the whole corpus. We assign edges between entities by selecting the most frequent relation type.
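Edge assignment by majority vote might look like the sketch below. It assumes that extracted relations are available as `(head, relation, tail)` string triples and treats each ordered pair separately; how ties and symmetric relations are handled is not stated in the text and is left out here.

```python
from collections import Counter, defaultdict


def assign_edges(triples):
    """Map each ordered entity pair to its most frequent relation type."""
    counts = defaultdict(Counter)
    for head, relation, tail in triples:
        counts[(head, tail)][relation] += 1
    return {
        pair: relation_counts.most_common(1)[0][0]
        for pair, relation_counts in counts.items()
    }


extracted = [
    ("SMT", "Used-for", "grammatical error correction"),
    ("SMT", "Used-for", "grammatical error correction"),
    ("SMT", "Compare", "grammatical error correction"),
]
edges = assign_edges(extracted)
```

Keeping the full counter per pair, rather than only the winner, would also allow the relative frequencies to be used as edge confidences.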

Experimental Setup
We evaluate our unified framework SCIIE on SCIERC and SemEval 17. The knowledge graph for scientific community analysis is built using the Semantic Scholar Corpus (110k abstracts in total).

Baselines
We compare our model with the following baselines on the SCIERC dataset:
• LSTM+CRF The state-of-the-art NER system (Lample et al., 2016), which applies a CRF on top of an LSTM for named entity tagging; the approach has also been used in scientific term extraction (Luan et al., 2017b).
• E2E Rel State-of-the-art joint entity and relation extraction system (Miwa and Bansal, 2016) that has also been used in scientific literature (Peters et al., 2017; Augenstein et al., 2017). This system uses syntactic features such as part-of-speech tagging and dependency parsing.
In the SemEval task, we compare our model SCIIE with the best reported system in the SemEval leaderboard (Peters et al., 2017), which extends E2E Rel with several in-domain features such as gazetteers extracted from existing knowledge bases and model ensembles. We also compare with the state of the art on keyphrase extraction (Luan et al., 2017b), which applies semi-supervised methods to a neural tagging model. 3

Implementation details
Our system extends the implementation and hyperparameters from Lee et al. (2017) with the following adjustments. We use a 1-layer BiLSTM with 200-dimensional hidden layers. All the FFNNs have 2 hidden layers of 150 dimensions each. We use 0.4 variational dropout (Gal and Ghahramani, 2016) for the LSTMs, 0.4 dropout for the FFNNs, and 0.5 dropout for the input embeddings. We model spans up to 8 words. For beam pruning, we use $\lambda_C = 0.3$ for coreference resolution and $\lambda_R = 0.4$ for relation extraction. For constructing the knowledge graph, we use the following heuristics to normalize the entity phrases. We replace all acronyms with their corresponding full name and normalize all the plural terms with their singular counterparts.
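For reference, the settings reported above can be collected in one place. The container below is a hypothetical convenience, not part of the released implementation; only the numeric values come from the text, and the field names and the rounding in `beam_sizes` are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SciIEConfig:
    """Hypothetical container for the hyperparameters reported in the text."""
    lstm_layers: int = 1
    lstm_hidden_size: int = 200
    ffnn_layers: int = 2
    ffnn_hidden_size: int = 150
    lstm_dropout: float = 0.4        # variational dropout on the LSTMs
    ffnn_dropout: float = 0.4
    embedding_dropout: float = 0.5   # dropout on the input embeddings
    max_span_width: int = 8          # spans of up to 8 words
    coref_beam_ratio: float = 0.3    # beam ratio for coreference resolution
    relation_beam_ratio: float = 0.4  # beam ratio for relation extraction

    def beam_sizes(self, num_words):
        """Return (coreference, relation) beam sizes for a document,
        assuming the sizes are rounded down."""
        return (int(self.coref_beam_ratio * num_words),
                int(self.relation_beam_ratio * num_words))


config = SciIEConfig()
```

For a 100-word abstract these ratios keep 30 candidate spans for coreference and 40 for relation extraction.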

Experimental Results
We evaluate SCIIE on SCIERC and SemEval 17 datasets. We provide qualitative results and human evaluation of the constructed knowledge graph. We still observe a large gap between human-level performance and a machine learning system. We invite the community to address this challenging task.

Results on SciERC
Ablations We evaluate the effect of multi-task learning in each of the three tasks defined in our dataset.
Our model outperforms all the previous models that use hand-designed features. We observe more significant improvement in span identification than keyphrase classification. This confirms the benefit of our model in enumerating spans (rather than BIO tagging in state-of-the-art systems). Moreover, we have competitive results compared to the previous state of the art in relation extraction. We observe less gain compared to the SCIERC dataset mainly because there are no coreference links, and the relation types are not comprehensive.

Knowledge Graph Analysis
We provide qualitative analysis and human evaluations on the constructed knowledge graph.
Scientific trend analysis Figure 7 shows the historical trend analysis (from 1996 to 2016) of the most popular applications of the phrase neural network, selected according to the statistics of the extracted relation triples with the 'Used-for' relation type from speech, computer vision, and NLP conference papers. We observe that, before 2000, neural networks were applied to a greater percentage of speech applications compared to the NLP and computer vision papers. In NLP, neural networks first gain popularity in language modeling and then extend to other tasks such as POS Tagging and Machine Translation. In computer vision, the application of neural networks gains popularity in object recognition earlier (around 2010) than the other two more complex tasks of object detection and image segmentation (hardest and also the latest).
Knowledge Graph Evaluation Figure 8 shows the human evaluation of the constructed knowledge graph, comparing the quality of automatically generated knowledge graphs with and without the coreference links. We randomly select 10 frequent scientific entities and extract all the relation triples that include one of the selected entities, leading to 1.5k relation triples from both systems. We ask four domain experts to annotate each of these extracted relations to define ground truth labels. Each domain expert is assigned 2 or 3 entities and all of the corresponding relations. Figure 8 shows precision/recall curves for both systems. Since it is not feasible to compute the actual recall of the systems, we compute the pseudo-recall (Zhang et al., 2015) based on the output of both systems. We observe that the knowledge graph curve with coreference linking is mostly above the curve without coreference linking. The precision of both systems is high (above 84%), but the system with coreference links has significantly higher recall.
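One way to trace a precision/pseudo-recall curve is sketched below. It assumes that pseudo-recall divides a system's correct triples by the number of correct triples in the pooled output of both systems, and that each system provides its triples ranked by confidence; both are assumptions about the evaluation, and all names are illustrative.

```python
def precision_pseudo_recall_curve(ranked_triples, judged_correct, pooled_correct):
    """Return one (precision, pseudo_recall) point per cut-off.

    ranked_triples: one system's triples, highest confidence first.
    judged_correct: set of that system's triples the annotators accepted.
    pooled_correct: set of accepted triples from the union of both systems.
    """
    points = []
    hits = 0
    for cutoff, triple in enumerate(ranked_triples, start=1):
        if triple in judged_correct:
            hits += 1
        points.append((hits / cutoff, hits / len(pooled_correct)))
    return points


curve = precision_pseudo_recall_curve(
    ranked_triples=["a", "b", "c"],
    judged_correct={"a", "c"},
    pooled_correct={"a", "c", "d", "e"},
)
```

Because the denominator is shared, the pseudo-recall values of the two systems are directly comparable even though the true number of correct relations in the corpus is unknown.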

Conclusion
In this paper, we create a new dataset and develop a multi-task model for identifying entities, relations, and coreference clusters in scientific articles. By sharing span representations and leveraging cross-sentence information, our multi-task setup effectively improves performance across all tasks. Moreover, we show that our multi-task model is better at predicting span boundaries and outperforms previous state-of-the-art scientific IE systems on entity and relation extraction, without using any hand-engineered features or pipeline processing. Using our model, we are able to automatically organize the extracted information from a large collection of scientific articles into a knowledge graph. Our analysis shows the importance of coreference links in making a dense, useful graph. We still observe a large gap between the performance of our model and human performance, confirming the challenges of scientific IE. Future work includes improving the performance using semi-supervised techniques and providing in-domain features. We also plan to extend our multi-task framework to information extraction tasks in other domains.

Figure 1: Example annotation: phrases that refer to the same scientific concept are annotated into the same coreference cluster, such as MORphological PArser MORPA, it and MORPA (marked as red).
Figure 3: Knowledge graph construction process.

Figure 4: A part of an automatically constructed scientific knowledge graph with the most frequent neighbors of the scientific term statistical machine translation (SMT) on the graph. For simplicity we denote Used-for (Reverse) as Uses, Evaluated-for (Reverse) as Evaluated-by, and replace common terms with their acronyms. The original graph and more examples are given in Figure 10 in Appendix B.

Figure 5: Frequency of detected entities with and without coreference resolution: using coreference reduces the frequency of the generic phrase detection while significantly increasing the frequency of specific phrases. Linking entities through coreference helps disambiguate phrases when generating the knowledge graph.

Figure 7: Historical trend for top applications of the keyphrase neural network in NLP, speech, and CV conference papers we collected. The y-axis indicates the ratio of papers that use neural network in the task to the number of papers that are about the task.

Figure 8: Precision/pseudo-recall curves for human evaluation by varying cut-off thresholds. The AUC is 0.751 with coreference, and 0.695 without.

Figure 9: Annotation example 1 from ACL
Figure 10: An example of our automatically generated knowledge graph centered on statistical machine translation. This is the original figure of Figure 4.

Table 1: Dataset statistics for our dataset SCIERC and two previous datasets on scientific information extraction. All datasets annotate 500 documents.
stract). SCIERC extends these datasets by adding more relation types and coreference clusters, which allows representing cross-sentence relations, and removing annotation constraints. Table 1 gives a comparison of statistics among the three datasets. In addition, SCIERC aims at including broader coverage of general AI communities.

Figure 2: Overview of the SCIIE multi-task framework. Panel labels: Sentences, BiLSTM outputs, Span Representations (+Span Features), Entity Recognition, Coreference Resolution, Relation Extraction.
Table 2 compares the results of our model with baselines on the three tasks: entity recognition (Table 2a), relation extraction (Table 2b), and coreference resolution (Table 2c). As evidenced by the table, our unified multi-task setup
3 We compare with the inductive setting results.

Table 2: Comparison with previous systems on the development and test set for our three tasks. For coreference resolution, we report the average P/R/F1 of MUC, $B^3$, and $\mathrm{CEAF}_{\phi_4}$ scores.

Table 3
Results on SemEval 17
Table 4 compares the results of our model with the state of the art on the SemEval 17 dataset for tasks of span identification, keyphrase extraction and relation extraction as well as the overall score. Span identification aims at identifying spans of entities. Keyphrase classification and relation extraction have the same setting as the entity and relation extraction in SCIERC.

Table 4: Results for scientific keyphrase extraction and relation extraction on SemEval 2017 Task 10, comparing with previous best systems.