Entity-Relation Extraction as Multi-Turn Question Answering

In this paper, we propose a new paradigm for the task of entity-relation extraction. We cast the task as a multi-turn question answering problem, i.e., the extraction of entities and relations is transformed into the task of identifying answer spans from the context. This multi-turn QA formalization comes with several key advantages: firstly, the question query encodes important information for the entity/relation class we want to identify; secondly, QA provides a natural way of jointly modeling entities and relations; and thirdly, it allows us to exploit well-developed machine reading comprehension (MRC) models. Experiments on the ACE and the CoNLL04 corpora demonstrate that the proposed paradigm significantly outperforms previous best models. We obtain state-of-the-art results on all of the ACE04, ACE05 and CoNLL04 datasets, increasing the SOTA results on the three datasets to 49.6 (+1.2), 60.3 (+0.7) and 69.2 (+1.4), respectively. Additionally, we construct and will release a newly developed dataset, RESUME, which requires multi-step reasoning to construct entity dependencies, as opposed to the single-step dependency extraction of the triplet extraction in previous datasets. The proposed multi-turn QA model also achieves the best performance on the RESUME dataset.


Introduction
Identifying entities and their relations is a prerequisite for extracting structured knowledge from unstructured raw text, a task that has received growing interest in recent years. Given a chunk of natural language text, the goal of entity-relation extraction is to transform it into a structured knowledge base. For example, given the following text:
In 2002, Musk founded SpaceX, an aerospace manufacturer and space transport services company, of which he is CEO and lead designer. He helped fund Tesla, Inc., an electric vehicle and solar panel manufacturer, in 2003, and became its CEO and product architect. In 2006, he inspired the creation of SolarCity, a solar energy services company, and operates as its chairman. In 2016, he co-founded Neuralink, a neurotechnology company focused on developing brain-computer interfaces, and is its CEO. In 2016, Musk founded The Boring Company, an infrastructure and tunnel-construction company.
We need to extract four types of entities, i.e., Person, Company, Time and Position, and three types of relations: FOUND, FOUNDING-TIME and SERVING-ROLE. The text is to be transformed into the structured table shown in Table 1.
Most existing models approach this task by extracting a list of triples from the text, i.e., REL($e_1$, $e_2$), which denotes that relation REL holds between entity $e_1$ and entity $e_2$. Previous models fall into two major categories: the pipelined approach, which first uses tagging models to identify entities and then uses relation extraction models to identify the relation between each entity pair; and the joint approach, which combines the entity model and the relation model through different strategies, such as constraints or parameter sharing.
There are several key issues with current approaches, in terms of both the task formalization and the algorithm. At the formalization level, the REL($e_1$, $e_2$) triplet structure is not enough to fully express the data structure behind the text. Taking the Musk case as an example, there is a hierarchical dependency between the tags: the extraction of [...]
In this paper, we propose a new paradigm to handle the task of entity-relation extraction. We formalize the task as a multi-turn question answering task: each entity type and relation type is characterized by a question answering template, and entities and relations are extracted by answering template questions. Answers are text spans, extracted using the now standard machine reading comprehension (MRC) framework: predicting answer spans given a context (Seo et al., 2016; Wang and Jiang, 2016; Xiong et al., 2017; Wang et al., 2016b). To extract structured data like Table 1, the model needs to answer the following questions sequentially:
• Q: Who is mentioned in the text? A: Musk; [...] CEO.
Treating the entity-relation extraction task as a multi-turn QA task has the following key advantages: (1) the multi-turn QA setting provides an elegant way to capture the hierarchical dependency of tags. As the multi-turn QA proceeds, we progressively obtain the entities we need for the next turn. This is closely akin to the multi-turn slot filling dialogue system (Williams and Young, 2005; Lemon et al., 2006); (2) the question query encodes important prior information for the entity/relation class we want to identify. For example, the information in the query for the PER tagging class, "who is mentioned in the text", helps the model to extract relevant named entities. By contrast, in traditional non-QA entity-relation extraction models, tagging classes or relation classes are merely indices (class1, class2, ...) and do not encode any information about the class. This informativeness can potentially solve issues that existing relation extraction models fail to solve, such as distantly separated entity pairs, relation span overlap, etc.; (3) the QA framework provides a natural way to simultaneously extract entities and relations: most MRC models support outputting a special NONE token, indicating that there is no answer to the question. Through this, the original two tasks, entity extraction and relation extraction, can be merged into a single QA task: a relation holds if the returned answer to the question corresponding to that relation is not NONE, and this returned answer is the entity that we wish to extract.
Footnote 3: e.g., in text A B C D, (A, C) is a pair and (B, D) is a pair.
In this paper, we show that the proposed paradigm, which transforms the entity-relation extraction task into a multi-turn QA task, introduces a significant performance boost over existing systems. It achieves state-of-the-art (SOTA) performance on the ACE and the CoNLL04 datasets. The tasks on these datasets can be formalized as triplet extraction problems, in which two turns of QA suffice. We thus build a more complicated and more difficult dataset called RESUME, which requires extracting biographical information of individuals from raw texts. The construction of a structured knowledge base from RESUME requires four or five turns of QA. We also show that this multi-turn QA setting can easily integrate reinforcement learning (just as in multi-turn dialog systems) to gain an additional performance boost.

Extracting Entities and Relations
Many earlier entity-relation extraction systems are pipelined (Zelenko et al., 2003; Miwa et al., 2009; Chan and Roth, 2011; Lin et al., 2016): an entity extraction model first identifies entities of interest and a relation extraction model then constructs relations between the extracted entities. Although pipelined systems have the flexibility of integrating different data sources and learning algorithms, they suffer significantly from error propagation.
To tackle this issue, joint learning models have been proposed. Earlier joint learning approaches connect the two models through various dependencies, including constraints solved by integer linear programming (Yang and Cardie, 2013; Roth and Yih, 2007), card-pyramid parsing (Kate and Mooney, 2010), and global probabilistic graphical models (Yu and Lam, 2010; Singh et al., 2013). In later studies, Li and Ji (2014) extract entity mentions and relations using a structured perceptron with efficient beam search, which is significantly more efficient and less time-consuming than constraint-based approaches. Miwa and Sasaki (2014), Gupta et al. (2016) and Zhang et al. (2017) proposed the table-filling approach, which provides an opportunity to incorporate more sophisticated features and algorithms into the model, such as search orders in decoding and global features. Neural network models have been widely used in the literature as well. Miwa and Bansal (2016) introduced an end-to-end approach that extracts entities and their relations using neural network models with shared parameters, i.e., extracting entities using a neural tagging model and extracting relations using a neural multi-class classification model based on tree LSTMs (Tai et al., 2015). Wang et al. (2016a) extract relations using multi-level attention CNNs. Zeng et al. (2018) proposed a new framework that uses sequence-to-sequence models to generate entity-relation triples, naturally combining entity detection and relation detection.
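Another way to bind the entity and the relation extraction models is to use reinforcement learning or Minimum Risk Training, in which the training signals are given based on the joint decision by the two models. Sun et al. (2018) optimized a global loss function to jointly train the two models under the framework of Minimum Risk Training. Takanobu et al. (2018) used hierarchical reinforcement learning to extract entities and relations in a hierarchical manner.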

Machine Reading Comprehension
Mainstream MRC models (Seo et al., 2016; Wang and Jiang, 2016; Xiong et al., 2017; Wang et al., 2016b) extract text spans in passages given queries. Text span extraction can be simplified to two multi-class classification tasks, i.e., predicting the starting and the ending positions of the answer. A similar strategy can be extended to multi-passage MRC (Joshi et al., 2017; Dunn et al., 2017), where the answer needs to be selected from multiple passages. Multi-passage MRC tasks can be easily simplified to single-passage MRC tasks by concatenating passages (Shen et al., 2017; Wang et al., 2017b). Wang et al. (2017a) first rank the passages and then run single-passage MRC on the selected passage. Tan et al. (2017) train the passage ranking model jointly with the reading comprehension model. Pretraining methods like BERT (Devlin et al., 2018) or ELMo (Peters et al., 2018) have proved to be extremely helpful in MRC tasks.
There has been a tendency to cast non-QA NLP tasks as QA tasks (McCann et al., 2018). Our work is highly inspired by Levy et al. (2017). Levy et al. (2017) and McCann et al. (2018) focus on identifying the relation between two predefined entities, and the authors formalize the task of relation extraction as a single-turn QA task. In the current paper we study a more complicated scenario, where hierarchical tag dependency needs to be modeled and a single-turn QA approach no longer suffices. We show that our multi-turn QA method is able to solve this challenge and obtain new state-of-the-art results.
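The start/end formulation mentioned above can be illustrated with a minimal sketch. It assumes that a model has already produced one start score and one end score per context token; the function name `best_span`, the `max_len` cap and the toy scores are illustrative assumptions and do not come from any of the cited systems.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 10) -> Tuple[int, int]:
    """Return the (start, end) pair, start <= end, with the highest summed score."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider end positions at or after the start, within max_len tokens.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


# Toy scores (made up for illustration).
tokens = ["Musk", "founded", "SpaceX", "in", "2002"]
start = [0.1, 0.0, 2.5, 0.0, 0.3]
end = [0.0, 0.1, 2.0, 0.0, 0.9]
i, j = best_span(start, end)
print(tokens[i:j + 1])
```

Note that this decoding returns exactly one span per question, which is the single-answer limitation discussed later in the paper.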

RESUME: A newly constructed dataset
The ACE and the CoNLL04 datasets are intended for triplet extraction, and two turns of QA are sufficient to extract a triplet (one turn for head-entities and another for the joint extraction of tail-entities and relations). These datasets do not involve hierarchical entity relations like those in our earlier Musk example, which are prevalent in real-life applications.
Therefore, we construct a new dataset called RESUME. We extract 841 paragraphs from chapters describing management teams in IPO prospectuses. Each paragraph describes part of the work history of an executive. We wish to extract the structured data from the resume.
We identify four types of entities: Person (the name of the executive), Company (the company that the executive works/worked for), Position (the position that he/she holds/held) and Time (the time period that the executive occupies/occupied that position).It is worth noting that one person can work for different companies during different periods of time and that one person can hold different positions in different periods of time for the same company.
We recruited crowdworkers to fill the slots in Table 1. We asked them to spend 5 minutes on each passage and paid them $1 per sentence. Each passage is labeled by two different crowdworkers. If the labels from the two annotators disagree, one or more additional annotators are asked to label the sentence and a majority vote is taken as the final decision. Since the wording of the text is usually very explicit and formal, the inter-annotator agreement is very high, reaching 93.5% over all slots. Some statistics of the dataset are shown in Table 2. We randomly split the dataset into training (80%), validation (10%) and test (10%) sets.
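As a small illustration of the random 80/10/10 split described above, the sketch below shuffles item indices with a fixed seed and slices them. The function name `split_dataset`, the seed and the rounding rule (integer division, with the remainder going to the test set) are assumptions; the paper does not state how the split was rounded.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle items and split them into 80% train, 10% validation, rest test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10   # assumed rounding: floor
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder goes to the test set
    return train, val, test


# 841 paragraphs, identified here only by their index.
train, val, test = split_dataset(range(841))
print(len(train), len(val), len(test))
```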

System Overview
The overview of the algorithm is shown in Algorithm 1. The algorithm contains two stages:
(1) The head-entity extraction stage (lines 4-9): each episode of multi-turn QA is triggered by an entity. To extract this starting entity, we transform each entity type into a question using EntityQuesTemplates (line 4), and the entity e is extracted by answering the question (line 5). If the system outputs the special NONE token, then s does not contain any entity of that type.
(2) The relation and tail-entity extraction stage (lines 10-24): ChainOfRelTemplates defines a chain of relations, the order of which we need to follow to run multi-turn QA. The reason is that the extraction of some entities depends on the extraction of others. For example, in the RESUME dataset, the position held by an executive depends on the company he works for. Likewise, the extraction of the Time entity depends on the extraction of both the Company and the Position. The extraction order is manually pre-defined. ChainOfRelTemplates also defines the template for each relation. Each template contains some slots to be filled. To generate a question (line 14), we insert the previously extracted entity or entities into the slot or slots of a template. The relation REL and the tail-entity e are jointly extracted by answering the generated question (line 15). A returned NONE token indicates that there is no answer in the given sentence.
Footnote 4: https://github.com/tticoin/LSTM-ER/
It is worth noting that the entities extracted in the head-entity extraction stage may not all be head-entities. In the subsequent relation and tail-entity extraction stage, the entities extracted in the first stage are initially assumed to be head-entities and are fed to the templates to generate questions. If an entity e extracted in the first stage is indeed the head-entity of a relation, then the QA model will extract the tail-entity by answering the corresponding question. Otherwise, the answer will be NONE and is thus ignored.
For the ACE04, ACE05 and CoNLL04 datasets, only two QA turns are needed. ChainOfRelTemplates thus only contains chains of length 1. For RESUME, we need to extract 4 entities, so ChainOfRelTemplates contains chains of length 3.
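The two-stage procedure described in this section can be sketched as a short loop. This is a simplified illustration rather than a reproduction of Algorithm 1: the function `multi_turn_extract`, the `answer` callable (a stand-in for the MRC model that returns an empty list for NONE), the question wordings and the toy lookup table are all hypothetical.

```python
from typing import Callable, Dict, List


def multi_turn_extract(
    sentence: str,
    entity_question: str,
    chain: List[str],
    answer: Callable[[str, str], List[str]],
) -> List[List[str]]:
    """Extract head-entities, then follow a chain of relation templates.

    Templates refer to previously extracted entities as {e1}, {e2}, ...
    """
    # Stage 1: head-entity extraction (an empty answer list plays the role of NONE).
    rows = [[e] for e in answer(entity_question, sentence)]
    # Stage 2: one QA turn per template in the chain.
    for template in chain:
        next_rows = []
        for row in rows:
            slots = {f"e{i + 1}": e for i, e in enumerate(row)}
            question = template.format(**slots)
            for tail in answer(question, sentence):
                next_rows.append(row + [tail])
        rows = next_rows  # rows whose question got no answer are dropped
    return rows


# Toy stand-in for the QA model: a lookup table keyed by question.
toy_answers: Dict[str, List[str]] = {
    "Who is mentioned in the text?": ["Musk"],
    "Which company did Musk work for?": ["SpaceX", "Tesla"],
    "What was Musk's position in SpaceX?": ["CEO"],
    "What was Musk's position in Tesla?": ["CEO"],
}


def toy_qa(question: str, sentence: str) -> List[str]:
    return toy_answers.get(question, [])


chain = ["Which company did {e1} work for?", "What was {e1}'s position in {e2}?"]
rows = multi_turn_extract("(text omitted)", "Who is mentioned in the text?", chain, toy_qa)
print(rows)
```

With a chain of length 1 this reduces to the two-turn setting used for triplet extraction; longer chains correspond to the hierarchical setting of RESUME.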

Generating Questions using Templates
Each entity type is associated with a type-specific question generated by the templates, as shown in Table 3. There are two ways to generate questions based on templates: natural language questions or pseudo-questions. A pseudo-question is not necessarily grammatical. For example, the natural language question for the Facility type could be "Which facility is mentioned in the text", and the pseudo-question could just be "entity: facility".
At the relation and tail-entity joint extraction stage, a question is generated by combining a relation-specific template with the extracted head-entity. The question can be either a natural language question or a pseudo-question. Examples are shown in Table 4 and Table 5.
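The two question styles can be sketched with plain string templates. The helper names `entity_question` and `relation_question` and the `{e1}`-style slot syntax are assumptions made for this illustration; the wording of the entity questions follows the Facility example above, and the relation template follows the RESUME example that appears with the table captions.

```python
def entity_question(entity_type: str, natural: bool = True) -> str:
    """Build a natural language question or a pseudo-question for an entity type."""
    if natural:
        return f"Which {entity_type} is mentioned in the text?"
    return f"entity: {entity_type}"


def relation_question(template: str, *entities: str) -> str:
    """Fill the slots {e1}, {e2}, ... of a relation template with extracted entities."""
    slots = {f"e{i + 1}": e for i, e in enumerate(entities)}
    return template.format(**slots)


TIME_TEMPLATE = "During which period did {e1} work for {e2} as {e3}?"

print(entity_question("facility"))
print(entity_question("facility", natural=False))
print(relation_question(TIME_TEMPLATE, "Musk", "SpaceX", "CEO"))
```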

Extracting Answer Spans via MRC
Various MRC models have been proposed, such as BiDAF (Seo et al., 2016) and QANet (Yu et al., 2018). In the standard MRC setting, given a question $Q = \{q_1, q_2, \ldots, q_{N_q}\}$, where $N_q$ denotes the number of words in $Q$, and a context $C = \{c_1, c_2, \ldots, c_{N_c}\}$, where $N_c$ denotes the number of words in $C$, we need to predict the answer span. For the QA framework, we use BERT (Devlin et al., 2018) as the backbone. BERT performs bidirectional language model pretraining on large-scale datasets using transformers (Vaswani et al., 2017) and achieves SOTA results on MRC datasets like SQuAD (Rajpurkar et al., 2016). To align with the BERT framework, the question $Q$ and the context $C$ are combined by concatenating the list [CLS, Q, SEP, C, SEP], where CLS and SEP are special tokens, Q is the tokenized question and C is the context. The representation of each context token is obtained using multi-layer transformers.
Traditional MRC models (Wang and Jiang, 2016; Xiong et al., 2017) predict the starting and ending indices by applying two softmax layers to the context tokens. This softmax-based span extraction strategy is suited only to single-answer extraction tasks and not to our task, since one sentence or passage in our setting might contain multiple answers.
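The [CLS, Q, SEP, C, SEP] layout described above can be sketched as follows. Whitespace tokenization is a simplifying assumption (BERT itself uses WordPiece sub-word tokenization), and the function name `build_input` and the returned `context_start` offset are illustrative choices rather than part of any library API.

```python
from typing import List, Tuple


def build_input(question: str, context: str) -> Tuple[List[str], List[int], int]:
    """Concatenate question and context in the [CLS] Q [SEP] C [SEP] layout.

    Returns the token list, the segment ids (0 for the question part,
    1 for the context part) and the index of the first context token.
    """
    q_tokens = question.split()   # assumption: whitespace tokens, not WordPiece
    c_tokens = context.split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + c_tokens + ["[SEP]"]
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(c_tokens) + 1)
    context_start = len(q_tokens) + 2
    return tokens, segment_ids, context_start


tokens, segment_ids, context_start = build_input(
    "who is mentioned in the text ?", "Musk founded SpaceX")
print(tokens)
print(tokens[context_start:-1])  # the context tokens that receive tag predictions
```

Only the positions from `context_start` up to the final separator correspond to context tokens, so only those representations would be passed to the answer extraction layer.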
To tackle this issue, we formalize the task as a query-based tagging problem (Lafferty et al., 2001; Huang et al., 2015; Ma and Hovy, 2016). Specifically, we predict a BMEO (beginning, inside, ending and outside) label for each token in the context given the query. The representation of each word is fed to a softmax layer to output a BMEO label. One can think of this as transforming two $N$-class classification tasks of predicting the starting and the ending indices (where $N$ denotes the length of the sentence) into $N$ 5-class classification tasks.
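To show why per-token tagging allows several answers per question, here is a minimal decoding sketch that turns a BMEO tag sequence into spans. The function name `decode_bmeo` and the handling of malformed sequences (they are skipped) are assumptions; the handling of single-token answers is not specified in the text above and is left out here.

```python
from typing import List, Tuple


def decode_bmeo(tags: List[str]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) spans for every well-formed B M* E run."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i                      # open a new candidate span
        elif tag == "E":
            if start is not None:
                spans.append((start, i))   # close the open span
            start = None
        elif tag == "M":
            continue                       # inside a span, or a stray M that is ignored
        else:
            start = None                   # "O" (or anything else) breaks an open span
    return spans


tokens = ["Elon", "Musk", "and", "Jeffrey", "Brian", "Straubel", "."]
tags = ["B", "E", "O", "B", "M", "E", "O"]
for s, e in decode_bmeo(tags):
    print(" ".join(tokens[s:e + 1]))
```

Unlike the two-softmax start/end decoder, this returns as many spans as the tag sequence contains, including none at all when every token is tagged O.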
Training and Test
At training time, we jointly train the objectives for the two stages:
$$L = (1 - \lambda)\, L(\text{head-entity}) + \lambda\, L(\text{tail-entity, rel}) \qquad (1)$$
where $\lambda \in [0, 1]$ is the parameter controlling the trade-off between the two objectives. Its value is tuned on the validation set. Both models are initialized using the standard BERT model and they share parameters during training. At test time, head-entities and tail-entities are extracted separately based on the two objectives.
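The weighted objective in Equation 1 amounts to a convex combination of the two stage losses. The sketch below uses plain floats in place of the actual loss tensors; the function name `joint_loss` and the example values are assumptions for illustration only.

```python
def joint_loss(head_loss: float, tail_rel_loss: float, lam: float) -> float:
    """Combine the two stage losses as (1 - lam) * head + lam * tail/relation."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * head_loss + lam * tail_rel_loss


# lam = 0 trains only head-entity extraction, lam = 1 only the tail-entity/relation turn.
print(joint_loss(2.0, 4.0, 0.25))
```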

Reinforcement Learning
Note that in our setting, the answer extracted in one turn not only affects its own accuracy, but also determines how the questions for the downstream turns are constructed, which in turn affects later accuracies. We use reinforcement learning to tackle this issue; it has proved successful in multi-turn dialogue generation (Mrkšić et al., 2015; Li et al., 2017; Wen et al., 2016), a task that poses the same challenge as ours.
Action and Policy
In an RL setting, we need to define the action and the policy. In the multi-turn QA setting, the action is selecting a text span in each turn. The policy defines the probability of selecting a certain span given the question and the context. As the algorithm relies on the BMEO tagging output, the probability of selecting a certain span $\{w_1, w_2, \ldots, w_n\}$ is the joint probability of $w_1$ being assigned to B (beginning), $w_2, \ldots, w_{n-1}$ being assigned to M (inside) and $w_n$ being assigned to E (end), written as follows:
$$p(y(w_1, \ldots, w_n) = \text{answer} \mid \text{question}, s) = p(w_1 = \text{B}) \times p(w_n = \text{E}) \prod_{i \in [2, n-1]} p(w_i = \text{M}) \qquad (2)$$
Reward
For a given sentence $s$, we use the number of correctly retrieved triples as the reward. We use the REINFORCE algorithm (Williams, 1992), a policy gradient method, to find the optimal policy, which maximizes the expected reward $E_\pi[R(w)]$. The expectation is approximated by sampling from the policy $\pi$, and the gradient is computed using the likelihood ratio:
$$\nabla E(\theta) \approx [R(w) - b]\, \nabla \log \pi(y(w) \mid \text{question}, s) \qquad (3)$$
where $b$ denotes a baseline value. For each turn in the multi-turn QA setting, getting an answer correct leads to a reward of +1. The final reward is the cumulative reward over all turns. The baseline value is set to the average of all previous rewards. We do not initialize the policy networks from scratch, but use the pre-trained head-entity and tail-entity extraction model described in the previous section. We also use the experience replay strategy (Mnih et al., 2015): for each batch, half of the examples are simulated and the other half are randomly selected from previously generated examples.
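The span probability and the baseline-corrected reward described above can be sketched in a few lines. This is an illustration under stated assumptions, not the training code: per-token tag probabilities are given as plain dictionaries, the span has at least two tokens, and the names `span_log_prob`, `RunningBaseline` and `advantage` are hypothetical. In an actual implementation, the returned log-probability would be a differentiable quantity whose gradient is scaled by the advantage.

```python
import math
from typing import Dict, List


def span_log_prob(tag_probs: List[Dict[str, float]], start: int, end: int) -> float:
    """Log-probability that tokens start..end (inclusive, end > start) form a span:
    first token tagged B, last token tagged E, tokens in between tagged M."""
    log_p = math.log(tag_probs[start]["B"]) + math.log(tag_probs[end]["E"])
    for i in range(start + 1, end):
        log_p += math.log(tag_probs[i]["M"])
    return log_p


class RunningBaseline:
    """Baseline b: the average of all previously observed rewards (0.0 before any)."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def value(self) -> float:
        return self.total / self.count if self.count else 0.0

    def update(self, reward: float) -> None:
        self.total += reward
        self.count += 1


def advantage(reward: float, baseline: RunningBaseline) -> float:
    """The factor (R - b) that scales the gradient of the log-probability."""
    return reward - baseline.value()


# Toy tag distributions for a three-token span (values made up for illustration).
probs = [{"B": 0.5, "M": 0.1, "E": 0.1, "O": 0.3},
         {"B": 0.1, "M": 0.5, "E": 0.1, "O": 0.3},
         {"B": 0.1, "M": 0.1, "E": 0.8, "O": 0.0}]
print(math.exp(span_log_prob(probs, 0, 2)))

baseline = RunningBaseline()
for past_reward in (2.0, 0.0):   # cumulative rewards of two earlier episodes
    baseline.update(past_reward)
print(advantage(3.0, baseline))
```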
For the RESUME dataset, we use curriculum learning (Bengio et al., 2009), i.e., we gradually increase the number of turns from 2 to 4 during training. For ACE04, ACE05 and CoNLL04, no curriculum learning is needed since there are only two turns.

Experimental Results

Results on RESUME
Answers are extracted in the order of Person (first turn), Company (second turn), Position (third turn) and Time (fourth turn), and the extraction of each answer depends on those extracted before it.
For baselines, we first implement a joint model in which entity extraction and relation extraction are trained together (denoted by tagging+relation). As in Zheng et al. (2017), entities are extracted using BERT tagging models, and relations are extracted by applying a CNN to the representations output by the BERT transformers.
Existing baselines that involve entity and relation identification stages (either pipelined or joint) are well suited for triplet extraction, but are not really tailored to our setting because, in the third and fourth turns, we need more information to decide the relation than just the two entities. For instance, to extract Position, we need both Person and Company, and to extract Time, we need Person, Company and Position. This is akin to a dependency parsing task, but at the tag level rather than the word level (Dozat and Manning, 2016; Chen and Manning, 2014). We thus propose the following baseline, which modifies the previous entity+relation strategy to entity+dependency, denoted by tagging+dependency. We use the BERT tagging model to assign tagging labels to each word, and modify the current SOTA dependency parsing model, Biaffine (Dozat and Manning, 2016), to construct dependencies between tags. The Biaffine dependency model and the entity extraction model are jointly trained.
Results are presented in Table 6. As can be seen, the tagging+dependency model outperforms the tagging+relation model. The proposed multi-turn QA model performs the best, with RL adding an additional performance boost. In particular, for Person extraction, which only requires single-turn QA, the multi-turn QA+RL model performs the same as the multi-turn QA model. This is also the case for tagging+relation and tagging+dependency.

Results on ACE04, ACE05 and CoNLL04
For ACE04, ACE05 and CoNLL04, only two turns of QA are required.

Effect of MRC
Compared with the sequence labeling approach that most previous work used, the advantage of the QA formalization is that the query encodes additional information which potentially helps the extraction model. It is interesting to see how much benefit the QA formalization introduces.
We benchmark the QA model (denoted by BERT QA) against the sequence labeling model (denoted by BERT tagging) on the entity extraction task. To enable an apples-to-apples comparison, both models use BERT as the backbone and output BMES labels for each token, with the only difference being whether a query question is presented. We fine-tune both models on the named entity recognition task with different datasets, which amounts to the entity extraction stage of our task. Results are given as follows.
As can be seen, the entity extraction model significantly benefits from the QA formalization: the BERT QA model outperforms the BERT sequence labeling model in F1 score by +1.2 on RESUME, +1.5 on ACE04, +0.6 on ACE05 and +0.9 on CoNLL04.

Effect of Question Generation Strategy
In this subsection, we compare the effects of natural language questions and pseudo-questions. Results are shown in Table 8. We can see that natural language questions lead to a strict F1 improvement across all datasets. This is because natural language questions provide more fine-grained semantic information and can help entity/relation extraction. By contrast, the pseudo-questions provide very coarse-grained, ambiguous and implicit hints of entity and relation types, which might even confuse the model.

Case Study
Table 9 compares outputs from the proposed multi-turn QA model (without RL) with those of the previous SOTA MRT model (Sun et al., 2018). In the first example, MRT is not able to identify the relation between john scottsdale and iraq because the two entities are too far apart, but our proposed QA model is able to handle this issue. In the second example, the sentence contains two pairs of the same relation. The MRT model has a hard time handling this situation and is unable to locate the ship entity and the associated relation, whereas the multi-turn QA model is able to.

Conclusion
In this paper, we propose a multi-turn QA paradigm for the task of entity-relation extraction. We achieve state-of-the-art results on three benchmark datasets. We also construct a new entity-relation extraction dataset that requires hierarchical relation reasoning, on which the proposed model performs best.

Table 1: An illustration of an extracted structural table.
It kept the PER-SOC, ART and GPE-AFF categories from ACE04 but split PHYS into PHYS and a new relation category PART-WHOLE. It also deleted DISC and merged EMP-ORG and OTHER-AFF into a new category EMP-ORG. As for CoNLL04, it defines four entity types (LOC, ORG, PER and OTHERS) and five relation categories (LOCATED IN, WORK FOR, ORGBASED IN, LIVE IN and KILL).

Table 2: Statistics for the RESUME dataset.

Table 3: Question templates for different entity types of ACE.

Table 4: Some of the question templates for different relation types in ACE.
Q: During which period did $e_1$ work for $e_2$ as $e_3$? A: $e_4$

Table 5: Question templates for the RESUME dataset.

Table 6: Results for different models on the RESUME dataset.

Table 7: Results of different models on the ACE04, ACE05 and CoNLL04 test sets.

Table 8: Comparison of the effect of natural language questions with pseudo-questions.