Knowledge-guided Open Attribute Value Extraction with Reinforcement Learning

Open attribute value extraction for emerging entities is an important but challenging task. Many previous works formulate the problem as a \textit{question-answering} (QA) task. While collections of articles from web corpora provide updated information about emerging entities, the retrieved texts can be noisy and irrelevant, leading to inaccurate answers. Effectively filtering out noisy articles as well as bad answers is the key to improving extraction accuracy. A knowledge graph (KG), which contains rich, well-organized information about entities, provides a good resource to address this challenge. In this work, we propose a knowledge-guided reinforcement learning (RL) framework for open attribute value extraction. Informed by relevant knowledge in the KG, we train a deep Q-network to sequentially compare extracted answers and improve extraction accuracy. The proposed framework is applicable to different information extraction systems. Our experimental results show that our method outperforms the baselines by 16.5 - 27.8\%.


Introduction
Numerous entities are emerging every day. The attributes of these entities are often noisy, incomplete, or even missing. In the field of electronic commerce, target attributes (e.g., brand, flavor, smell) of new products are often missing (Zheng et al., 2018). In medical analysis, attributes like the transmission, genetics, and origins of a novel virus are often unknown. Even in DBpedia, a well-constructed and large-scale knowledge base extracted from Wikipedia, half of the entities contain fewer than 5 relationships (Shi and Weninger, 2018). A method that is capable of supplementing reliable attribute values for emerging entities can therefore be highly useful in many applications.

* Ye Liu and Sheng Zhang contributed equally. † Rui Song and Yanghua Xiao are corresponding authors. Yanghua Xiao was supported by Shanghai Science and Technology Innovation Action Plan (No. 19511120400).
Although information extraction methods have been extensively studied, the task of open attribute value extraction remains challenging. First, emerging entities may have new attribute values that are absent from the existing KG. Under such circumstances, prediction methods built on the closed-world assumption, as well as methods that cannot utilize external information, are not well suited due to their limited recall. Second, while a web corpus can serve as a good resource for relatively updated and relevant articles about a large variety of emerging entities, the articles retrieved from it can be noisy and/or irrelevant, which in turn limits precision. Finally, even when the articles are relevant, the extracted answers might still be inaccurate due to the error-prone information extraction model.
To effectively filter out noisy answers, obtained either from irrelevant articles or from errors incurred by the information extraction system, we pose the following two questions. First, how many articles should we collect from the enormous web corpus? Second, how do we select the most reliable value out of the pool of all possible answers extracted from the articles?
There is no common answer to the first question that works for all triplets, because of the varying degrees of difficulty in finding the correct attribute values. The decision of when to stop querying more external articles needs to be made after successive evaluations of the candidate answers. Thus the decision-making process is inherently sequential.
Reinforcement learning (RL) is a commonly adopted method for sequential decision problems and has been widely studied in the fields of robotics and games (Sutton et al., 1998). However, there is little research on open attribute value extraction with RL. One existing RL-based method for value extraction was proposed by Narasimhan et al. (2016). In their work, an RL framework is designed to improve the accuracy of event-related value extraction by acquiring and incorporating external evidence. However, their approach requires a great amount of context information about the specific event of interest during the training process. It is not trivial to extend their framework to open attribute value extraction, because we would need to collect context words and train a new model with annotated data for each emerging attribute. Therefore, their framework cannot be generalized to the open attribute value extraction task, where various entities and attributes are involved.

Figure 1: Illustration of the overall process. The inputs are pairs of entities and attributes. Relevant articles are retrieved via search engines. The articles, together with the KG, are fed into the RL agent to inform the selection between candidate answers and the stopping decisions. When the RL agent decides to stop, it outputs the best extracted answer.
While using context words to construct the states in RL is not suitable for our task, we instead leverage the rich, well-organized information in the KG, which is not only informative but also generalizable. Such information can be leveraged in answer comparisons, which addresses our second question. For example, to fill the incomplete triplet <iPhone 11, display resolution, ?>, we may find from the KG that the attribute value of "display resolution" for an entity under the category "Phone" is commonly expressed in the format "xxx by xxxx Pixels", where x stands for a digit. The typical instances of the attribute values for entities under the same category provide valuable background information.
In this paper, we propose a knowledge-guided RL framework to perform open attribute value extraction. The RL agent is trained to make good actions for answer selection and stopping time decision. Our experiments show that the proposed framework significantly boosts the extraction performance.
To the best of our knowledge, we are the first to integrate a KG into an RL framework to perform open attribute value extraction. In summary, our contributions are threefold: • We construct a novel knowledge-guided RL framework for the open attribute value extraction task.
• We provide a benchmark data set for open attribute value extraction task.
• Our method achieves a significantly better performance than the state-of-the-art methods.

Problem Definition
We denote the entity-attribute-value triplet as <e, r, v>. The goal is to find the attribute value in an incomplete triplet <e, r, ?>. To achieve this, we pose a question generated with a pre-defined template to a search engine to obtain relevant articles. For example, to fill the incomplete triplet <GTX1080, Core code, ?>, we retrieve articles with the query "What is the core code of GTX1080?". An information extraction system, such as QANet, is used to extract a candidate answer with a certain confidence score from an article. However, due to the inconsistent quality of the online articles and the inevitable errors of the information extraction system, the result extracted from a single online article is unsatisfactory in many cases. Another source of information that can be leveraged to help fulfill such a task is a KG. While it is hard to find the attribute values of an emerging entity from the existing triplets alone, the KG can serve as background knowledge about the attributes. We approach the problem using a reinforcement learning framework that is illustrated in the next section.
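The template-based query generation can be sketched as follows; the template string mirrors the paper's example, and the exact template set used in the experiments is an assumption:

```python
def build_query(entity: str, attribute: str) -> str:
    """Fill a pre-defined template to produce a search-engine query
    for the incomplete triplet <entity, attribute, ?>."""
    # Template is illustrative; the paper only shows this English example.
    return f"What is the {attribute} of {entity}?"

print(build_query("GTX1080", "core code"))
```

The returned string is then posed to a search engine to retrieve candidate articles.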

System Overview
Our procedure is summarized in Figure 1. We use <GTX1180, core code, ?> as a running example. The query "What is the core code of GTX1180?" is posed to the search engine, and a collection of relevant articles is obtained by downloading the top M headlines and bodies on the search results page, where M is a pre-determined parameter that controls the maximum number of retrieved articles. For each retrieved article, we use an information extraction system to extract a candidate answer. In our example, RTX2080 is extracted with a confidence score of 0.30 from the first queried article and GV104 is extracted with a confidence score of 0.25 from the second. Given the first two candidate answers, the RL agent decides which answer to pick and whether more articles need to be retrieved.
To make such decisions, in addition to the confidence evidence from the information extraction system, the relevant facts in the KG are fed into the RL agent to serve as background knowledge about the attribute. For a triplet <e, r, ?>, we consider v_r a reference value with respect to the attribute r if there is a triplet <e', r, v_r> and e and e' belong to the same category¹ in the KG. In our example, since GTX1180 belongs to the category NVIDIA GPU, and so do GTX1080 and GTX980, the reference values are retrieved from the facts that the core code of GTX1080 is GP104 and the core code of GTX980 is GM204. Guided by the KG, the RL agent makes successive evaluations and finally outputs the predicted candidate attribute value via a policy network such as a DQN (Mnih et al., 2015).

¹ The category information is obtained from the concept hierarchy of CN-DBpedia. The knowledge base contains a multi-level hierarchy of categories; we use the lowest-level (most specific) category in the hierarchy to derive the reference values.
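The retrieval of reference values might look like the following sketch, assuming the KG is accessible as a list of triplets plus a mapping from entities to their lowest-level categories (both data structures are illustrative stand-ins for the CN-DBpedia lookups, not the authors' implementation):

```python
def reference_values(kg_triplets, categories, entity, attribute):
    """Collect the reference values V_r for <entity, attribute, ?>:
    every value v_r from a triplet <e', attribute, v_r> where e'
    shares the entity's lowest-level category."""
    cat = categories.get(entity)
    return [v for e2, r, v in kg_triplets
            if e2 != entity and r == attribute and categories.get(e2) == cat]

# The running example from the paper:
kg = [("GTX1080", "core code", "GP104"), ("GTX980", "core code", "GM204")]
cats = {"GTX1180": "NVIDIA GPU", "GTX1080": "NVIDIA GPU", "GTX980": "NVIDIA GPU"}
print(reference_values(kg, cats, "GTX1180", "core code"))  # ['GP104', 'GM204']
```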

Reinforcement Learning for Open Attribute Value Extraction
The attribute value extraction task is modeled as a Markov decision process (MDP), in which the RL agent is actively engaged in the decision-making process to maximize the reward, which measures the correctness of the extracted attribute values. The MDP is modeled as a tuple (S, A, T, R), where S = {s} is the space of all possible real-valued vector states; A = {a} is the set of actions; T(s'|s, a) is a transition function that maps the domain of states and actions to a probability distribution over states; and R(s, a) is a reward function that maps the domain of states and actions to a real number, encoded such that the higher the value, the better. We describe our RL methodology by illustrating these components as follows.
Action and transition At each decision stage, the agent observes two candidate answers from two articles and makes decisions to answer two questions: (i) which of the two answers is better? (ii) should the agent stop at the current best answer or continue querying more articles? At the initial decision point, two candidate answers are obtained from two articles simultaneously queried from the web; we arbitrarily assign one of them as the current best answer and the other as the new candidate answer.
We define the following three actions in A:
1. Retain: (i) retain the current best answer and discard the new answer; (ii) query the next article.
2. Replace: (i) replace the current best answer with the new answer; (ii) query the next article.
3. Stop: (i) select the current best answer as the final answer; (ii) stop querying.
At all subsequent decision points, we will retain or replace the current best answer and continue comparing with the new candidate answers queried from the web until the action is "Stop".
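The three actions and the resulting decision loop can be sketched as follows; the `policy` argument stands in for the trained DQN, so this is an illustrative skeleton rather than the authors' code:

```python
from enum import Enum

class Action(Enum):
    RETAIN = 0   # keep the current best answer, query the next article
    REPLACE = 1  # the new candidate becomes the current best, query next
    STOP = 2     # output the current best answer and terminate

def run_episode(candidates, policy):
    """Decision loop: `candidates` is the stream of answers extracted
    from successive articles; `policy` maps a (best, new) pair to an
    Action (in the paper, via the trained DQN)."""
    stream = iter(candidates)
    best = next(stream)          # the first answer seeds "current best"
    for new in stream:
        action = policy(best, new)
        if action is Action.REPLACE:
            best = new
        if action is Action.STOP:
            break                # Stop keeps the current best answer
    return best
```

For instance, a policy that always returns `Action.STOP` outputs the first extracted answer, while one that always returns `Action.REPLACE` outputs the last.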
State At each decision point, the state is constructed by concatenating the following three components, where different sources of information are combined.
(1) State variables associated with the confidence scores. The first component consists of the confidence scores associated with the two candidate answers, as defined by the information extraction system. We consider this part as the signal of the goodness of the extracted answers with respect to the articles.

Figure 2: The current best answer and the next candidate answer are extracted from the retrieved articles with an information extraction system. The extracted answers are embedded into a state vector via the state embedding process. The state vector is fed into a policy network (DQN). The policy network selects the optimal actions and outputs the current best result. The right panel presents the state embedding process: the extracted candidate answers are embedded into a state vector via similarity metrics and confidence scores.
(2) State variables informed by the KG. The second component leverages knowledge from the reference values. For a given attribute, we expect the attribute values to be lexically similar to each other. To capture such information, we first construct 7 features based on two string lexical similarity metrics. For each of the 7 features, we take the average and the maximum over the reference values for each of the two candidate answers as state variables.
String similarity metrics The two string lexical similarity metrics are defined as

L_Sim(s_1, s_2) = 1 - L(s_1, s_2) / max(|s_1|, |s_2|),
LCS_Sim(s_1, s_2) = |LCS(s_1, s_2)| / max(|s_1|, |s_2|),

where L(s_1, s_2) refers to the Levenshtein distance (Levenshtein, 1966), which measures how different two strings are by counting the number of deletions, insertions, or substitutions required to transform s_1 into s_2; L_Sim(s_1, s_2) is known as the Levenshtein similarity; and LCS(s_1, s_2) stands for the longest common sub-string of s_1 and s_2 (Gusfield, 1997).
Features based on similarity We define the following 7 features to capture the similarity between two strings from different aspects:
• f_1: L_Sim between s_1 and s_2;
• f_2: LCS_Sim between s_1 and s_2;
• f_3: L_Sim between s_1 and s_2 with numbers removed from s_1, s_2;
• f_4: LCS_Sim between s_1 and s_2 with numbers removed from s_1, s_2;
• f_5: L_Sim between s_1 and s_2 with s_1, s_2 wildcard-masked²;
• f_6: LCS_Sim between s_1 and s_2 with s_1, s_2 wildcard-masked;
• f_7: the difference in the lengths of s_1 and s_2 in characters.
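A minimal pure-Python sketch of the two metrics and the 7 features might look as follows. The normalization by the longer string's length and the digit-based wildcard-masking rule are assumptions, since the exact formulas are not fully specified here:

```python
import re

def levenshtein(s1: str, s2: str) -> int:
    """Edit distance via the classic dynamic program."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i] + [0] * len(s2)
        for j, c2 in enumerate(s2, 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (c1 != c2))
        prev = cur
    return prev[-1]

def lcs_len(s1: str, s2: str) -> int:
    """Length of the longest common sub-string."""
    best, dp = 0, [0] * (len(s2) + 1)
    for c1 in s1:
        new = [0] * (len(s2) + 1)
        for j, c2 in enumerate(s2, 1):
            if c1 == c2:
                new[j] = dp[j - 1] + 1
                best = max(best, new[j])
        dp = new
    return best

def l_sim(s1, s2):
    return 1 - levenshtein(s1, s2) / (max(len(s1), len(s2)) or 1)

def lcs_sim(s1, s2):
    return lcs_len(s1, s2) / (max(len(s1), len(s2)) or 1)

strip_digits = lambda s: re.sub(r"\d", "", s)
mask = lambda s: re.sub(r"\d", "x", s)  # assumed wildcard-masking rule

def features(s1: str, s2: str):
    """The 7 similarity features f_1..f_7 described above."""
    return [
        l_sim(s1, s2),                                # f_1
        lcs_sim(s1, s2),                              # f_2
        l_sim(strip_digits(s1), strip_digits(s2)),    # f_3
        lcs_sim(strip_digits(s1), strip_digits(s2)),  # f_4
        l_sim(mask(s1), mask(s2)),                    # f_5
        lcs_sim(mask(s1), mask(s2)),                  # f_6
        abs(len(s1) - len(s2)),                       # f_7
    ]
```

On the running example, `features("GP104", "GM204")` yields a high f_5, since both strings mask to the pattern "GPxxx"/"GMxxx", illustrating why masked comparison captures format similarity.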
Construction of state variables by the KG For each of the 7 features, given the reference values in V_r and the two candidate answers, answer_1 and answer_2, we form the 28 state variables in this part by taking averages and maximums over the reference values:

avg_{v_r ∈ V_r} f_i(answer_j, v_r) and max_{v_r ∈ V_r} f_i(answer_j, v_r), for j = 1, 2 and i = 1, ..., 7.

This is the part of the state where knowledge from the KG informs the decisions of the RL agent.
(3) State variables based on the candidate answers. The third component contains the Levenshtein similarity between the two candidate answers. Intuitively, when the confidence scores of both candidate answers are high and the answers are very similar to each other, this is a positive signal for stopping.
The components (1) -(3) are concatenated together to construct the 31-dimensional state vector to carry information from different perspectives.
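Assembling the 31-dimensional state from the three components can be sketched as below; `feats` and `l_sim` stand for the feature and similarity functions described above and are passed in as arguments to keep this fragment self-contained:

```python
def build_state(conf1, conf2, ans1, ans2, refs, feats, l_sim):
    """Concatenate the three state components into the 31-dim vector:
    2 confidence scores, 28 KG-informed variables (average and maximum
    of each of the 7 features against the reference values, for each
    candidate answer), and 1 answer-to-answer Levenshtein similarity."""
    state = [conf1, conf2]                        # component (1): 2 dims
    for ans in (ans1, ans2):                      # component (2): 28 dims
        per_ref = [feats(ans, v) for v in refs]
        for i in range(7):
            vals = [f[i] for f in per_ref]
            state.append(sum(vals) / len(vals))   # average over V_r
            state.append(max(vals))               # maximum over V_r
    state.append(l_sim(ans1, ans2))               # component (3): 1 dim
    assert len(state) == 31
    return state
```

The dimension count checks out: 2 + 7 features × 2 candidates × 2 aggregations + 1 = 31.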
Reward The reward is set to 0 while the query process is ongoing; only at the final stage, when the query is terminated, is a nonzero reward received, which measures the similarity between the final answer and the correct answer:

R = L_Sim(v̂, v),

where v̂ is the selected best attribute value and v is the true attribute value.
Method Since the state defined in our framework lies in a continuous space, we adopt a deep Q-network (DQN) to approximate Q(s, a) with a deep neural network denoted by Q(s, a; θ). Specifically, we parameterize the approximate value function Q(s, a; θ) using a three-layer deep neural network. The network takes the continuous 31-dimensional state vector s as input and predicts Q(s, a). We use rectified linear unit (ReLU) activation functions in the hidden layers. The architecture is illustrated in Figure 2.
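A pure-Python sketch of the 31→10→5→3 forward pass (hidden sizes taken from the implementation details later in the paper; the initialization scheme is an assumption) may clarify the architecture:

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, layer):
    """Affine map: one row of weights per output unit, plus a bias."""
    W, b = layer
    return [sum(w * x for w, x in zip(row, v)) + bi
            for row, bi in zip(W, b)]

def init_layer(n_in, n_out, rng):
    # Small uniform initialization; the paper's scheme is not reported.
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
# 31-dim state -> 10 -> 5 -> 3 Q-values (Retain, Replace, Stop)
layers = [init_layer(31, 10, rng), init_layer(10, 5, rng), init_layer(5, 3, rng)]

def q_values(state):
    h = relu(linear(state, layers[0]))
    h = relu(linear(h, layers[1]))
    return linear(h, layers[2])  # linear output layer for the 3 Q-values
```

The action with the largest of the three Q-values is the one the agent takes at that decision point.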
Algorithm 1 provides complete details of our MDP framework for the DQN training phase.

Experiments
In this section, we compare our proposed RL framework to state-of-the-art extraction-based baselines, demonstrating its robustness and its ability to obtain accurate answers for missing attribute values. Our code is publicly available online.³

Data
The dataset is generated from existing triplets using the largest public Chinese knowledge base, CN-DBpedia (Xu et al., 2017), with a corresponding taxonomy. Specifically, the number of training triplets is 1022. The selected entities in the experiment are from four different fields: GPU, game, movie, and phone. The testing data contains 75 triplets per field, hence the total number of testing triplets is 300. For each triplet in the training and testing data, we download articles from the top M = 10 links obtained from the Baidu search engine. CN-DBpedia is used as our external KG, with the triplets in training and testing masked, to provide the reference values.

Algorithm 1: Training phase for the DQN agent with ε-greedy exploration. In outline: initialize a set of training triplets x_i = <e_i, r_i, v_i> ∈ X, parameters θ at random, and replay memory D; for each x_i ∈ X, run the decision process and store each transition (s_t, a_t, r_t, s_{t+1}) in D; sample random mini-batches of transitions from D and regress Q(s_t, a_t; θ) toward the target y_t = r_t if a_t is "Stop", else y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; θ); the inner loop breaks when a_t is "Stop".
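The core of the DQN update, the Bellman target together with the replay memory, can be sketched as follows; γ and the mini-batch size are assumptions, as the paper does not report them here:

```python
import random
from collections import deque

def dqn_target(reward, stopped, q_next, gamma=0.9):
    """Bellman target y_t: the terminal reward when the action was
    "Stop", otherwise r_t + gamma * max_a' Q(s_{t+1}, a').
    gamma = 0.9 is an assumed discount factor."""
    return reward if stopped else reward + gamma * max(q_next)

# Replay memory D as a bounded FIFO buffer (size 10,000 in the paper)
memory = deque(maxlen=10000)

def sample_minibatch(batch_size=32):
    """Sample a random mini-batch of stored transitions.
    batch_size = 32 is an assumption."""
    return random.sample(list(memory), min(batch_size, len(memory)))
```

The network parameters θ are then updated by gradient descent on the squared error between Q(s_t, a_t; θ) and y_t over each sampled mini-batch.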

Reinforcement Learning Implementation
In the RL setting, we use a DQN to train the policy. Specifically, the DQN is a three-layer multilayer perceptron (MLP). The dimensions of the hidden layers are chosen as 10 and 5, respectively. The output dimension is 3, representing the three actions. In our experiments, the DQN model is trained for 100 epochs, where each epoch contains 1,000 transitions. We use a learning rate that decreases over epochs during the training process⁵. The ε in ε-greedy exploration is annealed from 1 to 0.02 over the first 10,000 transitions. The replay memory D has size 10,000. We deploy our RL model in RLlib for efficient distributed computation.
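The ε schedule can be sketched as follows, assuming a linear annealing (the paper states only the endpoints 1 and 0.02 and the 10,000-transition horizon):

```python
def epsilon(t, start=1.0, end=0.02, horizon=10000):
    """Exploration rate at transition t: annealed from `start` to `end`
    over the first `horizon` transitions, constant afterwards."""
    if t >= horizon:
        return end
    return start + (end - start) * t / horizon
```

At each decision point the agent takes a uniformly random action with probability `epsilon(t)` and the greedy (arg-max Q) action otherwise.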

Competitors
We experimented with four traditional aggregation methods and four variants of RL agents as competitors in the experiment.
Traditional Aggregation Methods
1. Random choice (Random): We randomly select an article out of the M articles and extract an answer from it with the information extraction system as the final answer.
2. First article (First): We use the answer extracted from the article ranked first by the search engine.
3. Majority aggregation (Majority): We use a majority-vote strategy over all the extracted answers.
4. Confidence aggregation (Confidence): The answer with the highest confidence score is chosen as the final answer. This aggregation method is only feasible when each candidate answer is associated with a confidence score.
Variations of the RL framework
1. RL-NK (no KG included): The RL agent does not leverage information from the KG. The KG-dependent part (i.e., component (2) in the state construction) is omitted from the state.
2. RL-NR (no Retain or Replace actions): The only action in the RL framework is Stop. The final answer is the one with the highest confidence score among the candidate answers seen before stopping.
3. RL-NS (no Stop action): The RL agent does not make Stop decisions. All of the M extracted candidate answers are compared.
4. RL-KG: Our proposed RL framework.

Since the sequence labeling methods cannot provide valid confidence scores for the candidate answers, the answers extracted with these methods are aggregated using the Random and Majority strategies. For the MRC models, we implemented all the aggregation strategies, including our RL-based methods.

Results
Our evaluation metric is the Levenshtein similarity between the final answer and the ground truth, which ranges from 0 to 1; a higher score represents better performance. Table 1 summarizes the results when different information extraction systems are combined with different aggregation strategies. The results are evaluated separately for each field, and the combined results are also reported. All reported results are averaged over 3 independent runs. The oracle performances are provided to differentiate the error incurred by imperfect decisions from the inherent errors caused by the information extraction system.

Table 1: Accuracy of the baseline methods and our proposed methods. Bold indicates the best baseline performance with sequence labeling methods and the best results achieved with BiDAF/QANet/BERT. The Oracle performance shows the best possible performance when perfect decisions are made. Our proposed RL-KG improves the extraction performance substantially.
From Table 1, we make the following observations. First, the RL-based methods outperform all the competing baseline methods. By adopting the RL framework instead of traditional aggregation methods, the accuracies are boosted substantially, which demonstrates the effectiveness of the RL framework. Second, compared to the RL framework without the guidance of the KG (RL-NK), our proposed RL-KG framework achieves significantly better results. This suggests that the KG does provide valuable information for the attribute value extraction task. Third, the RL-KG framework outperforms all the other variants of the RL framework, showing that considering answer selection and stopping decisions jointly achieves the best performance.
We also conduct an experiment to examine how our method performs when the KG cannot provide information for some triplets, a common situation in practice. In the experiment, we randomly set the reference values to empty for 0 - 100 percent (in increments of 10) of the triplets in training and testing. For triplets without reference values, the state variables associated with the KG (i.e., component (2) in our state construction) are set to 0. Figure 3 displays how the performance changes when information from the KG is leveraged at different levels of frequency. For all three information extraction models, the performance improves as the KG is used at higher frequencies.

Case Study
By incorporating information from the KG, the RL agent is able to rule out unreasonable answers and boost the extraction accuracy.
In Table 2, we present some cases where the trained RL agent helps to correct the information extraction errors. More details for the first two examples are included in Table 3. For the first example in Table 3, the emerging entity is Founder r680-470 and the attribute of interest is operating system. The reference values retrieved from the KG include values like IOS, DOS, Android, EMUI. The trained RL agent stops after querying six articles and selects the answer DOS over the previous candidate values, which is exactly one of the reference values. In the second example in Table 3, we are interested in the release date of the emerging entity super puzzles game. The reference val-

Related work

Machine reading comprehension (MRC) and automated question answering (QA) are important and longstanding topics in NLP research due to their huge potential in a wide variety of applications. End-to-end MRC QA models are expected to have the ability to read a piece of text and then answer questions about it. Significant progress has been made on the machine reading and QA task in recent years. Notable works include BiDAF (Seo et al., 2016), SAN (Liu et al., 2017), QANet, and ALBERT (Lan et al., 2019).
Our proposed framework can also be regarded as an end-to-end MRC QA model built on top of an existing MRC QA model, which serves as the information extraction system in our extraction process. Different from most previous works, our focus is to enhance the performance of an existing model by utilizing external information from the KG and by acquiring more articles when the agent is not confident about the extracted answer.

Open-world knowledge graph completion
Attribute value extraction under the open-world assumption has recently received much attention in the NLP community, and there have been quite a few works on open attribute value extraction. OpenTag (Zheng et al., 2018) formalized the extraction problem as a sequence tagging task and proposed an end-to-end framework for open attribute value extraction. The open-world KGC (Shi and Weninger, 2018) used a complex relationship-dependent content masking architecture to mitigate the presence of noisy text descriptions and extract the attribute value from the denoised text. TXtract (Karamanolakis et al., 2020) incorporated the categorical structure into the value tagging system. However, these methods suffer from irrelevant articles and are not able to filter out noisy answers.

Reinforcement learning

Reinforcement learning (Sutton et al., 1998) is a framework that enables agents to reason about sequential decision making as an optimization process. It has been widely applied in NLP tasks, including article summarization (Paulus et al., 2017; Celikyilmaz et al., 2018), dialogue generation (Li et al., 2016a; Serban et al., 2017; Li et al., 2019), and question answering (Das et al., 2019), among others. To the best of our knowledge, we are the first to integrate information from a KG into an RL framework to fulfill the attribute extraction task.

Conclusion and discussion
This paper presents a novel RL framework to perform open attribute value extraction. Through a set of experiments, we observe that most of the computation cost is incurred by training the information extraction system; the remaining computation cost from the RL framework is comparatively small during both training and prediction. Specifically, during our experiments, we trained a three-layer deep neural network, which has far fewer parameters than the information extraction system. The proposed RL method demonstrates promising performance, with the KG showing its ability to provide guidance in the open attribute extraction task. Our framework also contributes to the areas of knowledge graph completion and automatic question answering for attribute values.
KG has huge potential to provide rich background information in many NLP applications. Our solution for attribute value extraction can be extended to other NLP tasks. A potential attempt might be to use KG to design the reward in the RL framework to provide weak supervision. We leave this as our future work.