Speeding up Reinforcement Learning-based Information Extraction Training using Asynchronous Methods

RLIE-DQN is a recently proposed Reinforcement Learning-based Information Extraction (IE) technique which is able to incorporate external evidence during the extraction process. RLIE-DQN trains a single agent sequentially, updating on one instance at a time. This results in significant training slowdown, which is undesirable. We leverage recent advances in parallel RL training using asynchronous methods and propose RLIE-A3C. RLIE-A3C trains multiple agents in parallel and is able to achieve up to 6x training speedup over RLIE-DQN, while suffering no loss in average accuracy.


Introduction
Extracting information about an event (or entity) involves multiple decisions, as one first needs to identify documents relevant to the event, extract relevant information from those documents, and finally reconcile the various values obtained for the same relation of the event from different sources (Ahn, 2006). Search-based methods for Information Extraction (IE) have been increasingly investigated (West et al., 2014; Hegde and Talukdar, 2015; Zhang et al., 2016; Bing et al., 2017). Kanani and McCallum (2012) combine search and IE using Reinforcement Learning (RL), with the goal of selecting good actions while staying within resource constraints, but do not optimize for extraction accuracy. More recently, Narasimhan et al. (2016) proposed an RL-based approach to model the IE process outlined above. We shall refer to this approach as RLIE-DQN in this paper. RLIE-DQN trains an RL agent using Deep Q-Network (DQN) (Mnih et al., 2015) to select optimal actions to query for documents and also reconcile extracted values.

* Research carried out during an internship at the Indian Institute of Science, Bangalore.
DQN trains a single agent sequentially, updating parameters based on one instance at a time. Each such instance is sampled from a stored history of past transitions, called the experience replay memory. Such sequential, experience replay-based training results in slow learning, while requiring high memory and computation resources. To overcome this challenge, A3C (Asynchronous Advantage Actor-Critic), an asynchronous deep RL training algorithm, was recently proposed (Mnih et al., 2016). A3C trains multiple RL agents in parallel, each of which estimates gradients locally and asynchronously updates globally shared parameters. Recent work has explored applications of A3C in varied domains (Fernando et al., 2017; Mirowski et al., 2016).
In this paper, we propose RLIE-A3C, which uses A3C-based parallel asynchronous agents for training. This is in contrast to the sequential DQN training in RLIE-DQN. Differences between the training regimes of the two methods are shown in Figure 1. Through experiments on multiple real-world datasets, we find that RLIE-A3C achieves up to 6x training speedup compared to RLIE-DQN, while suffering no loss in accuracy. To the best of our knowledge, this is the first application of asynchronous deep RL methods in IE (and also in NLP), and we hope this paper will foster further adoption of and research into such methods in the NLP community. RLIE-A3C code is available at https://github.com/adi-sharma/RLIE A3C

Figure 1: Left: DQN-based sequential learning framework used in RLIE-DQN (Narasimhan et al., 2016), as discussed in Section 2. At each time step, the agent looks at a specific instance from the training data. Right: A3C-based parallel learning framework in RLIE-A3C (proposed approach). The parallel agents look at different parts of the training data, estimate parameter update statistics locally, and then perform asynchronous updates on the globally shared parameters (θ_t) at time step t. See Section 3 for details. Due to the asynchronous parallel updates, RLIE-A3C achieves significant training speedup without loss in accuracy, as we shall see in Section 4.

RLIE-DQN: Information Extraction using Reinforcement Learning
We first present a brief overview of RLIE-DQN (Narasimhan et al., 2016). Given a source document from which to extract information about an event, RLIE-DQN issues a search query to retrieve other documents related to the event, extracts event information from those documents, and finally reconciles the values extracted from the documents. If confidence in the extracted values is low, then RLIE-DQN repeats this process with additional queries. In this way, RLIE-DQN incorporates evidence from external sources to improve information extraction (IE) from a given source document.
RLIE-DQN models the task as a Markov Decision Process (MDP) in order to reconcile newly extracted information selectively and dynamically. The MDP describes the environment in which the RL agent learns to make decisions. The MDP is represented using a tuple ⟨S, A, R, T⟩, where S is the set of states, A = {a_d, a_q} is the set of actions, R(s, a) is the reward for taking action a ∈ A from state s ∈ S, and T(s′ | s, (a_d, a_q)) is the state transition function. Here, a_d is the reconciliation action, while a_q is the query action. Based on a_d, the agent may accept extracted values for one or all relations, reject all newly extracted values, or stop (ending the episode).
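As a concrete illustration, the MDP interface above can be sketched as follows. All names here (State, transition, retrieve) are hypothetical; the actual RLIE-DQN state additionally carries confidence scores and context statistics for each relation, and the reconcile action can also accept values for individual relations.

```python
# Minimal sketch of the RLIE-DQN MDP interface (illustrative names only).
from dataclasses import dataclass, field

RECONCILE_ACTIONS = ("accept", "reject", "stop")  # simplified a_d choices


@dataclass
class State:
    current_values: dict                             # relation -> accepted value
    new_values: dict = field(default_factory=dict)   # values from latest query


def transition(state, a_d, a_q, retrieve):
    """One MDP step: apply reconcile action a_d, then query action a_q.

    `retrieve` stands in for the search-and-extract pipeline: it maps a
    query string to newly extracted relation values.
    """
    if a_d == "stop":
        return state, True                   # terminal state, episode ends
    if a_d == "accept":
        state.current_values.update(state.new_values)
    # a_d == "reject": keep current values unchanged
    state.new_values = retrieve(a_q)         # extract values from new documents
    return state, False
```

Each call mirrors one arrow in Figure 2: reconcile the pending values, then issue the next query.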
A sample state transition in RLIE-DQN's MDP is shown in Figure 2. The state representation consists of many details, such as confidence scores of current and newly extracted relation values and context statistics for the documents from which the extractions are performed; for readability, only the sets of current and new values are shown for the states in this figure. Each transition consists of two actions: a reconcile decision and a query. The RL agent uses the reconcile action (a_d) to update the value of the ShooterName relation from Paul Kiska to Kevin Wardzala. The agent uses the query action (a_q) to issue a new query ("Cleveland shooting" + "injured") to retrieve other relevant documents, and extracts new values 3 and 1 for the relations NumKilled and NumWounded, respectively. The transitions stop whenever a_d is a stop decision.
The reward function at each state is the cumulative difference between current and previous extraction accuracies, summed over all relations. A negative per-step reward is also added in order to penalize the agent for longer episodes, since issuing queries to a search engine is expensive.

Deep Q-Learning (DQN): Let Q(s, a) be the measure of long-term cumulative reward obtained by taking action a from state s. The RL agent uses Q(s, a) to select the next action a from state s. Q-learning (Watkins and Dayan, 1992) is a popular technique to estimate this function, in which the optimal Q-value is estimated using the Bellman equation (Sutton and Barto, 1998):

Q*(s, a) = R(s, a) + γ max_{a′} Q*(s′, a′)

Here, R(s, a) is the immediate reward and γ is the discount factor for the value of future rewards.
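The Bellman backup above can be sketched in tabular form (DQN, described next, replaces the table with a neural network Q(s, a; θ)). The function name and the values of gamma and alpha below are illustrative, not taken from the paper.

```python
# Tabular sketch of the Q-learning backup behind the Bellman equation above.
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=0.5):
    """One backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]
```

Repeated backups of this form converge to the optimal Q-value function under standard conditions.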
For high-dimensional state spaces, Deep Q-Network (DQN) (Mnih et al., 2013) approximates Q(s, a) as Q(s, a; θ) using the parameters θ of a deep network. RLIE-DQN uses such a DQN-based agent to learn an optimal policy, as shown in Figure 1.

Proposed Approach: RLIE-A3C
The DQN-based agent used in RLIE-DQN is sequential, as it moves from one instance to another to update parameters. This can result in significant training slowdown, especially in large-data settings. Instead of using the experience replay of the DQN algorithm to stabilize updates, we consider a framework with multiple asynchronous agents, each of which explores a different area of the environment in parallel.
Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016) is a recently proposed deep RL algorithm which makes use of parallel agents for parameter estimation. We replace DQN with A3C in RLIE-DQN and call the resulting method RLIE-A3C, our proposed approach (see Figure 1).
At time instant t, RLIE-A3C poses the decision policy π_d(a_d|s_t) and the query policy π_q(a_q|s_t) as probability distributions over candidate actions a_d and a_q, respectively. RLIE-A3C also calculates the state value function V(s_t), an estimate of the cumulative long-term reward obtained starting from state s_t. Owing to the large continuous state space of the problem, RLIE-A3C approximates each of these three functions using three separate deep neural networks, parametrized by θ_d, θ_q, and θ_v: π_d(a_d|s_t) ≈ π_d(a_d|s_t; θ_d), π_q(a_q|s_t) ≈ π_q(a_q|s_t; θ_q), and V(s_t) ≈ V(s_t; θ_v). These parameters are updated by agents working in parallel. Based on the policies, each agent selects the query and decision actions to be performed and updates its state accordingly. This is repeated up to t_max steps or until a terminal state is reached.

Local Gradient Calculation: The agents estimate parameter gradients using a local copy of the network parameters, and then perform asynchronous updates on the globally shared parameters θ_d, θ_q, and θ_v. Hence, the policy and value functions are jointly estimated. The policy gradient update, calculated over the local copy θ_x^p of the network parameters of each parallel RLIE-A3C agent p, is:

dθ_x^p ← dθ_x^p + ∇_{θ_x^p} log π_x(a_t|s_t; θ_x^p) A(s_t, a_t; θ_x^p, θ_v^p) + β_x ∇_{θ_x^p} H(π_x(s_t; θ_x^p))    (1)

where x ∈ {d, q} for decision and query. The advantage function

A(s_t, a_t; θ_x^p, θ_v^p) = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v^p) − V(s_t; θ_v^p)

significantly reduces the variance of the policy gradient; here γ is the discount factor, and k can vary from state to state and is upper-bounded by t_max. Since the gradient updates are accumulated, training stability increases. Further, exploration is encouraged by the entropy regularization term β_x ∇_{θ_x^p} H(π_x(s_t; θ_x^p)) in Equation (1), where H is the entropy function and β_x controls the dominance of the entropy term.
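The advantage and entropy terms defined above can be computed as follows. This is a sketch with illustrative function names and a default discount factor, not code from the RLIE-A3C implementation.

```python
# Sketch of the k-step advantage and entropy terms in the policy gradient.
import numpy as np


def advantage(rewards, V_last, V_s0, gamma=0.99):
    """A(s_t, a_t) = sum_{i<k} gamma^i r_{t+i} + gamma^k V(s_{t+k}) - V(s_t)."""
    k = len(rewards)
    ret = sum(gamma ** i * r for i, r in enumerate(rewards)) + gamma ** k * V_last
    return ret - V_s0


def entropy(pi):
    """H(pi) = -sum_a pi(a) log pi(a); maximizing it encourages exploration."""
    pi = np.asarray(pi, dtype=float)
    return float(-(pi * np.log(pi)).sum())
```

Note that the advantage compares the observed k-step return against the critic's baseline V(s_t), which is what reduces gradient variance relative to using the raw return.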
The gradient update for the parameters of the value function V is calculated locally by every parallel agent p as follows:

dθ_v^p ← dθ_v^p + ∂(R_t − V(s_t; θ_v^p))² / ∂θ_v^p

where R_t = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v^p) is the return.

Global Parameter Update: The parameters θ_d, θ_q, and θ_v are learnt using stochastic gradient descent with RMSProp (Tieleman and Hinton, 2012). The standard non-centered RMSProp update is used by the parallel agents p to update the shared parameters asynchronously using the gradients obtained from Equation (1), as follows:

g ← αg + (1 − α)Δθ_x²
θ_x ← θ_x − η Δθ_x / √(g + ε)

where x ∈ {d, q, v}, α is the decay factor, η is the learning rate, ε is the smoothing constant, and g is the moving average of element-wise squared gradients. The pseudo-code for RLIE-A3C can be found in the Appendix.
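The non-centered RMSProp step above can be sketched as an in-place update of a shared parameter vector; the hyperparameter defaults below are illustrative, not the tuned values from the paper.

```python
# Sketch of the non-centered RMSProp update applied to shared parameters.
import numpy as np


def rmsprop_update(theta, g_avg, grad, eta=1e-3, alpha=0.99, eps=1e-6):
    """g <- alpha*g + (1-alpha)*grad^2 ; theta <- theta - eta*grad/sqrt(g+eps)."""
    g_avg[:] = alpha * g_avg + (1.0 - alpha) * grad ** 2
    theta[:] = theta - eta * grad / np.sqrt(g_avg + eps)
    return theta, g_avg
```

In A3C, each parallel agent applies this step to the globally shared θ and g without locking; the updates are asynchronous but remain stable in practice.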

Experiments and Results
Setup: We compare RLIE-A3C against RLIE-DQN using the same protocol and hyperparameters as reported in (Narasimhan et al., 2016). We also experiment with the same two datasets used in that paper: the Gun Violence Archive and the Foodshield EMA database. The train, dev, and test sets contain 372, 146, and 146 source articles, respectively, for the shooting incident cases, and 292, 42, and 148 source articles, respectively, for the food adulteration cases. We used the implementation of RLIE-DQN provided by the authors of that system. For more details on the datasets and other parameters, we refer the reader to (Narasimhan et al., 2016).

RLIE-A3C: This is our proposed method, described in Section 3. For the sake of fair comparison, the network, base classifier, and evaluation metrics are the same as those of RLIE-DQN. Each of the three deep networks in RLIE-A3C, one each for π_d(a_d|s), π_q(a_q|s), and V(s), is built using two linear layers with 20 hidden units, followed by ReLUs. A MaxEnt classifier is used as the base extractor, and the model is trained for 1.6 million steps and evaluated on the entire test set. The dev set is used to tune all hyperparameters, which can be found in the Appendix. For RLIE-A3C, the evaluation is carried out 50 times after training and the average accuracy values are taken over the top 5 evaluations, as done in (Mnih et al., 2016).
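For illustration, one of the three networks described above can be sketched as follows. The weight initialization, input dimension, and function names are assumptions for the sketch; the actual implementation details may differ.

```python
# Sketch of one RLIE-A3C head: two linear layers with 20 hidden units and a
# ReLU, with a softmax output for the policies and a raw output for V(s).
import numpy as np

rng = np.random.default_rng(0)


def make_net(in_dim, out_dim, hidden=20):
    """Create parameters for a two-layer network (assumed Gaussian init)."""
    return {"W1": rng.normal(0.0, 0.1, (in_dim, hidden)), "b1": np.zeros(hidden),
            "W2": rng.normal(0.0, 0.1, (hidden, out_dim)), "b2": np.zeros(out_dim)}


def forward(net, s, softmax=True):
    h = np.maximum(0.0, s @ net["W1"] + net["b1"])   # first linear layer + ReLU
    z = h @ net["W2"] + net["b2"]                    # second linear layer
    if not softmax:
        return z                                     # value head V(s; theta_v)
    e = np.exp(z - z.max())
    return e / e.sum()                               # policy head pi(.|s)
```

The two policy networks use the softmax output to produce distributions over the a_d and a_q candidates, while the value network returns a single scalar.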

Results and Discussion
Extraction Accuracy: Experimental results comparing extraction accuracies of RLIE-DQN and RLIE-A3C are presented in Figure 3 (left and middle panels). From this figure, we observe that there is no loss in average accuracy in transitioning from RLIE-DQN to RLIE-A3C (in fact, there is a slight gain in the case of the Adulteration dataset). Please note that improvement in accuracy is not our primary goal in this paper; it is the training time speedup, as discussed next.

Speedup: Training times of RLIE-DQN and RLIE-A3C over both datasets are compared in the right panel of Figure 3. We find that RLIE-A3C is able to achieve up to 6x training speedup over RLIE-DQN. In other words, RLIE-A3C achieves significant speedup in training time over RLIE-DQN, without any loss in average accuracy. This is our main result. While RLIE-DQN was implemented in Torch, RLIE-A3C was implemented in Python using the TensorFlow framework. We note that Torch is known to be faster than TensorFlow (Bahrampour et al., 2015). This makes the speedup gains above even more impressive, and suggests that further gains may be possible with a Torch-based implementation of RLIE-A3C.

Figure 4 shows the evolution of test accuracy of RLIE-A3C for the four relations of the Shooting Incidents dataset. For this dataset, the state value function of RLIE-A3C converged at 48 minutes. However, from Figure 4, we observe that the accuracies converge to their final values much before that.

Why do Asynchronous Methods work? An asynchronous approach fits more naturally with the training data, since different parallel agents look at different events at the same time (see Figure 1), and the model is able to exploit the regularities between events in the dataset. The gradient updates to the global network are less biased, and the model is not easily distracted by noise in the data.
The asynchronous parallel model is able to converge much faster as compared to replay memory based methods like DQN, as also seen in (Mnih et al., 2016).

Conclusion
In this paper, we proposed RLIE-A3C, an asynchronous deep Reinforcement Learning (RL) algorithm for Information Extraction (IE). In contrast to the sequential training in the previously proposed RLIE-DQN (Narasimhan et al., 2016), RLIE-A3C employs asynchronous parallel training. This results in up to 6x training speedup, without suffering any loss in average accuracy. We hope that this first application of asynchronous deep RL algorithms will open up wider adoption of such techniques in the NLP community.