Deep Reinforcement Learning for Chinese Zero Pronoun Resolution

Recent neural network models for Chinese zero pronoun resolution achieve strong performance by capturing semantic information for zero pronouns and candidate antecedents, but they tend to be short-sighted, operating solely by making local decisions. They typically predict coreference links between the zero pronoun and one single candidate antecedent at a time while ignoring the influence of each decision on future decisions. Ideally, modeling useful information from preceding potential antecedents is crucial for classifying later zero pronoun-candidate antecedent pairs, a need that motivates drawing on reinforcement learning. In this paper, we show how to integrate these goals, applying deep reinforcement learning to the task. With the help of the reinforcement learning agent, our system learns a policy for selecting antecedents sequentially, so that useful information provided by earlier predicted antecedents can be used when making later coreference decisions. Experimental results on OntoNotes 5.0 show that our approach substantially outperforms the state-of-the-art methods under three experimental settings.


Introduction
Zero pronouns, a special linguistic phenomenon in pro-drop languages, are pervasive in Chinese documents (Zhao and Ng, 2007). A zero pronoun is a gap in a sentence that refers to a component omitted because of the coherence of language. The following is an example of zero pronouns in a Chinese document, where zero pronouns are represented as "φ".
[当事人 李亚鼎] 除了 表示 φ1 欣然 接受 但 φ2 也 希望 国家 要 有 人 负责。
A zero pronoun is anaphoric if it corefers with one or more mentions in the associated text, and unanaphoric if there are no such mentions. In this example, the second zero pronoun "φ2" is anaphoric and corefers with the mention "当事人 李亚鼎/Litigant Li Yading", while the zero pronoun "φ1" is unanaphoric. The mentions that contain the important information for interpreting a zero pronoun are called its antecedents.
In recent years, deep learning models for Chinese zero pronoun resolution have been widely investigated (Chen and Ng, 2016; Yin et al., 2016). These solutions concentrate on anaphoric zero pronoun resolution, applying various neural network models to zero pronoun-candidate antecedent prediction. Neural network models have demonstrated their ability to learn vector-space semantics of zero pronouns and their antecedents (Yin et al., 2016), and they substantially surpass classic models (Zhao and Ng, 2007; Chen and Ng, 2013, 2015), obtaining state-of-the-art results on the benchmark dataset.
However, these models rely heavily on local coreference decisions. They consider the coreference link between the zero pronoun and one single candidate antecedent at a time while overlooking the impact of each decision on future decisions. Intuitively, antecedents provide key linguistic cues for explaining the zero pronoun. It is therefore reasonable to leverage the information provided by previously predicted antecedents as cues for predicting later zero pronoun-candidate antecedent pairs. For instance, given the sentence "I have confidence that φ can do it." with the candidate mentions "he", "the boy" and "I", it is hard to infer whether the mention "I" can be the antecedent if it is considered in isolation. In that case, the resolver may incorrectly predict "I" to be the antecedent, since "I" is the nearest mention. However, if we know that "he" and "the boy" have already been predicted to be antecedents, it is straightforward to classify the φ-"I" pair as "non-coreference", because "I" refers to a different entity from the one referred to by "he". Hence, a desirable resolver should be able to 1) take advantage of cues from previously predicted antecedents, which can be incorporated to help classify later candidate antecedents, and 2) model the long-term influence of each single coreference decision in a sequential manner.
To achieve these goals, we propose a deep reinforcement learning model for anaphoric zero pronoun resolution. On top of existing neural network models (Yin et al., 2016), we introduce two main innovations that leverage the information provided by potential antecedents and make long-term decisions from a global perspective. First, when dealing with a specific zero pronoun-candidate antecedent pair, our system encodes in the vector space all preceding candidates that have been predicted to be antecedents. The resulting vector is regarded as the antecedent information, which is used to measure the coreference probability of the zero pronoun-candidate antecedent pair. In addition, a policy-based deep reinforcement learning algorithm is applied to learn the policy of making coreference decisions for zero pronoun-candidate antecedent pairs. The key idea behind our reinforcement learning model is to treat antecedent determination as a sequential decision process, in which our model learns to link the zero pronoun to its potential antecedents incrementally. By encoding the antecedents predicted in previous states, our model can explore the long-term influence of individual decisions, producing more accurate results than models that consider only the limited information in one single state.

Our strategy is favorable in the following aspects. First, the proposed model learns to make decisions using linguistic cues from previously predicted antecedents. Instead of simply making local decisions, our technique allows the model to learn which action (for example, predicting a candidate to be an antecedent) available from the current state can eventually lead to high overall performance. Second, instead of requiring supervised signals at each time step, the deep reinforcement learning model optimizes its policy based on an overall reward signal. In other words, it learns to directly optimize the overall evaluation metrics, which is more effective than learning with loss functions that heuristically define the goodness of a particular single decision.

Our experiments are conducted on the OntoNotes dataset. Compared to baseline systems, our model obtains significant improvements, achieving state-of-the-art performance for zero pronoun resolution. The major contributions of this paper are three-fold.
• We are the first to consider reinforcement learning models for zero pronoun resolution in Chinese documents;
• The proposed deep reinforcement learning model leverages linguistic cues provided by the antecedents predicted in earlier states when classifying later candidate antecedents;
• We evaluate our reinforcement learning model on a benchmark dataset, where a considerable improvement is gained over the state-of-the-art systems.
The rest of this paper is organized as follows. The next section describes our deep reinforcement learning model for anaphoric zero pronoun resolution. Section 3 presents our experiments, including the dataset description, evaluation metrics, experiment results, and analysis. We outline related work in Section 4. Section 5 presents the conclusion and future work.

Methodology
In this section, we introduce the technical details of the proposed reinforcement learning framework. The task of anaphoric zero pronoun resolution is to select antecedents for the zero pronoun from its candidate antecedents. Here we formulate it as a sequential decision process in a reinforcement learning setting. We first describe the environment of the Markov decision process and our reinforcement learning agent. Then, we introduce the modules. The last subsection describes the supervised pre-training of our model.

Figure 1: Illustration of our reinforcement learning framework. Given a zero pronoun with n candidate antecedents (presented as "NP"), at each time step the agent scores zero pronoun-candidate antecedent pairs for their likelihood of coreference using 1) the zero pronoun; 2) the candidate antecedent and 3) antecedent information. Antecedent information at time t is generated from all the antecedents predicted in previous states.

Reinforcement Learning for Zero Pronoun Resolution
Given an anaphoric zero pronoun zp, a set of candidate antecedents must first be selected from its associated text. For this purpose, we adopt the heuristic used in recent work on Chinese anaphoric zero pronoun resolution (Chen and Ng, 2016; Yin et al., 2016). Among the noun phrases that are at most two sentences away from the zero pronoun, we select those that are maximal noun phrases or modifier noun phrases to compose the candidate set. These noun phrases ($\{np_1, np_2, ..., np_n\}$) and the zero pronoun (zp) are then encoded into representation vectors: $\{v_{np_1}, v_{np_2}, ..., v_{np_n}\}$ and $v_{zp}$. Previous neural network models (Chen and Ng, 2016; Yin et al., 2016) generally use pairwise models to select antecedents. In these works, candidate antecedents and the zero pronoun are first merged into pairs $\{(zp, np_1), (zp, np_2), ..., (zp, np_n)\}$, and then different neural networks are applied to each pair independently. We argue that these models only make local decisions while overlooking the impact of each decision on future decisions. In contrast, we formulate the antecedent determination process as a Markov decision process. We design a reinforcement learning algorithm that learns to classify candidate antecedents incrementally. When predicting one single zero pronoun-candidate antecedent pair, our model leverages antecedent information generated from previously predicted antecedents, making coreference decisions based on global signals.
The architecture of our reinforcement learning framework is shown in Figure 1. At each time step, our reinforcement learning agent classifies the zero pronoun-candidate antecedent pair using 1) the zero pronoun; 2) information about the current candidate antecedent and 3) antecedent information generated from antecedents predicted in previous states. In particular, our reinforcement learning agent is designed as a policy network $\pi_\theta(s, a) = p(a \mid s; \theta)$, where s represents the state, a indicates the action and θ represents the parameters of the model. The parameters θ are trained using stochastic gradient descent. Compared with a Deep Q-Network (Mnih et al., 2013), which commonly learns a greedy policy, a policy network is able to learn a stochastic policy that prevents the agent from getting stuck at an intermediate state (Xiong et al., 2017). Additionally, the learned policy is more explainable than the value functions learned by a Deep Q-Network. We now define the components of our reinforcement learning model, namely, state, action and reward.

State
Given a zero pronoun zp with its representation $v_{zp}$ and the representations of all its candidate antecedents $\{v_{np_1}, v_{np_2}, ..., v_{np_n}\}$, our model generates coreference decisions for the zero pronoun-candidate antecedent pairs in sequence. More specifically, at each time step, the state is generated from both the vectors of the current zero pronoun-candidate antecedent pair and the candidates that have been predicted to be antecedents in previous states. For time t, the state vector $s_t$ is generated as follows:

$$s_t = (v_{zp}, v_{np_t}, v_{feature_t}, v_{ante}(t))$$

where $v_{zp}$ and $v_{np_t}$ are the vectors of zp and $np_t$ at time t. As shown in Chen and Ng (2016), human-designed features are essential for the resolver since they reveal the syntactic, positional and other relations between a zero pronoun and its candidate antecedents. Hence, to evaluate the coreference possibility of each candidate antecedent in a comprehensive manner, we integrate into our model a group of features used in previous work (Zhao and Ng, 2007; Chen and Ng, 2013, 2016). We decompose each multi-value feature into a corresponding set of binary-value ones. $v_{feature_t}$ represents the feature vector, and $v_{ante}(t)$ represents the antecedent information generated from candidates that have been predicted to be antecedents in previous states. These vectors are then concatenated to form the representation of the state, which is fed into the deep reinforcement learning agent to generate the action.
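The state construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names `one_hot` and `build_state`, the toy dimensions and the feature values are all assumptions made for the example.

```python
from typing import List, Sequence


def one_hot(value: int, num_values: int) -> List[float]:
    """Decompose one multi-value feature into binary-valued indicators."""
    return [1.0 if k == value else 0.0 for k in range(num_values)]


def build_state(v_zp: Sequence[float],
                v_np_t: Sequence[float],
                v_feature_t: Sequence[float],
                v_ante_t: Sequence[float]) -> List[float]:
    """Concatenate the four component vectors into one flat state vector."""
    return [*v_zp, *v_np_t, *v_feature_t, *v_ante_t]


# Hypothetical toy inputs: a 4-valued feature taking value 2, plus one
# binary feature that is already 0/1.
v_feature_t = one_hot(2, 4) + [1.0]
s_t = build_state([0.1, 0.2], [0.3, 0.4], v_feature_t, [0.0] * 4)
print(len(s_t))  # 2 + 2 + 5 + 4 = 13
```

Because the state is a plain concatenation, its dimension is simply the sum of the component dimensions.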

Action
The action for each state is either corefer, which indicates that the zero pronoun and the candidate antecedent are coreferent, or noncorefer otherwise. If the action corefer is taken, we retain the vector of the corresponding antecedent together with those of the antecedents predicted in previous states to generate the vector $v_{ante}$, which is used to produce the antecedent information in the next state.
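The sequential decision process can be sketched as a loop in which every corefer action enlarges the set of antecedents visible to later states. The function `run_episode` and the rule-based `toy_policy` are hypothetical stand-ins for illustration only. In the paper the probability comes from the learned policy network.

```python
import random
from typing import Callable, List, Sequence, Tuple

COREFER, NONCOREFER = 1, 0


def run_episode(candidates: Sequence[str],
                policy: Callable[[str, List[str]], float],
                rng: random.Random) -> Tuple[List[int], List[str]]:
    """Visit candidates in order; each decision sees earlier predicted antecedents."""
    predicted: List[str] = []  # antecedents selected in previous states
    actions: List[int] = []
    for np_t in candidates:
        p_corefer = policy(np_t, predicted)  # probability of the corefer action
        action = COREFER if rng.random() < p_corefer else NONCOREFER
        if action == COREFER:
            predicted.append(np_t)  # retained as antecedent information
        actions.append(action)
    return actions, predicted


def toy_policy(np_t: str, predicted: List[str]) -> float:
    # Hypothetical rule: reject a first-person mention once other
    # antecedents have already been selected.
    if np_t == "I" and predicted:
        return 0.0
    return 1.0


actions, predicted = run_episode(["he", "the boy", "I"], toy_policy, random.Random(0))
print(actions, predicted)  # [1, 1, 0] ['he', 'the boy']
```

The toy rule mirrors the example from the introduction: considered alone, "I" would be accepted, but after "he" and "the boy" have been selected it is rejected.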

Reward
Normally, once the agent executes a series of actions, it observes a reward $R(a_{1:T})$ that could be any function. To encourage the agent to find accurate antecedents, we regard the F-score for the selected antecedents as the reward for each action in a path.
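An F-score reward over one action sequence can be sketched as follows. The text does not spell out how the F-score is computed for a path, so this sketch assumes precision and recall are measured over the candidates of a single zero pronoun, and that a path with no correct selection receives zero reward. The name `f_score_reward` is illustrative.

```python
from typing import Sequence


def f_score_reward(actions: Sequence[int], gold: Sequence[int]) -> float:
    """F-score of the selected antecedents; 1 = corefer, 0 = noncorefer."""
    selected = sum(actions)
    total_gold = sum(gold)
    correct = sum(1 for a, g in zip(actions, gold) if a == 1 and g == 1)
    if correct == 0:
        return 0.0  # assumption: no correct antecedent means zero reward
    precision = correct / selected
    recall = correct / total_gold
    return 2 * precision * recall / (precision + recall)


# Two candidates selected, one of them correct, one gold antecedent:
# precision = 0.5, recall = 1.0, F = 2/3.
reward = f_score_reward([1, 1, 0], [1, 0, 0])
print(round(reward, 4))  # 0.6667
```

The same scalar is shared by every action in the path, which is what lets an early decision be credited for its effect on later ones.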

Reinforcement Learning Agent
Our reinforcement learning agent comprises three parts: the zero pronoun encoder, which learns to encode a zero pronoun into a vector using its context words; the candidate mention encoder, which represents the candidate antecedents by their content words; and the agent, which maps the state vector s to a probability distribution over all possible actions. In this work, a ZP-centered neural network model from previous work is employed as the zero pronoun encoder. The encoder learns to encode the zero pronoun from its associated text into its vector-space semantics. In particular, two standard recurrent neural networks are employed to encode the preceding text and the following text of a zero pronoun, separately. Such a model learns to encode the text around the zero pronoun, exploiting sentence-level information for the zero pronoun. For the candidate mention encoder, we adopt a recurrent neural network-based model that encodes each phrase using its content words. More specifically, we use a standard recurrent neural network to model the content of a phrase from left to right. This model learns to produce the vector of a phrase by considering its content, giving our model the ability to capture its vector-space semantics. In this way, we generate the vector for zp, $v_{zp}$, and the representation vectors of all its candidate antecedents, which are denoted as $\{v_{np_1}, v_{np_2}, ..., v_{np_n}\}$.
Moreover, we employ pooling operations to encode antecedent information from the antecedents that are predicted in previous states. In particular, we generate two vectors by applying max-pooling and average-pooling, respectively, over the vectors of these antecedents. These two vectors are then concatenated. Let the representative vector of the t-th candidate antecedent be $v_{np_t} \in \mathbb{R}^d$; the antecedent information at time t is then obtained by pooling over the vectors of the antecedents predicted before time t.

Figure 2: Architecture of the agent. The labeled inputs include (2) Candidate Antecedents; (3) Pair Features and (4) Antecedents. By going through all the fully-connected hidden layers and one softmax layer, the agent maps the state vector into a probability distribution over actions that indicates the coreference likelihood of the input zero pronoun-candidate antecedent pair.
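The pooling step for antecedent information can be sketched as below. This is an illustrative reading of the description, with element-wise max-pooling and average-pooling concatenated into a vector of size 2d. Returning an all-zero vector when no antecedent has been predicted yet is an assumption, since the text does not state how that case is handled. The name `antecedent_information` is hypothetical.

```python
from typing import List, Sequence


def antecedent_information(antecedent_vectors: Sequence[Sequence[float]],
                           d: int) -> List[float]:
    """Concatenate element-wise max-pooling and average-pooling (size 2 * d)."""
    if not antecedent_vectors:
        return [0.0] * (2 * d)  # assumption: empty history maps to zeros
    count = len(antecedent_vectors)
    max_pooled = [max(v[k] for v in antecedent_vectors) for k in range(d)]
    avg_pooled = [sum(v[k] for v in antecedent_vectors) / count for k in range(d)]
    return max_pooled + avg_pooled


# Two previously predicted antecedents with d = 2 (made-up values).
v_ante = antecedent_information([[1.0, -2.0], [3.0, 0.0]], d=2)
print(v_ante)  # [3.0, 0.0, 2.0, -1.0]
```

Pooling keeps the size of the antecedent information fixed no matter how many antecedents have been selected, so the state dimension does not change across time steps.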
The concatenation of these vectors is regarded as the input and is fed into our reinforcement learning agent. More specifically, a feed-forward neural network constitutes the agent, mapping the state vector to a probability distribution over all possible actions. Figure 2 shows the architecture of the agent. Two hidden layers are employed in our model, each of which uses tanh as the activation function. For each layer, we generate the output by:

$$h_i(s_t) = \tanh(W_i h_{i-1}(s_t) + b_i)$$

where $W_i$ and $b_i$ are the parameters of the i-th hidden layer, $s_t$ represents the state vector, and $h_0(s_t) = s_t$. After going through all the layers, we obtain the representative vector for the zero pronoun-candidate antecedent pair $(zp, np_t)$. We then feed it into a scoring layer to get the coreference score. The scoring layer is a fully-connected layer of dimension 2:

$$score(zp, np_t) = W_s h_2(s_t)$$

where $h_2$ represents the output of the second hidden layer, $W_s \in \mathbb{R}^{2 \times r}$ is the parameter of the layer, and r is the dimension of $h_2$. We then generate the probability distribution over actions from the output of the scoring layer, where a softmax layer is employed to obtain the probability of each action:

$$p_\theta(a) \propto e^{score(zp, np_t)}$$

In this work, a policy-based reinforcement learning model is employed to train the parameters of the agent. More specifically, we use the REINFORCE policy gradient algorithm (Williams, 1992), which learns to maximize the expected reward:

$$J(\theta) = \mathbb{E}_{a_{1:T} \sim p(a \mid zp, np_t; \theta)} R(a_{1:T})$$

where $p(a \mid zp, np_t; \theta)$ indicates the probability of selecting action a.
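The forward pass of the agent (two tanh hidden layers, a two-dimensional scoring layer and a softmax) can be sketched without any external library. The toy layer sizes, the weight values, the bias-free scoring layer and the ordering of the two outputs as (corefer, noncorefer) are assumptions made for the example. The paper's hidden layers have 256 and 512 dimensions.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def affine(W: Matrix, b: Sequence[float], x: Sequence[float]) -> List[float]:
    """Compute W x + b for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def agent_forward(state: Sequence[float], W1: Matrix, b1: Sequence[float],
                  W2: Matrix, b2: Sequence[float], Ws: Matrix) -> List[float]:
    """Map a state vector to probabilities over the two actions."""
    h1 = [math.tanh(z) for z in affine(W1, b1, state)]
    h2 = [math.tanh(z) for z in affine(W2, b2, h1)]
    scores = affine(Ws, [0.0, 0.0], h2)  # scoring layer, assumed bias-free
    return softmax(scores)


# Toy parameters: state of size 3, hidden layers of size 2 (made-up values).
W1 = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
b1 = [0.1, -0.1]
W2 = [[0.5, -0.5], [0.3, 0.8]]
b2 = [0.0, 0.2]
Ws = [[1.0, -1.0], [-1.0, 1.0]]
probs = agent_forward([1.0, 0.5, -0.5], W1, b1, W2, b2, Ws)
print(probs)  # two probabilities that sum to 1
```

With all weights set to zero, both scores are equal and the policy is uniform over the two actions, which is a quick sanity check for the wiring.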
Intuitively, the estimate of the gradient may have very high variance. One commonly used remedy to reduce the variance is to subtract a baseline value b from the reward. Hence, we use the following gradient estimate:

$$\nabla_\theta J(\theta) \approx \sum_{t=1}^{T} \nabla_\theta \log p(a_t \mid zp, np_t; \theta) \, (R(a_{1:T}) - b_t)$$

Following Clark and Manning (2016), we introduce the baseline b and compute its value $b_t$ at time t as $\mathbb{E}_{a_t \sim p} R(a_1, ..., a_t, ..., a_T)$.
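The baseline described here, the expected reward when only the action at time t is resampled from the policy, can be computed exactly for a binary action space. The sketch below also forms the usual REINFORCE surrogate loss, whose gradient with the advantage held fixed matches a policy gradient estimate with baseline. The function names and the accuracy-style `toy_reward` are assumptions for illustration. The paper uses the F-score as the reward.

```python
import math
from typing import Callable, List, Sequence

RewardFn = Callable[[List[int]], float]


def baselines(probs: Sequence[float], actions: Sequence[int],
              reward_fn: RewardFn) -> List[float]:
    """b_t = expected reward over a_t ~ p, with all other actions kept fixed."""
    values = []
    for t, p_corefer in enumerate(probs):
        with_corefer = list(actions)
        with_corefer[t] = 1
        with_noncorefer = list(actions)
        with_noncorefer[t] = 0
        values.append(p_corefer * reward_fn(with_corefer)
                      + (1.0 - p_corefer) * reward_fn(with_noncorefer))
    return values


def surrogate_loss(probs: Sequence[float], actions: Sequence[int],
                   reward_fn: RewardFn) -> float:
    """Negative sum of log p(a_t) * (R - b_t); the advantage is a constant."""
    reward = reward_fn(list(actions))
    loss = 0.0
    for p_corefer, a_t, b_t in zip(probs, actions,
                                   baselines(probs, actions, reward_fn)):
        log_prob = math.log(p_corefer if a_t == 1 else 1.0 - p_corefer)
        loss -= log_prob * (reward - b_t)
    return loss


gold = [1, 0]  # hypothetical gold labels for two candidates


def toy_reward(actions: List[int]) -> float:
    # Hypothetical reward: fraction of decisions that match the gold labels.
    return sum(1 for a, g in zip(actions, gold) if a == g) / len(gold)


probs = [0.8, 0.4]   # p(corefer) at each time step (made-up values)
sampled = [1, 1]     # one sampled action sequence
b = baselines(probs, sampled, toy_reward)
loss = surrogate_loss(probs, sampled, toy_reward)
print(b, loss)
```

In this toy run the second action (a wrong corefer) has a negative advantage, so minimizing the surrogate lowers its probability, while the first, correct action is reinforced.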

Pretraining
Pretraining is crucial in reinforcement learning techniques (Clark and Manning, 2016; Xiong et al., 2017). In this work, we pretrain the model using a loss function from previous work:

$$loss = -\sum_{i} \sum_{np \in A(zp_i)} \delta(zp_i, np) \log P(np \mid zp_i)$$

where $P(np \mid zp_i)$ is the coreference score generated by the agent (the probability of choosing the corefer action); $A(zp_i)$ represents the candidate antecedents of $zp_i$; and $\delta(zp, np)$ is 1 or 0, indicating whether zp and np are coreferent or not.
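A supervised pretraining loss built from the quantities defined here can be sketched as follows. The exact functional form is an assumption: the sketch uses a negative log-likelihood summed over the gold antecedents of each zero pronoun, which is one common choice consistent with the definitions of P, A and δ. The name `pretraining_loss` and the toy probabilities are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Each zero pronoun is a list of (P(np | zp), delta(zp, np)) pairs,
# one pair per candidate antecedent in A(zp).
Candidates = Sequence[Tuple[float, int]]


def pretraining_loss(examples: Sequence[Candidates]) -> float:
    """Assumed form: negative log-likelihood of the gold antecedents."""
    loss = 0.0
    for candidates in examples:
        for prob_corefer, delta in candidates:
            if delta == 1:  # only gold antecedents contribute in this form
                loss -= math.log(prob_corefer)
    return loss


# Two zero pronouns with made-up corefer probabilities.
examples: List[Candidates] = [
    [(0.5, 1), (0.9, 0)],
    [(0.25, 1), (0.1, 0), (0.5, 1)],
]
print(round(pretraining_loss(examples), 4))  # 2.7726
```

Under this form, a candidate with δ = 0 does not enter the loss directly, so the pretrained model is pushed only toward assigning high corefer probability to gold antecedents.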

Dataset
As in recent work on Chinese zero pronoun resolution (Chen and Ng, 2016; Yin et al., 2016), the proposed model is evaluated on the Chinese portion of the OntoNotes 5.0 dataset that was used in the CoNLL-2012 Shared Task. Documents in this dataset come from six different sources, namely, Broadcast News (BN), Newswires (NW), Broadcast Conversations (BC), Telephone Conversations (TC), Web Blogs (WB) and Magazines (MZ). Since zero pronoun coreference annotations exist only in the training and development sets (Chen and Ng, 2016), we use the training dataset for training and test our model on the development set. The statistics of our dataset are reported in Table 1. To make a fair comparison, we adopt the strategy used in existing work (Chen and Ng, 2016), where 20% of the training dataset is randomly selected and reserved as a development dataset for tuning the model.

Evaluation Measures
Following previous work on zero pronoun resolution (Zhao and Ng, 2007; Chen and Ng, 2016; Yin et al., 2016), the metrics employed to evaluate our model are recall, precision, and F-score (F). We report the performance for each source in addition to the overall result.

Baselines and Experiment Settings
Five recent zero pronoun resolution systems are employed as our baselines, namely, Zhao and Ng (2007), Chen and Ng (2015), Chen and Ng (2016),  and Yin et al. (2016). The first is machine learning-based, the second is unsupervised, and the others are all deep learning models. Since we concentrate on the anaphoric zero pronoun resolution process, we run experiments in the setting with ground truth parse results and ground truth anaphoric zero pronouns, all of which are taken from the original dataset. Moreover, to illustrate the effectiveness of our reinforcement learning model, we run a set of ablation experiments with different numbers of pretraining iterations and report the performance of our model for each. In addition, to explore the randomness of the reinforcement learning technique, we report the performance variation of our model with different random seeds.

Implementation Details
We randomly initialize the parameters and minimize the objective function using Adagrad (Duchi et al., 2011). The embedding dimension is 100, and hidden layers are 256 and 512 dimensions, respectively. Moreover, the dropout (Hinton et al., 2012) regularization is added to the output of each layer.

Experiment Results
In Table 3, we compare the results of our model with the baselines on the test dataset. Our reinforcement learning model surpasses all previous baselines. More specifically, for the "Overall" results, our model obtains a considerable improvement of 2.3% in F-score over the best baseline. Moreover, we run experiments on the different sources of documents and report the results for each source. The number following a source's name indicates the number of anaphoric zero pronouns in that source. Our model beats the best baseline in four of six sources, demonstrating the effectiveness of our reinforcement learning model. The improvement gained over the best baseline in source "BC" is 4.3% in F-score, which is encouraging since this source contains the most anaphoric zero pronouns. Overall, these results suggest that our model surpasses existing baselines, which demonstrates the effectiveness of the proposed technique. Our model learns useful information gathered from candidates that have been predicted to be antecedents in previous states, which brings a global view instead of simply making local decisions. By applying reinforcement learning, our model learns to directly optimize the overall performance in expectation, which guides its decisions in a sequential manner. Together, these properties help the model predict accurate antecedents, leading to better performance.

Moreover, to better illustrate the effectiveness of the proposed reinforcement learning model, we run a set of experiments with different settings. In particular, we compare the model with and without the proposed reinforcement learning process using different numbers of pre-training iterations. Each time, we report the performance of our model on both the test and development sets. For all these experiments, we keep the rest of the model unchanged.
Figure 3: Experiment results of different models, where "RL" represents the reinforcement learning algorithm and "Pre" represents the model without reinforcement learning. "dev" shows the performance of our reinforcement learning model on the development dataset.

Figure 3 shows the performance of our model with and without reinforcement learning. We can see from the figure that our model with reinforcement learning achieves better performance than the model without it across the board. With the help of reinforcement learning, our model learns to choose effective actions in sequential decisions. It empowers the model to directly optimize the overall evaluation metrics, which is a more effective and natural way of dealing with the task. Moreover, since the performance on the development dataset stops increasing beyond 70 iterations, we set the number of pretraining iterations to 70.
Following Reimers and Gurevych (2017), to illustrate the impact of randomness on our reinforcement learning model, we run our model with different random seed values. Table 4 shows the performance of our model with different random seeds on the test dataset. We report the minimum, maximum and median F-scores and the standard deviation σ of the F-scores. We run the model with 38 different random seeds. The maximum F-score is 57.5% and the minimum one is 56.5%. Based on this observation, we conclude that our proposed reinforcement learning model generally beats the baselines and achieves state-of-the-art performance.

Min F    Median F    Max F    σ
56.5     57.1        57.5     0.00253
Table 4: Performance of our model with different random seeds.
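The summary statistics reported for the random-seed runs can be computed as in the following sketch. The F-scores in the example are made-up placeholders, not results from the paper, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics
from typing import Dict, Sequence


def summarize(f_scores: Sequence[float]) -> Dict[str, float]:
    """Minimum, median, maximum and standard deviation over seed runs."""
    return {
        "min": min(f_scores),
        "median": statistics.median(f_scores),
        "max": max(f_scores),
        "std": statistics.pstdev(f_scores),  # population std (assumption)
    }


# Hypothetical F-scores from three runs with different random seeds.
summary = summarize([0.561, 0.570, 0.574])
print(summary["min"], summary["median"], summary["max"])  # 0.561 0.57 0.574
```

Reporting the spread alongside the median makes it easier to judge whether a gain over a baseline exceeds seed-to-seed variation.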

Case Study
Lastly, we present a case to illustrate the effectiveness of our proposed model, as shown in Figure 4. In this case, our model correctly predicts the mentions "那小穗/The Xiaohui" and "她/She" as the antecedents of the zero pronoun "φ". This case demonstrates the effectiveness of our model. Instead of making only local decisions, our model learns to predict potential antecedents incrementally, selecting globally optimal antecedents in a sequential manner. In the end, our model successfully predicts "她/She" as the result.

Figure 4: Example of case study. Noun phrases with pink background color are the ones selected to be the antecedents by our model.

Related Work

Zero Pronoun Resolution
A wide variety of machine learning techniques for Chinese zero pronoun resolution have been proposed. Zhao and Ng (2007) used a decision tree to learn an anaphoric zero pronoun resolver from syntactic and positional features. This was the first time that machine learning techniques were applied to this task. To better exploit syntax, Kong and Zhou (2010) employed the tree kernel technique in their model. Chen and Ng (2013) further extended the model of Zhao and Ng (2007) by integrating new features and using coreference chains between zero pronouns as bridges to find antecedents. In contrast, unsupervised techniques have also been proposed and shown to be effective. Chen and Ng (2014) proposed an unsupervised model, in which a model trained on manually resolved pronouns was employed for the resolution of zero pronouns. Chen and Ng (2015) proposed an unsupervised anaphoric zero pronoun resolver that uses a salience model. In addition, there has been extensive work on zero anaphora for other languages. Efforts for zero pronoun resolution fall into two major categories, namely, (1) heuristic techniques (Han, 2006); and (2) learning-based models (Isozaki and Hirao, 2003; Iida et al., 2006, 2007; Sasano and Kurohashi, 2011; Iida and Poesio, 2011; Yin, 2015; Iida et al., 2015, 2016).

In recent years, deep learning techniques have been extensively studied for zero pronoun resolution. Chen and Ng (2016) introduced a deep neural network resolver for this task. In their work, zero pronouns and candidates are encoded by a feed-forward neural network. Other work has explored producing pseudo datasets for anaphoric zero pronoun resolution, training a deep learning model with a two-step learning method that overcomes the discrepancy between the generated pseudo dataset and the real one. To better utilize vector-space semantics, Yin et al. (2016) employed recurrent neural networks to encode zero pronouns and antecedents. In particular, a two-layer antecedent encoder was employed to generate the hierarchical representation of antecedents. A deep memory network resolver has also been developed, in which zero pronouns are encoded by their potential antecedent mentions and associated text.
The major difference between our model and existing techniques lies in the application of deep reinforcement learning. In this work, we formulate anaphoric zero pronoun resolution as a sequential decision process in a reinforcement learning setting. With the help of reinforcement learning, our resolver learns to classify mentions in a sequential manner, making globally optimal decisions. Consequently, our model learns to take advantage of earlier predicted antecedents when making later coreference decisions.

Deep Reinforcement Learning
Recent advances in deep reinforcement learning have shown promising results in a variety of natural language processing tasks (Branavan et al., 2012; Narasimhan et al., 2015; Li et al., 2016). Recently, Clark and Manning (2016) proposed a deep reinforcement learning model for coreference resolution, where an agent is used for linking mentions to their potential antecedents. They used the policy gradient algorithm to train the model and achieved better results than the counterpart neural network model. Narasimhan et al. (2016) introduced a deep Q-learning based slot-filling technique, where the agent's action is to retrieve or reconcile content from a new document. Xiong et al. (2017) proposed a reinforcement learning framework for learning multi-hop relational paths. Deep reinforcement learning is a natural choice for tasks that require making incremental decisions. By combining non-linear function approximation with reinforcement learning, the deep reinforcement learning paradigm can integrate vector-space semantics into a robust joint learning and reasoning process. Moreover, by optimizing the policy based on the reward signal, a deep reinforcement learning model relies less on heuristic loss functions that require careful tuning.

Conclusion
We introduce a deep reinforcement learning framework for Chinese zero pronoun resolution. Our model learns a policy for selecting antecedents in a sequential manner, leveraging the information provided by earlier predicted antecedents. This strategy helps in predicting later antecedents and offers a natural view of the task. Experiments on the benchmark dataset show that our reinforcement learning model achieves an F-score of 57.2% on the test dataset, surpassing all existing models by a considerable margin.
In the future, we plan to explore neural network models that resolve anaphoric zero pronouns more effectively, and to study specific components that might influence the performance of the model, such as the embeddings. We also plan to investigate the possibility of applying adversarial learning (Goodfellow et al., 2014) to generate better rewards than human-defined reward functions. In addition, to deal with the scenario in which ground truth parse trees and anaphoric zero pronouns are unavailable, we are interested in exploring a neural network model that integrates anaphoric zero pronoun determination and anaphoric zero pronoun resolution jointly in a hierarchical architecture, without using a parser or an anaphoric zero pronoun detector.