Chinese Zero Pronoun Resolution with Deep Memory Network

Existing approaches for Chinese zero pronoun resolution typically utilize only syntactical and lexical features while ignoring semantic information. The fundamental reason is that zero pronouns have no descriptive information, which makes it difficult to explicitly capture their semantic similarities with antecedents. Meanwhile, representing zero pronouns is challenging since they are merely gaps that convey no actual content. In this paper, we address this issue by building a deep memory network that is capable of encoding zero pronouns into vector representations with information obtained from their contexts and potential antecedents. Consequently, our resolver takes advantage of semantic information by using these continuous distributed representations. Experiments on the OntoNotes 5.0 dataset show that the proposed memory network substantially outperforms state-of-the-art systems in various experimental settings.


Introduction
A zero pronoun (ZP) is a gap in a sentence, which refers to an entity that supplies the necessary information for interpreting the gap (Zhao and Ng, 2007). A ZP can be either anaphoric if it corefers to one or more preceding noun phrases (antecedents) in the associated text, or non-anaphoric if there are no such noun phrases. Below is an example of ZPs and their antecedents, where "φ" denotes the ZP.
([The police] said that they are more likely to commit suicide, but φ1 could not rule out φ2 the possibility of homicide.) In this example, the ZP "φ1" is an anaphoric ZP that refers to the antecedent "警方/The police", while the ZP "φ2" is non-anaphoric. Unlike overt pronouns, ZPs lack grammatical attributes such as gender and number that have been proven to be essential in pronoun resolution (Chen and Ng, 2014a), which makes ZP resolution a more challenging task than overt pronoun resolution.
Automatic Chinese ZP resolution is typically composed of two steps: anaphoric zero pronoun (AZP) identification, which determines whether a ZP is anaphoric, and AZP resolution, which determines antecedents for AZPs. For AZP identification, state-of-the-art resolvers use machine learning algorithms to build AZP classifiers in a supervised manner (Chen and Ng, 2013, 2016). For AZP resolution, approaches in the literature include unsupervised methods (Chen and Ng, 2014b, 2015), feature-based supervised models (Zhao and Ng, 2007; Kong and Zhou, 2010), and neural network models (Chen and Ng, 2016). Neural network models for AZP resolution are of growing interest for their capacity to learn task-specific representations without extensive feature engineering and to effectively exploit lexical information for ZPs and their candidate antecedents in a more scalable manner than feature-based models.
Despite these advantages, existing supervised approaches (Zhao and Ng, 2007; Chen and Ng, 2013, 2016) for AZP resolution typically utilize only syntactical and lexical information through features. They overlook semantic information, which is regarded as an important factor in the resolution of common noun phrases (Ng, 2007). The fundamental reason is that ZPs have no descriptive information, which makes it difficult to calculate semantic similarity and relatedness scores between ZPs and their antecedents. Therefore, proper representations of ZPs are required in order to take advantage of semantic information when resolving ZPs. However, representing ZPs is challenging because they are merely gaps that convey no actual content.
One straightforward method to address this issue is to represent ZPs with supplemental information provided by available components, such as contexts and candidate antecedents. Motivated by Chen and Ng (2016), who encode a ZP's lexical context by utilizing its preceding word and governing verb, we observe that a ZP's context can help to describe the ZP itself. As an example of its usefulness, given the sentence "φ taste spicy", people may resolve the ZP "φ" to the candidate antecedent "red peppers", but can hardly regard "my shoes" as its antecedent, because they naturally look at the ZP's context "taste spicy" to resolve it ("my shoes" cannot "taste spicy"). Meanwhile, considering that the antecedents of a ZP provide the necessary information for interpreting the gap (ZP), it is natural to express a ZP through its potential antecedents. However, only a subset of the candidate antecedents is needed to represent a ZP 1 . To achieve this goal, a desirable solution should be capable of explicitly capturing the importance of each candidate antecedent and using this information to build up the representation of the ZP.
In this paper, inspired by the recent success of computational models with attention mechanisms and explicit memory (Sukhbaatar et al., 2015; Tang et al., 2016; Kumar et al., 2015), we focus on AZP resolution, proposing the zero pronoun-specific memory network (ZPMN), which represents a ZP with information obtained from its contexts and candidate antecedents. These representations give our system the ability to take advantage of semantic information when resolving ZPs. Our ZPMN consists of multiple computational layers with shared parameters. With the underlying intuition that not all candidate antecedents are equally relevant for representing the ZP, we develop each computational layer as an attention-based model, which first learns the importance of each candidate antecedent and then utilizes this information to calculate the continuous distributed representation of the ZP. The attention weights over candidate antecedents with respect to the ZP's representation obtained by the last layer are regarded as the ZP coreference classification result. Since every component is differentiable, the entire model can be efficiently trained end-to-end with gradient descent.

1 A common way to perform this task is to first extract a set of candidate antecedents, and then select antecedents from the candidate set. Therefore, only those candidates that are possibly the correct antecedent of the given ZP are suitable for interpreting it.
We evaluate our method on the Chinese portion of the OntoNotes 5.0 corpus, comparing with the baseline systems under different experimental settings. Results show that our approach significantly outperforms the baseline algorithms and achieves state-of-the-art performance.

Zero Pronoun-specific Memory Network
We describe our deep memory network approach for AZP resolution in this section. We first give an overview of our model and then describe its components. Finally, we present the training and initialization details.

An Overview of the Method
In this part, we present an overview of the zero pronoun-specific memory network (ZPMN) for AZP resolution. Given an AZP zp, we first extract a set of candidate antecedents. Following Chen and Ng (2016), we regard all and only those maximal or modifier noun phrases (NPs) that precede zp in the associated text and are at most two sentences away from it as its candidate antecedents. Suppose k candidate antecedents are extracted; our task is then to determine the correct antecedent of zp from its candidate antecedent set A(zp) = {c_1, c_2, ..., c_k}. Specifically, these candidate antecedents are represented as vectors {v_{c_1}, v_{c_2}, ..., v_{c_k}}, which are stacked and regarded as the external memory mem ∈ R^{l×k}, where l is the dimension of v_c. Meanwhile, we represent each word as a continuous, real-valued vector, known as a word embedding (Bengio et al., 2003). These word vectors can be randomly initialized, or pre-trained from a text corpus with learning algorithms (Mikolov et al., 2013; Pennington et al., 2014). In this work, we adopt the latter strategy since it better exploits the semantics of words. All the word vectors are stacked in a word embedding matrix L_w ∈ R^{d×|V|}, where d is the dimension of the word vectors and |V| is the size of the word vocabulary. The embedding of word w is denoted e_w ∈ R^{d×1}, the corresponding column of L_w.
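The shapes above can be made concrete with a small sketch; the candidate-antecedent vectors and the embedding matrix below are random stand-ins for the learned quantities, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

l = 4          # dimension of each candidate-antecedent vector v_c
k = 3          # number of candidate antecedents for this AZP
d, V = 5, 10   # word-embedding dimension d and vocabulary size |V|

# Candidate-antecedent vectors v_{c_1}..v_{c_k}; random stand-ins for the
# LSTM-derived encodings described later in this section.
candidate_vecs = [rng.normal(size=l) for _ in range(k)]

# External memory mem in R^{l x k}: candidate vectors stacked as columns.
mem = np.stack(candidate_vecs, axis=1)

# Word-embedding matrix L_w in R^{d x |V|}; the embedding of a word is a column.
L_w = rng.normal(size=(d, V))
w = 7                      # a word id (hypothetical)
e_w = L_w[:, w]            # e_w in R^d
```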
An illustration of the ZPMN is given in Figure 1; it is inspired by the memory network used in question answering (Sukhbaatar et al., 2015). Our model consists of multiple computational layers, each of which contains an attention layer and a linear layer. First, we represent the AZP zp by utilizing its contextual information: we propose the ZP-centered LSTM, which encodes zp into a distributed vector representation (v_zp in Figure 1). We then regard v_zp as the initial representation of zp and feed it as input to the first computational layer (hop 1). In the first computational layer, we calculate an attention weight for each candidate antecedent with respect to the AZP, by which our model adaptively selects important information from the external memory (the candidate antecedents). The output of the attention layer and the linear transformation of v_zp are summed together as the input to the next layer (hop 2).
We stack multiple hops by repeating this process. We call the abstractive information obtained from the external memory the "key extension" of the AZP. Note that the attention and linear layer parameters are shared across hops: regardless of the number of hops the model employs, it uses the same number of parameters. Finally, after going through all the hops, we regard the attention weight of each candidate antecedent with respect to the AZP representation generated by the last hop as the probability that the candidate is the correct antecedent, and predict the highest-scoring (most probable) one as the antecedent of the given AZP.
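A minimal sketch of the multi-hop loop described above, with a simple dot-product attention standing in for the feature-augmented scoring function defined later in the paper; all parameters are random stand-ins:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(1)
l, k, hops = 4, 3, 3

mem = rng.normal(size=(l, k))      # external memory: candidate vectors as columns
v_zp = rng.normal(size=l)          # initial ZP representation (from the ZP-centered LSTM)

# Shared across hops: a single linear transform of the ZP vector.
W_lin = rng.normal(size=(l, l))

alpha = None
for _ in range(hops):
    alpha = softmax(mem.T @ v_zp)          # attention weight per candidate
    key_extension = mem @ alpha            # weighted sum over the memory
    v_zp = W_lin @ v_zp + key_extension    # summed input to the next hop

# Attention weights from the last hop act as coreference probabilities;
# the highest-scoring candidate is predicted as the antecedent.
pred = int(np.argmax(alpha))
```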

Modeling Zero Pronouns by Contexts
A vector representation of the AZP is required when computing the ZPMN. As mentioned above, a ZP contains no actual content, so supplemental information is needed to generate its initial representation. To this end, we develop the ZP-centered LSTM, which encodes an AZP into a vector representation by utilizing its contextual information.
An efficient way to model a variable-length sequence of (context) words is a recurrent neural network (Elman, 1991). A recurrent neural network (RNN) stores the sequence history in a real-valued history vector, which captures information about the whole sequence. The LSTM (Hochreiter and Schmidhuber, 1997) is a classical variation of the RNN that mitigates the vanishing gradient problem. Assuming x = {x_1, x_2, ..., x_n} is an input sequence, each time step t has an input x_t and a hidden state h_t. The internal mechanics of the LSTM are defined by:

i_t = σ(W^(i) x_t + U^(i) h_{t-1} + b^(i))
f_t = σ(W^(f) x_t + U^(f) h_{t-1} + b^(f))
o_t = σ(W^(o) x_t + U^(o) h_{t-1} + b^(o))
c̃_t = tanh(W^(c) x_t + U^(c) h_{t-1} + b^(c))
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t        (1)
h_t = o_t ⊙ tanh(c_t)

where ⊙ is an element-wise product, σ is the logistic sigmoid, and W^(*), U^(*) and b^(*) (for * ∈ {i, f, o, c}) are the parameters of the LSTM network.
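The standard LSTM gate equations can be written out directly; the parameters below are random stand-ins, and the code is an illustrative sketch rather than a training-ready cell:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: gates i, f, o, candidate cell c_hat."""
    W, U, b = params  # dicts keyed by gate name: 'i', 'f', 'o', 'c'
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])
    c_hat = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])
    c_t = f_t * c_prev + i_t * c_hat      # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(2)
d_in, d_h = 5, 4
gates = ('i', 'f', 'o', 'c')
params = ({g: rng.normal(size=(d_h, d_in)) for g in gates},
          {g: rng.normal(size=(d_h, d_h)) for g in gates},
          {g: np.zeros(d_h) for g in gates})

# Run over a length-6 random input sequence.
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(6, d_in)):
    h, c = lstm_step(x_t, h, c, params)
```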
Figure 2: ZP-centered LSTM for encoding the AZP by its context words. w_i denotes the i-th word in the sentence, w_{zp-i} the i-th word before the ZP, and w_{zp+i} the i-th word after the ZP.

Intuitively, the words near an AZP generally carry richer information for expressing it. To better utilize the words surrounding the AZP, we propose the ZP-centered LSTM, built on the traditional LSTM, to encode AZPs. A graphical representation of this model is displayed in Figure 2. Specifically, the ZP-centered LSTM contains two standard LSTM networks: LSTM_p, which encodes the preceding context of the AZP in a left-to-right manner, and LSTM_f, which models the following context in the reverse direction. The ZP-centered LSTM thus models the preceding and following contexts of the AZP separately, so that the words nearest the AZP are processed last and can contribute more to representing the AZP. Afterward, we obtain the representation of the AZP by concatenating the last hidden vectors of LSTM_p and LSTM_f, which summarizes the useful contextual information centered around the AZP. Averaging or summing the last hidden vectors of LSTM_p and LSTM_f could also be attempted as alternatives. We regard this concatenation as the initial vector representation of the AZP and feed it to the first computational layer for the remaining procedures of our system.
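The wiring of the ZP-centered encoder can be sketched as follows; for brevity a plain tanh RNN stands in for the two LSTMs (the point here is the two-directional arrangement around the gap, not the cell), and all inputs and parameters are random stand-ins:

```python
import numpy as np

def rnn_last_hidden(seq, W, U):
    """Run a plain tanh RNN over seq and return the last hidden state.
    (A stand-in for the standard LSTMs used in the paper.)"""
    h = np.zeros(U.shape[0])
    for x in seq:
        h = np.tanh(W @ x + U @ h)
    return h

rng = np.random.default_rng(3)
d, d_h = 5, 4
W_p, U_p = rng.normal(size=(d_h, d)), rng.normal(size=(d_h, d_h))  # "LSTM_p" params
W_f, U_f = rng.normal(size=(d_h, d)), rng.normal(size=(d_h, d_h))  # "LSTM_f" params

# Word embeddings of a 7-word sentence; the ZP gap sits after the 4th word.
sent = rng.normal(size=(7, d))
preceding, following = sent[:4], sent[4:]

# LSTM_p reads the preceding context left-to-right; LSTM_f reads the
# following context right-to-left, so the words adjacent to the ZP are
# processed last in both directions.
h_p = rnn_last_hidden(preceding, W_p, U_p)
h_f = rnn_last_hidden(following[::-1], W_f, U_f)

# Initial ZP representation: concatenation of the two last hidden vectors.
v_zp = np.concatenate([h_p, h_f])
```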

Generating the External Memory
In this subsection, we describe our method for generating the external memory. For a given AZP, a set of noun phrases (NPs) is extracted as its candidate antecedents, and we generate the external memory from these candidates. One way to encode an NP candidate is to use its head word embedding (Chen and Ng, 2016). However, this method has a major drawback: it ignores contextual information that is essential for representing a phrase. Other approaches (Socher et al., 2013; Sun et al., 2015) encode a phrase as the average of the embeddings of the words it contains. We argue that such averaging treats all the words in a phrase equally, which is inaccurate because some words are more informative than others.
A helpful property of the LSTM is that it can keep useful history information in its memory cell, exploiting the input, output and forget gates to decide how to utilize and update the memory of previous information. Given a sequence of words {w_1, w_2, ..., w_n}, previous research (Sutskever et al., 2014) uses the last hidden vector of the LSTM to represent the whole sequence. For word w_t in a sequence, the corresponding hidden vector h_t captures useful information up to and including w_t.

Figure 3: Illustration for modeling a candidate antecedent through its context and content words. Candi represents the candidate antecedent. Supposing the candidate antecedent contains m words, w_{c[j]} denotes its j-th word; w_i is the i-th word in the sentence, and w_{c+1} (w_{c-1}) is the word that appears immediately after (before) the candidate antecedent.
Inspired by this, we propose a novel method to produce representations of the candidate antecedents by utilizing both their contexts and content words. Specifically, we use the subtraction between LSTM hidden vectors to encode the candidate antecedents, as illustrated in Figure 3. Given a candidate antecedent c with m words, two standard LSTM networks encode c in the forward and backward directions, respectively. For the forward LSTM, we extract a sequence of words related to c in a left-to-right manner, i.e., {w_1, w_2, ..., w_{c-1}, w_{c[1]}, ..., w_{c[m]}}, and encode the candidate by subtracting the hidden vector just before the candidate from the hidden vector at its last word; the backward LSTM is applied symmetrically over the reversed sequence. This method enables our model to encode a candidate antecedent with information both outside and inside the phrase, which gives our model access to sentence-level information when modeling the candidate antecedents. In this manner, we generate the vector representations of the candidate antecedents and regard them as the external memory, i.e., mem = {v_{c_1}, v_{c_2}, ..., v_{c_k}}.
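A sketch of the span-difference encoding, under the assumption that the forward (backward) representation of a candidate is the hidden state at its last (first) word minus the hidden state just outside the span; the hidden states here are random stand-ins for real LSTM outputs:

```python
import numpy as np

rng = np.random.default_rng(4)
d_h = 4
n = 8                 # sentence length
start, end = 3, 5     # candidate antecedent spans words start..end (inclusive)

# Precomputed LSTM hidden states: h_fwd[t] summarizes words up to and
# including t (read left-to-right); h_bwd[t] summarizes words from t
# onward (read right-to-left).  Random stand-ins here.
h_fwd = rng.normal(size=(n, d_h))
h_bwd = rng.normal(size=(n, d_h))

# Subtraction between hidden vectors isolates what the candidate span adds:
# forward: state at the span's last word minus state just before the span;
# backward: state at the span's first word minus state just after the span.
fwd_span = h_fwd[end] - h_fwd[start - 1]
bwd_span = h_bwd[start] - h_bwd[end + 1]

# Candidate-antecedent vector v_c: concatenation of the two span encodings.
v_c = np.concatenate([fwd_span, bwd_span])
```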

Attention Mechanism
In this part, we introduce our attention mechanism. This strategy has been widely used in many natural language processing tasks, such as factoid question answering, textual entailment (Rocktäschel et al., 2015) and disfluency detection (Wang et al., 2016). The basic idea of an attention mechanism is that it assigns a weight/importance to each lower-level position when computing an upper-level representation (Bahdanau et al., 2015). With the underlying intuition that not all candidate antecedents are equally relevant for representing the AZP, we employ the attention mechanism to dynamically select the more informative candidate antecedents from the external memory mem = {v_{c_1}, v_{c_2}, ..., v_{c_k}} with regard to the given AZP, and use them to build up the representation of the AZP.
As shown in Chen and Ng (2016), traditional hand-crafted features are crucial for a resolver's success since they capture the syntactic, positional and other relationships between an AZP and its candidate antecedents. Therefore, to evaluate the importance of each candidate antecedent in a comprehensive manner, following Chen and Ng (2016), who encode hand-crafted features as inputs to their network, we integrate the set of features used in Chen and Ng (2016), in the form of a vector v^(feature), into our attention model. Each multi-valued feature is converted into a corresponding set of binary-valued features 2 . Specifically, for the t-th candidate antecedent in the memory, v_{c_t}, taking the vector representation of the AZP, v_zp, and the corresponding feature vector v^(feature)_t as inputs, we compute the attention score as

α_t = exp(g_t) / Σ_{j=1}^{k} exp(g_j),  where g_t = G(v_{c_t}, v_zp, v^(feature)_t).

The scoring function G is defined by:

G(v_{c_t}, v_zp, v^(feature)_t) = tanh(W^(att) [v_{c_t}; v_zp; v^(feature)_t] + b^(att))

where W^(att) and b^(att) are the attention parameters and k indicates the number of candidate antecedents. After obtaining the attention weights for all the candidate antecedents, {α_1, α_2, ..., α_k}, our attention layer outputs a continuous vector vec computed as the weighted sum of the pieces of memory in mem:

vec = Σ_{t=1}^{k} α_t v_{c_t}
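A sketch of this attention layer, assuming the concatenation [v_{c_t}; v_zp; v^(feature)_t] as input to the scoring function; the feature vectors and parameters are random stand-ins:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(5)
l, k, d_f = 4, 3, 6      # candidate dim, number of candidates, feature dim

mem = rng.normal(size=(l, k))        # candidate vectors v_{c_t} as columns
v_zp = rng.normal(size=l)            # current ZP representation
feats = rng.normal(size=(d_f, k))    # per-candidate feature vectors (stand-ins)

# Attention parameters: one scoring row applied to [v_{c_t}; v_zp; v_feature_t].
W_att = rng.normal(size=(1, l + l + d_f))
b_att = 0.0

scores = np.empty(k)
for t in range(k):
    x = np.concatenate([mem[:, t], v_zp, feats[:, t]])
    scores[t] = np.tanh((W_att @ x)[0] + b_att)   # g_t = G(v_{c_t}, v_zp, v_feature_t)

alpha = softmax(scores)              # attention weight per candidate
vec = mem @ alpha                    # weighted sum of the memory
```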

Training Details
We initialize our word embeddings with 100-dimensional vectors produced by the word2vec toolkit (Mikolov et al., 2013) on the Chinese portion of the training data from the OntoNotes 5.0 corpus. We randomly initialize the parameters from a uniform distribution U(-0.03, 0.03) and minimize the training objective using stochastic gradient descent with a learning rate of 0.01. In addition, to regularize the network, we apply L2 regularization to the network weights and dropout with a rate of 0.5 on the output of each hidden layer.
The model is trained in a supervised manner by minimizing the cross-entropy error of ZP coreference classification. Suppose the training set contains N AZPs {zp_1, zp_2, ..., zp_N}. Let A(zp_i) denote the set of candidate antecedents of AZP zp_i, and let P(c|zp_i) be the probability of predicting candidate c as the antecedent of zp_i (i.e., the attention weight of candidate antecedent c with respect to the AZP representation generated by the last hop). The loss is given by:

loss = - Σ_{i=1}^{N} Σ_{c ∈ A(zp_i)} δ(zp_i, c) · log P(c|zp_i)

where δ(zp, c) is 1 or 0, indicating whether zp and c are coreferent.
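On a toy batch of two AZPs with made-up probabilities, the loss reduces to a sum of negative log-probabilities of the gold antecedents:

```python
import numpy as np

# Toy batch: N = 2 AZPs, each with its candidate set and predicted
# probabilities P(c | zp_i) (the last-hop attention weights).
probs = [np.array([0.7, 0.2, 0.1]),        # candidates of zp_1
         np.array([0.1, 0.6, 0.3])]        # candidates of zp_2
gold = [np.array([1, 0, 0]),               # delta(zp_1, c): c_1 is the antecedent
        np.array([0, 1, 0])]               # delta(zp_2, c): c_2 is the antecedent

# loss = - sum_i sum_{c in A(zp_i)} delta(zp_i, c) * log P(c | zp_i)
loss = -sum(float(g @ np.log(p)) for g, p in zip(gold, probs))
# Only the gold entries contribute: loss = -(log 0.7 + log 0.6)
```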

Experimental Setup
Datasets: Following Chen and Ng (2015, 2016), we run experiments on the Chinese portion of the OntoNotes Release 5.0 dataset 3 used in the CoNLL 2012 Shared Task (Pradhan et al., 2012). The dataset consists of three parts: a training set, a development set and a test set. Since only the training set and the development set contain ZP coreference annotations, we train our model on the training set and use the development set for testing. Meanwhile, we reserve 20% of the training set as a held-out development set for tuning the hyperparameters of our network. The same data setting is used in the baseline system (Chen and Ng, 2016). Table 1 shows the statistics of our corpus. The documents in the dataset come from six sources: broadcast news (BN), newswires (NW), broadcast conversations (BC), telephone conversations (TC), web blogs (WB) and magazines (MZ).

Evaluation metrics: As in previous studies on Chinese ZP resolution (Zhao and Ng, 2007; Chen and Ng, 2016), we use three metrics to evaluate the quality of our model: recall, precision and F-score (denoted R, P and F, respectively).
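For a system that resolves a subset of the gold AZPs, the three metrics reduce to simple count ratios; the counts below are hypothetical, for illustration only:

```python
# Hypothetical counts for illustration.
hits = 80        # AZPs resolved to a correct antecedent
sys_total = 120  # AZPs the system attempted to resolve
gold_total = 150 # gold anaphoric ZPs

precision = hits / sys_total                              # P
recall = hits / gold_total                                # R
f_score = 2 * precision * recall / (precision + recall)   # F
```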
3 http://catalog.ldc.upenn.edu/LDC2013T19

Experimental settings: We employ three Chinese ZP resolution systems as baselines: Zhao and Ng (2007) and Chen and Ng (2015, 2016). Consistent with Chen and Ng (2015, 2016), three experimental settings are designed to evaluate our approach. In Setting 1, we directly employ the gold syntactic parse trees and gold AZPs obtained from the OntoNotes dataset. In Setting 2, we use gold syntactic parse trees and system (automatically identified) AZPs 4 . In Setting 3, we employ system AZPs and system syntactic parse trees obtained with the Berkeley parser 5 , a state-of-the-art parsing model. Table 2 shows the experimental results of the baseline systems and our model on the entire test set. Our approach is abbreviated ZPMN (k), where k indicates the number of hops. The best method in each of the three experimental settings is in bold. From Table 2, we observe that our approach outperforms all previous baseline systems by a substantial margin. Meanwhile, among our models from a single hop to six hops, using more computational layers generally leads to better performance. The best performance is achieved by the model with six hops under Settings 1 and 2, and with four hops under Setting 3. Furthermore, the ZPMN (with six hops) significantly outperforms the state-of-the-art baseline system (Chen and Ng, 2016) under the three experimental settings by 2.7%, 2.7%, and 3.9% in overall F-score 6 , respectively. In sum, our model substantially outperforms the baseline methods, which demonstrates the effectiveness of the proposed zero pronoun-specific memory network. It is well accepted that computational models composed of multiple processing layers can learn representations of data with multiple levels of abstraction (LeCun et al., 2015).
In our approach, multiple computational layers allow the model to learn representations of AZPs with multiple levels of abstraction generated from the candidate antecedents. Each layer/hop retrieves important candidate antecedents and transforms the representation at the previous level into a representation at a higher, slightly more abstract level. We regard this representation as the "key extension" of the AZP, by which our model learns to encode the AZP in an efficient manner. For per-source results, we compare the ZPMN (with six hops) with the state-of-the-art baseline system (Chen and Ng, 2016) on the six sources of test data, as shown in Table 3. The rows in Table 3 are the results for the different sources under the three experimental settings. In Settings 1 and 3, the ZPMN improves results across all six sources of data. Under Setting 2, our model outperforms the baseline system on five of the six sources, only slightly underperforming on TC. These results show that our approach achieves a considerable improvement in Chinese ZP resolution.

Experimental Results
Moreover, to evaluate the effectiveness of the methods for modeling the AZP and candidate antecedents proposed in Sections 2.2 and 2.3, we compare against three simplified versions of the ZPMN: ZPContextFree, where an AZP is initially represented by its governing verb and preceding word; AntContentAvg, where the candidate antecedents are encoded as their averaged content word embeddings; and AntContentHead, where each candidate antecedent is represented by the embedding of its head word. To make the comparison as fair as possible, we keep the other parts of these models unchanged from the ZPMN with six computational layers (hop 6). To minimize external influences, we run the experiments under Setting 1 (gold parse trees and gold AZPs). Table 4 shows the results. Given the intuition that the context of an AZP provides richer information than only a few specific words, the performance of ZPContextFree is unsurprisingly worse than that of the ZPMN, which reflects the effect of the ZP-centered LSTM proposed to generate the initial representation of the AZP. In addition, the performance of AntContentAvg is relatively low. We attribute this to the model assigning the same importance to all the content words in a phrase, which makes it difficult to capture the informative words in a candidate antecedent. Meanwhile, AntContentHead models only limited information when encoding candidate antecedents and thereby underperforms the ZPMN, whose external memory contains sentence-level information both outside and inside the candidate antecedents. These results demonstrate the utility of our method for modeling candidate antecedents.

Zero Pronoun Resolution
Chinese zero pronoun resolution. Early studies utilize heuristic rules to resolve ZPs in Chinese (Converse, 2006; Yeh and Chen, 2007). More recently, supervised approaches have been widely explored. Zhao and Ng (2007) first present a machine learning approach to identify and resolve ZPs; employing the J48 decision tree algorithm, various kinds of features are integrated into their model. Kong and Zhou (2010) develop a kernel-based approach, employing context-sensitive convolution tree kernels to model syntactic information. Chen and Ng (2013) further extend the study of Zhao and Ng (2007) by proposing several novel features and introducing coreference links between ZPs. Despite the effectiveness of feature engineering, it is labor-intensive and relies heavily on annotated corpora. To address these weaknesses, Chen and Ng (2014b) propose an unsupervised method: they first recover each ZP into ten overt pronouns and then apply a ranking model to rank the antecedents. Chen and Ng (2015) propose an end-to-end unsupervised probabilistic model, utilizing a salience model to capture discourse information. Most recently, Chen and Ng (2016) develop a deep neural network approach to learn useful task-specific representations and effectively exploit lexical features through word embeddings. Different from previous studies, in this work we propose a novel memory network to perform the task. By encoding ZPs and candidate antecedents through the composition of texts based on the representations of words, our model benefits from semantic information when resolving ZPs.
Zero pronoun resolution for other languages. There have been various studies on ZP resolution for languages other than Chinese. Ferrández and Peral (2000) propose a set of hand-crafted rules for resolving ZPs in Spanish texts. Recently, supervised approaches have been widely exploited for ZP resolution in Korean (Han, 2006), Italian (Iida and Poesio, 2011) and Japanese (Isozaki and Hirao, 2003; Iida et al., 2006, 2007; Imamura et al., 2009; Sasano and Kurohashi, 2011; Iida and Poesio, 2011; Iida et al., 2015). Iida et al. (2016) propose a multi-column convolutional neural network for Japanese intra-sentential subject zero anaphora resolution, where both the surface word sequence and the dependency tree of a target sentence are exploited as clues in their model.

Attention and Memory Network
Attention mechanisms have been widely used in many studies and have achieved promising performance on a variety of NLP tasks (Rocktäschel et al., 2015; Rush et al., 2015). Recently, the memory network was proposed and applied to the question answering task (Weston et al., 2014); it is defined to have four components: input (I), generalization (G), output (O) and response (R). Since then, memory networks have been adopted in many other NLP tasks, such as aspect sentiment classification (Tang et al., 2016), dialog systems (Dodge et al., 2015), and information extraction (Xiaocheng et al., 2017).

Conclusion
In this study, we propose a novel zero pronoun-specific memory network that is capable of encoding zero pronouns into vector representations with supplemental information obtained from their contexts and candidate antecedents. Consequently, these continuous distributed vectors give our model the ability to take advantage of semantic information when resolving zero pronouns. We evaluate our method on the Chinese portion of the OntoNotes 5.0 dataset and report substantial improvements over state-of-the-art systems in various experimental settings.