ZPR2: Joint Zero Pronoun Recovery and Resolution using Multi-Task Learning and BERT

Zero pronoun recovery and resolution aim at recovering the dropped pronoun and pointing out its anaphoric mentions, respectively. We propose to better exploit their interaction by solving both tasks together, whereas previous work treats them separately. For zero pronoun resolution, we study the task in a more realistic setting, where no parse trees or only automatic trees are available, whereas most previous work assumes gold trees. Experiments on two benchmarks show that joint modeling significantly outperforms our baseline, which already beats the previous state of the art.


Introduction
A zero pronoun (ZP) is a linguistic phenomenon where a pronoun is dropped for simplicity. Figure 1 shows an example, where two pronouns at positions 1 and 2 are omitted. They both refer to "警方 (The police)" at the beginning of the sentence, and their original form is "他们 (they)". Pronoun dropping occurs in most languages. While it is infrequent in non-pro-drop languages such as English, it is very common in pro-drop languages such as Chinese. In addition, pronouns are dropped more frequently in conversations than in news. Our preliminary statistics on Chinese show that 59.2% of pronouns are dropped in a corpus of casual dialogues, while the figure is just 41.6% in a broadcast news corpus.
In NLP, dropped pronouns can cause loss of important information, such as the subject or object of the central predicate in a sentence, introducing ambiguity to applications such as machine translation (Nakaiwa and Shirai, 1996; Wang et al., 2016; Takeno et al., 2016), question answering (Choi et al., 2018; Reddy et al., 2019; Sun et al., 2019; Chen and Choi, 2016) and dialogue understanding (Chen et al., 2017; Rolih, 2018). As a result, zero pronouns have recently received much research attention (Yin et al., 2018a,b). We study Chinese zero pronouns in dialogue settings.
[ The police ] suspected that this is a criminal case about illegal guns , 1 brought the guns and bags to the city 2 to deal with the case .
Figure 1: A zero pronoun example, shown in English translation, where 1 and 2 are zero pronouns pointing to the span in square brackets.
There are two long-standing tasks: zero pronoun recovery, which aims at recovering the original pronoun (such as "他 (he)" and "她 (she)"), and zero pronoun resolution, where the goal is to pinpoint the mention that each dropped pronoun refers to. Intuitively, the results of the two tasks interact closely. Taking Figure 1 as an example, it is much easier to resolve zero pronoun 1 to "警方 (The police)" rather than to the "criminal case about illegal guns" if we know that it corresponds to "他们 (they)". Similarly, zero pronoun 1 is more likely to be recovered as "他们 (they)" than as any other candidate pronoun if we know that it points to "警方 (The police)". Despite this high correlation, previous work treats them as unrelated tasks and solves them separately with different models. This can waste training resources: each task has a limited number of labeled instances, so data sparsity can limit model performance. Besides, we believe that it is unnecessary to keep a specific model for each task, as they are close enough to be solved together.
In addition, most zero pronoun resolution research (Ng, 2013, 2016; Kong and Zhou, 2010; Iida and Poesio, 2011; Sasano et al., 2008; Yin et al., 2018b; Yang et al., 2019) assumes that gold trees are available together with the positions of zero pronouns, which is unrealistic in practical applications. During decoding, a zero pronoun resolution model has to rely on automatic trees and zero pronoun detection, and thus suffers from error propagation.
In this paper, we propose to jointly solve both tasks under a heterogeneous multi-task learning framework, where each data point only has the annotation of one task, so that the model benefits from the supervised data of both tasks. To improve the robustness of heterogeneous training and introduce more supervision, we add zero pronoun detection, a common sub-task of both ZP resolution and recovery. Zero pronoun detection is a binary classification task that aims to detect whether a word space has a dropped pronoun.
We consider ZP recovery as a sequence labeling task, predicting whether each word space has a dropped pronoun and, if so, what type the pronoun is. ZP resolution is solved as extractive reading comprehension (Rajpurkar et al., 2016), where each word space is taken as a query and its anaphoric mentions are treated as the answers. For non-ZP spaces, where there are no corresponding anaphoric mentions, we assign the sentence beginning (span [0,0]) as the answer.
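The two formulations above can be illustrated with a small sketch of how per-word-space targets might be built. The function name, the "NONE" label for spaces without a dropped pronoun, and the toy sentence are all hypothetical choices for illustration; positions are assumed to be 1-based over words, with position 0 reserved for the sentence beginning.

```python
from typing import Dict, List, Tuple

NO_PRONOUN = "NONE"  # hypothetical label for a word space with no dropped pronoun


def build_targets(
    num_words: int,
    dropped: Dict[int, str],
    antecedents: Dict[int, Tuple[int, int]],
) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Build recovery labels and resolution spans for every word space.

    Word space i is the space just before word i (1-based).
    `dropped` maps a word space to its pronoun type (recovery annotation).
    `antecedents` maps a word space to an inclusive (start, end) span
    (resolution annotation). Non-ZP spaces get the span (0, 0).
    """
    labels: List[str] = []
    spans: List[Tuple[int, int]] = []
    for i in range(1, num_words + 1):
        labels.append(dropped.get(i, NO_PRONOUN))
        spans.append(antecedents.get(i, (0, 0)))
    return labels, spans


# Toy sentence loosely modeled on Figure 1: a pronoun is dropped before "brought".
words = ["The_police", "suspected", "a_gun_case", ",", "brought", "the_guns"]
labels, spans = build_targets(len(words), {5: "they"}, {5: (1, 1)})
print(labels)
print(spans)
```

In the heterogeneous setting described in this paper, a real instance would carry only one of the two annotations; both are shown here on one sentence purely to make the target shapes concrete.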
Experiments on two benchmarks, OntoNotes 5.0 (ZP resolution) and BaiduZhidao (ZP recovery), show that joint modeling gives absolute F1 gains of more than 1.5 points on both tasks over our very strong baselines using BERT (Devlin et al., 2019). Our overall system gives a dramatic improvement of 3.5 F1 points over previous state-of-the-art results on both tasks.

Related work
Previous work considers zero pronoun resolution and recovery separately. For zero pronoun recovery, existing methods can be classified according to the types of annotations they use. One line of work (Yang et al., 2015, 2019) simply relies on human annotations, solving the task as sequence labeling. The other line of work (Chung and Gildea, 2010; Xiang et al., 2013; Wang et al., 2016) mines weak supervision signals from a large bilingual parallel corpus, where the other language is a non-pro-drop language with fewer pronoun drops. The latter requires massive training data and has MT performance as its primary goal, so we follow the first line of research and use human-annotated data.
Rao et al. (2015) studied zero pronoun resolution in multi-turn dialogues, claiming that their model does not rely on parse trees to extract ZP positions and noun phrases as resolution candidates. However, they only consider dropped pronouns that correspond to one of the dialogue participants. As a result, they only explore a small subset of the entire ZP resolution problem, and their task is closer to zero pronoun recovery. Most similar to our work is an approach that casts zero pronoun resolution as a machine reading comprehension task (Rajpurkar et al., 2016) in order to automatically construct a large-scale pseudo dataset for model pretraining. However, its model finetuning and evaluation on benchmark data still rely on human-annotated trees and gold zero pronoun positions. As a result, it is still uncertain what performance a model can achieve without such gold inputs. We address both issues in the joint task.
Our work is inspired by recent advances in heterogeneous multi-task learning using BERT (Devlin et al., 2019), which combines the supervised data of several related tasks to achieve further improvements. In particular, Liu et al. (2019) use this framework to jointly solve the GLUE tasks (Wang et al., 2019). However, their experiments show that multi-task learning does not help across all tasks. Our work follows a similar spirit, and our contribution is mainly on the zero pronoun tasks. In addition, we find that adding a common sub-task (zero pronoun detection in our case), if one is available, makes multi-task learning more robust by providing additional supervision and alleviating annotation variances.

Model
As shown in Figure 2, we model ZP recovery ($f_{rec}$), ZP resolution ($f_{res}$), and the auxiliary ZP detection task ($f_{det}$) with multi-task learning, where BERT (Devlin et al., 2019) is used to represent each input sentence $s_1, \dots, s_N$ of $N$ words to provide shared features.

Zero pronoun recovery
ZP recovery restores any dropped pronouns in an input text. Since pronouns are enumerable (e.g. there are 10 types for Chinese), we cast this task as a classification problem for each word space. Given the shared input representations $h_0, h_1, \dots, h_N$, the probability of recovering pronoun $p_i$ at the space between $s_{i-1}$ and $s_i$ is:
$$p(p_i \mid s_1, \dots, s_N) = \mathrm{softmax}(W_r h_i + b_r)$$
where $W_r$ and $b_r$ are model parameters.
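As a minimal sketch of this per-space classifier, the snippet below assumes that the recovery head is a single linear layer with parameters W_r and b_r followed by a softmax over pronoun types. The three-class label set, the two-dimensional feature vector and all numeric values are made up for illustration and are far smaller than real BERT features.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recovery_distribution(
    h_i: List[float], W_r: List[List[float]], b_r: List[float]
) -> List[float]:
    """Distribution over pronoun types for one word space.

    Assumes a linear layer (W_r, b_r) applied to the feature h_i,
    followed by softmax; one row of W_r per pronoun type.
    """
    logits = [sum(w * x for w, x in zip(row, h_i)) + b for row, b in zip(W_r, b_r)]
    return softmax(logits)


# Toy example: 3 hypothetical classes and a 2-dimensional feature.
classes = ["NONE", "he", "they"]
h_i = [1.0, -1.0]
W_r = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b_r = [0.0, 0.0, 0.0]
probs = recovery_distribution(h_i, W_r, b_r)
predicted = classes[probs.index(max(probs))]
print(predicted, probs)
```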

Zero pronoun resolution
Our zero pronoun resolution task is to predict the span that each dropped pronoun points to when gold ZP positions are not available. One potential solution is to run zero pronoun recovery first and use its predictions, but this introduces error propagation. Instead, we assign the span "(0,0)" to non-ZP positions. This does not introduce conflicts, as position "0" corresponds to the special token [CLS] in the BERT encoding, and thus no real span can be "(0,0)". We cast the resolution task for each word space (such as the one between $s_{i-1}$ and $s_i$) as machine reading comprehension (MRC) (Rajpurkar et al., 2016), where a resolution span corresponds to an MRC target answer. Following previous work on MRC, we separately model the start ($r^{st}_i$) and end ($r^{ed}_i$) positions of each span with self-attention:
$$r^{st}_i = \mathrm{SelfAttn}_{st}(H, h_i), \qquad r^{ed}_i = \mathrm{SelfAttn}_{ed}(H, h_i)$$
where $H = [h_0, \dots, h_N]$ is the concatenation of all word states, and $\mathrm{SelfAttn}_{st}()$ and $\mathrm{SelfAttn}_{ed}()$ are the self-attention modules for predicting the start and end positions of each ZP resolution span. The probability for the whole span $r_i$ is:
$$p(r_i \mid s_1, \dots, s_N) = r^{st}_i[st] \, r^{ed}_i[ed]$$
where $st$ and $ed$ are the start and end positions of $r_i$.
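The decoding step for one word space can be sketched as follows. This assumes, as is common in extractive MRC, that the span score is the product of independent start and end probabilities; the helper name and the toy distributions are hypothetical, and the self-attention modules that would produce these distributions are not modeled here.

```python
from typing import List, Tuple


def best_span(
    start_probs: List[float], end_probs: List[float]
) -> Tuple[Tuple[int, int], float]:
    """Return the highest-scoring span and its score for one word space.

    Position 0 stands for [CLS]. The span (0, 0) means "no zero pronoun here";
    every other candidate must start at a real word (start >= 1) and satisfy
    start <= end.
    """
    best = (0, 0)
    best_score = start_probs[0] * end_probs[0]
    for st in range(1, len(start_probs)):
        for ed in range(st, len(end_probs)):
            score = start_probs[st] * end_probs[ed]
            if score > best_score:
                best, best_score = (st, ed), score
    return best, best_score


# A word space whose antecedent most likely covers words 1..2.
span_a, score_a = best_span([0.1, 0.7, 0.1, 0.1], [0.1, 0.2, 0.6, 0.1])

# A word space where the mass sits on [CLS], i.e. no dropped pronoun.
span_b, score_b = best_span([0.9, 0.05, 0.05], [0.8, 0.1, 0.1])
print(span_a, span_b)
```

Because "(0,0)" competes with real spans in the same search, detection and resolution are decided in one step rather than in a pipeline.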

Auxiliary task: zero pronoun detection
We also introduce zero pronoun detection as an auxiliary task to enhance multi-task training. This task is to determine whether each word space has a dropped pronoun. Similar to zero pronoun recovery, we formulate it as binary classification:
$$p(d_i \mid s_1, \dots, s_N) = \mathrm{softmax}(W_d h_i + b_d)$$
where $d_i$ is the binary detection result, and $W_d$ and $b_d$ are model parameters.

Encoding input with BERT
Given an input sentence $s_1, \dots, s_N$, we use BERT to encode it into a sequence of input features shared across all our tasks. We prepend the [CLS] token to the input before sending it to BERT. Our task features are represented as $h_0, h_1, \dots, h_N$, where $h_0$ corresponds to the [CLS] token.

Training
We train our model on the combined and shuffled data of both tasks to leverage more supervision signals. Each data instance only contains the annotation of either ZP recovery or resolution, thus the loss for one example is defined as:
$$\mathcal{L} = -\sum_{i} \big( \alpha \log p(p_i \mid s_1, \dots, s_N) + \beta \log p(r_i \mid s_1, \dots, s_N) + \gamma \log p(d_i \mid s_1, \dots, s_N) \big)$$
where $\alpha$, $\beta$ and $\gamma$ are the coefficients for the tasks. For $\alpha$ and $\beta$, the value is 1 if the corresponding supervision exists, otherwise it is 0. We empirically set the value of $\gamma$ to 0.1, as the supervision of ZP detection exists for all instances, and we do not want this auxiliary loss signal to be too strong.
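The task weighting can be sketched for a single word space as below. The sketch assumes a negative log-likelihood loss for every task, with the recovery and resolution coefficients switched between 0 and 1 by the availability of annotation and the detection coefficient fixed at 0.1; the function name and the use of None to signal a missing annotation are hypothetical.

```python
import math
from typing import Optional


def word_space_loss(
    p_rec: Optional[float],
    p_res: Optional[float],
    p_det: float,
    gamma: float = 0.1,
) -> float:
    """Weighted negative log-likelihood for one word space.

    p_rec / p_res: model probability of the gold recovery label / gold span,
    or None when the instance carries no annotation for that task.
    p_det: model probability of the gold detection label (always available).
    """
    alpha = 1.0 if p_rec is not None else 0.0  # recovery supervision present?
    beta = 1.0 if p_res is not None else 0.0   # resolution supervision present?
    loss = -gamma * math.log(p_det)
    if alpha:
        loss -= alpha * math.log(p_rec)
    if beta:
        loss -= beta * math.log(p_res)
    return loss


# A recovery-only instance (e.g. from BaiduZhidao): no resolution term.
loss_rec = word_space_loss(p_rec=0.5, p_res=None, p_det=0.9)

# A resolution-only instance (e.g. from OntoNotes 5.0): no recovery term.
loss_res = word_space_loss(p_rec=None, p_res=0.25, p_det=0.9)
print(loss_rec, loss_res)
```

The loss for a whole sentence would sum this quantity over all word spaces.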

Experiments
We study the effectiveness of jointly modeling ZP resolution, recovery and detection.

Data and setting
We take two benchmark datasets: BaiduZhidao, a benchmark for ZP recovery, and OntoNotes 5.0, a benchmark for ZP resolution. For BaiduZhidao, we use the version cleaned by Yang et al. (2019), containing 5504, 1175 and 1178 instances for training, development and testing, respectively. OntoNotes 5.0 has 36487 training and 6083 testing instances, and we hold out 20% of the training instances for development.
We choose the official pretrained Chinese BERT-base model (footnote 2). Models are trained with Adam (Kingma and Ba, 2014) with a learning rate of $10^{-5}$ and a warm-up proportion of 10%. To avoid overfitting, we apply L2 regularization to the BERT parameters with a coefficient of 0.01. Models are selected by early stopping on development results.
Table 1 shows the results for both the resolution and recovery tasks, where ZPMN and NDPR-W are the state-of-the-art systems that do not rely on any gold syntactic information. ZPMN treats zero pronoun resolution as a classification task over noun phrase candidates, and the final result is selected using an attention mechanism. NDPR-W studies zero pronoun recovery in dialogues by modeling all dialogue history.
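The warm-up setting in the training configuration can be sketched as a learning rate schedule. The peak rate is taken to be 1e-5 and the warm-up proportion 10%. The linear decay to zero after warm-up is an assumption (the text does not state what happens after warm-up), and the function name and step counts are hypothetical.

```python
def learning_rate(
    step: int,
    total_steps: int,
    peak_lr: float = 1e-5,
    warmup_proportion: float = 0.1,
) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero (decay is assumed)."""
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        # Warm-up phase: ramp linearly from 0 up to the peak rate.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from the peak rate down to 0.
    remaining = max(total_steps - warmup_steps, 1)
    return peak_lr * max(total_steps - step, 0) / remaining


# With 1000 total steps, the first 100 steps are warm-up.
schedule = [learning_rate(s, 1000) for s in (0, 50, 100, 550, 1000)]
print(schedule)
```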

Main results
For our models, BERT denotes finetuning BERT on only one task, BERT-MTL means jointly finetuning BERT on both tasks with multi-task learning (as shown in Figure 2), and BERT-MTL w/ detection is our model with the auxiliary detection loss. Using BERT already gives much better performance than the previous state-of-the-art results. Plain heterogeneous multi-task learning helps ZP resolution while hurting ZP recovery; one potential reason is that the ZP resolution dataset (OntoNotes 5.0) has many more instances than the ZP recovery dataset (BaiduZhidao). This problem is alleviated by introducing the auxiliary ZP detection task, for the following possible reasons. Most importantly, ZP detection is very close to ZP recovery (binary vs. multi-class), so this extra supervision helps to alleviate the data magnitude imbalance problem. Besides, ZP detection introduces more useful training signals to the overall training process.
Footnote 2: https://github.com/google-research/bert

More analysis on ZP resolution
We also evaluate on other previously studied settings, where gold trees or even gold ZP positions are given. As ZPMN also reported strong performance across these settings, we take this model as a baseline for comparison.
Using gold trees and ZP positions. Since most previous work on ZP resolution uses gold syntactic trees and/or ZP positions, we also investigate our performance under these settings. In particular, we take the noun phrases and/or ZP positions from gold trees to serve as constraints. Besides, our model is only trained on the ZP positions when they are given. Table 2 shows the results, where AttentionZP gives the previous state-of-the-art performance under the Gold Tree + Gold ZP setting. Our model outperforms AttentionZP by a significant margin. Besides, our model also achieves the best performance under the Gold Tree + Auto ZP setting, where only gold trees are available, significantly outperforming the previous best system (ZPMN).
Effectiveness of automatic trees. Currently, our model considers all free spans when making a resolution decision. Using automatic trees can greatly limit the search space, although this could introduce errors. We conduct a preliminary comparison, shown in Table 3, where such a constraint dramatically helps the performance. However, this is based on the assumption that the target-domain syntactic parsing is very accurate, as our ZP resolution data (OntoNotes 5.0) is mostly collected from broadcast news. The F1 score using automatic trees (34.12) is close to the score using gold trees (36.55 in Table 2), which is consistent with this assumption. As a result, we may expect a performance drop for web and biomedical domains, where the parsing accuracies are much lower.
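The effect of a tree constraint on the search space can be sketched as follows. The snippet assumes that the parser output has already been reduced to a list of candidate noun-phrase spans and that a span is scored by the product of its start and end probabilities; the function name and all numbers are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def best_constrained_span(
    start_probs: List[float], end_probs: List[float], np_spans: List[Span]
) -> Span:
    """Pick the best span among noun-phrase candidates plus the non-ZP span (0, 0).

    np_spans holds inclusive (start, end) spans taken from a parse tree.
    Spans that are not noun phrases are never considered, however high they score.
    """
    candidates = [(0, 0)] + list(np_spans)
    return max(candidates, key=lambda s: start_probs[s[0]] * end_probs[s[1]])


start_probs = [0.1, 0.3, 0.6]
end_probs = [0.1, 0.3, 0.6]

# Unconstrained, span (2, 2) would score highest (0.36), but the tree only
# licenses the noun phrase (1, 1), so the constrained decision changes.
chosen = best_constrained_span(start_probs, end_probs, [(1, 1)])
print(chosen)
```

The same mechanism explains the risk noted above: if the parser omits the true antecedent from the candidate list, the model can no longer select it.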

Conclusion
We studied the effectiveness of jointly modeling ZP recovery and resolution using the recently introduced multi-task learning + BERT framework. To alleviate the data magnitude imbalance problem, we introduce ZP detection as a common auxiliary sub-task for extra supervision. Experiments on two benchmarks show that our model is consistently better than previous results under various settings, and that the auxiliary ZP detection sub-task can make the training process more robust.