Transformer-GCRF: Recovering Chinese Dropped Pronouns with General Conditional Random Fields

Pronouns are often dropped in Chinese conversations, and recovering the dropped pronouns is important for NLP applications such as Machine Translation. Existing approaches usually formulate this as a sequence labeling task of predicting whether there is a dropped pronoun before each token and, if so, its type. Each utterance is considered a sequence and labeled independently. Although these approaches have shown promise, labeling each utterance independently ignores the dependencies between pronouns in neighboring utterances. Modeling these dependencies is critical to improving the performance of dropped pronoun recovery. In this paper, we present a novel framework that combines the strength of the Transformer network with General Conditional Random Fields (GCRF) to model the dependencies between pronouns in neighboring utterances. Results on three Chinese conversation datasets show that the Transformer-GCRF model outperforms state-of-the-art dropped pronoun recovery models. Exploratory analysis also demonstrates that the GCRF helps to capture the dependencies between pronouns in neighboring utterances, thus contributing to the performance improvements.


Introduction
In pro-drop languages such as Chinese, pronouns can be dropped when the identity of the pronoun can be inferred from the context, and this happens more frequently in conversations (Yang et al., 2015). Recovering dropped pronouns (DPs) is a critical task for many NLP applications such as Machine Translation, where the dropped pronouns need to be translated explicitly in the target language (Wang et al., 2016a,b, 2018). Recovering dropped pronouns is different from traditional pronoun resolution tasks (Zhao and Ng, 2007; Yin et al., 2017, 2018), which aim to resolve anaphoric pronouns to their antecedents. In dropped pronoun recovery, we consider both anaphoric and non-anaphoric pronouns, and we do not directly resolve the dropped pronoun to its antecedent, which is infeasible for non-anaphoric pronouns. We recover the dropped pronoun as one of the 17 types of pronouns pre-defined in (Yang et al., 2015), which include five types of abstract pronouns corresponding to non-anaphoric pronouns. Thus traditional rule-based pronoun resolution methods are not suitable for recovering dropped pronouns. Existing approaches formulate dropped pronoun recovery as a sequence labeling task of predicting whether a pronoun has been dropped before each token and the type of the dropped pronoun. For example, Yang et al. (2015) first studied this problem in SMS data and utilized a Maximum Entropy classifier to recover dropped pronouns. Deep neural networks such as Multi-Layer Perceptrons (MLPs) and structured attention networks have also been used to tackle this problem (Zhang et al., 2016). Giannella et al. (2017) used a linear-chain CRF to model the dependency between the sequence of predictions in an utterance.
Although these models have achieved various degrees of success, they all assume that each utterance in a conversation should be labeled independently. This practice overlooks the dependencies between dropped pronouns in neighboring utterances, and results in sequences of predicted dropped pronouns that are incompatible with one another. We illustrate this problem with an example in Figure 1, in which the dropped pronouns are shown in brackets. A pronoun can be dropped as a subject at the beginning of an utterance, or as an object in the middle of an utterance. Pronouns dropped at the beginning of consecutive utterances usually have strong dependencies that pattern with the three types of dialogue transitions (i.e., Reply, Expansion, and Acknowledgment) presented in (Xue et al., 2016). For example, in Figure 1, since the pronoun in the second utterance B1 is "我 (I)" and B2 is an expansion of B1 by the same speaker, the dropped pronoun in the third utterance B2 should also be "我 (I)". Thus modeling the dependency between pronouns in adjacent utterances is helpful for recovering pronouns dropped at utterance-initial positions. In contrast, the pronoun "他 (him)" dropped as an object in utterance B4 should be recovered by capturing referent semantics from the context and modeling token dependencies within the same utterance.
To model the dependencies between predictions in the conversation snippet, we propose a novel framework called Transformer-GCRF that combines the strength of the Transformer model (Vaswani et al., 2017) in representation learning with the capacity of general Conditional Random Fields (GCRF) to model the dependencies between predictions. In the GCRF, a vertical chain is designed to capture the pronoun dependencies between neighboring utterances, and horizontal chains are used to model the prediction dependencies inside each utterance. In this way, Transformer-GCRF models the cross-utterance pronoun dependencies and the intra-utterance prediction dependencies simultaneously. Experimental results on three conversation datasets show that Transformer-GCRF significantly outperforms state-of-the-art recovery models. We also conduct ablative experiments demonstrating that the performance improvement of our Transformer-GCRF model derives both from the Transformer encoder and from the ability of the GCRF layer to model the dependencies between dropped pronouns in neighboring utterances. All code is available at https://github.com/ningningyang/Transformer-GCRF.
The major contributions of this paper are summarized as follows:
• We conduct a statistical study of pronouns dropped at the beginning of consecutive utterances in conversational corpora, and observe that modeling the dependencies between pronouns in neighboring utterances is important for improving the performance of dropped pronoun recovery.
• We propose a novel Transformer-GCRF approach to model both intra-utterance dependencies between predictions in an utterance and cross-utterance dependencies between dropped pronouns in neighboring utterances. The model jointly predicts all dropped pronouns in an entire conversation snippet.
• We apply the Transformer-GCRF model to three conversation datasets. Results show that Transformer-GCRF outperforms the baseline models on all datasets. Exploratory experiments also show that the improvement is attributable to the capacity of the model to capture cross-utterance dependencies.
Related Work

Dropped pronoun recovery

As pronouns are frequently dropped in informal genres, Yang et al. (2015) first introduced dropped pronoun recovery as an independent task and used a Maximum Entropy classifier to recover DPs in text messages. Giannella et al. (2017) employed a linear-chain CRF to jointly predict the position, person, and number of the dropped pronouns in a single utterance, exploiting the sequential nature of this problem. With the powerful representation capability of neural networks (Xu et al., 2020), Zhang et al. (2016) introduced an MLP neural network to recover dropped pronouns based on the concatenation of word embeddings within a fixed-length window. Subsequent work proposed a neural network with structured attention to model the interaction between dropped pronouns and their referents using both sentence-level and word-level context, again predicting each dropped pronoun independently, and further incorporated specific external knowledge to identify the referent more accurately. None of these methods consider the dependencies between pronouns in neighboring utterances.

[Figure 2: Overview of Transformer-GCRF. Labels in the figure: input conversation snippet, MLP layer, GCRF layer, output predictions, emission score, horizontal-chain transition, vertical-chain transition.]

Zero pronoun resolution
Zero pronoun resolution (Zhao and Ng, 2007; Kong and Zhou, 2010; Chen and Ng, 2016; Yin et al., 2017, 2018) is a line of research closely related to dropped pronoun recovery. The difference between the two tasks is that zero pronoun resolution focuses on resolving anaphoric pronouns to their antecedents, assuming the position of the dropped pronoun is already known. In dropped pronoun recovery, by contrast, we consider both anaphoric and non-anaphoric pronouns, and attempt to recover the type of the dropped pronoun rather than its referent. Su et al. (2019) also presented a new utterance rewriting task that improves multi-turn dialogue modeling by recovering missing information through coreference.

Conditional random fields
Conditional Random Fields (CRFs) are commonly used in sequence labeling. A CRF models the conditional probability of a label sequence given a corresponding sequence of observations. Lafferty et al. (2001) made a first-order Markov assumption among labels and proposed a linear-chain structure that can be decoded efficiently with the Viterbi algorithm. Sutton et al. (2004) introduced dynamic CRFs to model the interactions between two tasks and jointly solve them when they are conditioned on the same observation. Zhu et al. (2005) introduced two-dimensional CRFs to model the dependency between neighborhoods on a 2D grid to extract object information from the web. Sutton et al. (2012) also explored how to generalize linear-chain CRFs to general graphs. CRFs have also been combined with powerful neural networks to tackle sequence labeling problems in NLP tasks such as POS tagging and Named Entity Recognition (NER) (Lample et al., 2016; Ma and Hovy, 2016), but existing research has not explored how to combine deep neural networks with general CRFs.

Our Approach: Transformer-GCRF
We start by formalizing the dropped pronoun recovery task as follows. Given a Chinese conversation snippet X = (x_1, ..., x_n) which consists of n pro-drop utterances, where the i-th utterance x_i = (x_{i,1}, ..., x_{i,m_i}) is a sequence of m_i tokens, and additionally given a set of k possible labels Y = {y_1, ..., y_{k-1}} ∪ {None}, where each y_j corresponds to a pre-defined pronoun (Yang et al., 2015) and 'None' means no pronoun is dropped, the goal of our task is to assign a label y ∈ Y to each token in X to indicate whether a pronoun is dropped before this token and, if so, the type of the pronoun. We model this task as the problem of maximizing the conditional probability p(Y|X), where Y is the label sequence assigned to the tokens in X. The conditional probability of a label assignment Y given the whole conversation snippet X can be written as:

p(Y|X) = exp(s(X, Y)) / Σ_{Y' ∈ Y_X} exp(s(X, Y')),

where s(X, Y) denotes the score of the sequences of predictions in the conversation snippet. The denominator is known as the partition function, and Y_X contains all possible tag sequences for the conversation snippet X.
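As a toy illustration (our own sketch, not the authors' released code), the conditional probability above can be computed by brute-force enumeration of all label sequences for a tiny example; the scores and the simple additive scoring function below are made-up stand-ins for s(X, Y):

```python
# Brute-force CRF probability: p(Y|X) = exp(s) / partition function.
# Emission and transition scores here are arbitrary illustrative numbers.
import itertools
import math

def sequence_score(labels, emission, transition):
    """s(X, Y): sum of per-token emission scores plus pairwise transitions."""
    score = sum(emission[t][y] for t, y in enumerate(labels))
    score += sum(transition[a][b] for a, b in zip(labels, labels[1:]))
    return score

def conditional_probability(labels, emission, transition):
    num_tags = len(emission[0])
    n = len(emission)
    target = math.exp(sequence_score(labels, emission, transition))
    partition = sum(                         # sum over all tag sequences Y_X
        math.exp(sequence_score(seq, emission, transition))
        for seq in itertools.product(range(num_tags), repeat=n)
    )
    return target / partition

# Three tokens, two tags (e.g., 'None' vs. one dropped-pronoun type).
emission = [[1.0, 0.2], [0.1, 1.5], [0.9, 0.3]]
transition = [[0.5, -0.5], [-0.5, 0.5]]
p = conditional_probability([0, 1, 0], emission, transition)
print(round(p, 4))
```

Enumeration is exponential in the sequence length, which is why the actual model relies on Viterbi-style dynamic programming instead.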

Overview of Transformer-GCRF
We score each pair (X, Y) with our proposed Transformer-GCRF, as shown in Figure 2. When pre-processing the inputs, we attach a context to each pro-drop utterance x_n in the snippet X. The context C_n = {x_{n-5}, ..., x_{n-1}, x_{n+1}, x_{n+2}} consists of the previous five utterances as well as the next two utterances, following prior practice, and provides referent-related contextual information to help recover the dropped pronouns. The representation layer uses the Transformer structure to encode the context C_n and generates representations for the tokens in utterance x_n from the decoder. The prediction layer then utilizes a generalized CRF to model the cross-utterance and intra-utterance dependencies between the predictions in the conversation snippet, and outputs the predicted label sequence for the tokens in the snippet.
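The context-window construction can be sketched as follows (a minimal illustration; the function name and the choice to simply skip out-of-range indices at snippet boundaries are our assumptions):

```python
# Attach a context C_n of up to five preceding and two following utterances
# to the pro-drop utterance at index n; boundary indices are clipped.
def build_context(snippet, n, before=5, after=2):
    prev = snippet[max(0, n - before):n]
    nxt = snippet[n + 1:n + 1 + after]
    return prev + nxt

snippet = [f"utt{i}" for i in range(10)]
context = build_context(snippet, 6)
print(context)  # utterances 1-5 and 7-8
```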

Representation layer
We employ the encoder-decoder structure of the Transformer (Vaswani et al., 2017) to generate representations for the tokens in the pro-drop utterance x_i and the context C_i separately.

Context encoder
The context encoder first unfolds all tokens in the context C_i into a linear sequence (x_{i-5,1}, x_{i-5,2}, ..., x_{i+2,m_{i+2}}), and then inserts the delimiter '[SEP]' between each pair of utterances. Following the Transformer model (Vaswani et al., 2017), the input embedding of each token x_{k,l} is the sum of its word embedding WE(x_{k,l}), position embedding POE(x_{k,l}), and speaker embedding PAE(x_{k,l}):

E(x_{k,l}) = WE(x_{k,l}) + POE(x_{k,l}) + PAE(x_{k,l}).

The token embeddings E(x_{k,l}) are then fed into the encoder, which is a stack of L encoding blocks. Each block contains two sub-layers (i.e., a self-attention layer and a feed-forward layer):

H~^(l) = SelfATT(H^(l-1)),  H^(l) = FFN(H~^(l)),  for l = 1, ..., L,

where FFN and SelfATT denote the feed-forward and self-attention networks, respectively. The self-attention layer first projects its input H^(l-1) into query, key, and value matrices. A multi-head attention mechanism is then applied to these three matrices to encode the input tokens in the context.
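The input embedding sum can be sketched as below (an illustration only: the lookup tables are random stand-ins, not trained parameters, and the dimensions are arbitrary):

```python
# E(x) = WE(x) + POE(x) + PAE(x): element-wise sum of word, position,
# and speaker embeddings, each drawn from its own lookup table.
import random

DIM = 4
random.seed(0)

def table(rows):
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(rows)]

word_emb, pos_emb, speaker_emb = table(100), table(50), table(2)

def token_embedding(word_id, position, speaker_id):
    return [w + p + s for w, p, s in zip(
        word_emb[word_id], pos_emb[position], speaker_emb[speaker_id])]

e = token_embedding(word_id=7, position=0, speaker_id=1)
print(len(e))
```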

Utterance decoder
To generate representations for the tokens in the pro-drop utterance x_i and exploit referent information from its context C_i, we utilize the decoder component of the Transformer to represent x_i. Similar to the context encoder, the inputs to the utterance decoder are the embeddings of the tokens, where each embedding E(x_{i,j}) is again the sum of its word embedding, position embedding, and speaker embedding. The input to the decoder, denoted as S^(0), is the concatenation of all the token embeddings. The decoder is again a stack of L decoding blocks. Each decoding block Dec(·) contains three sub-layers (i.e., a self-attention layer, an interaction attention layer that attends over the encoder output H^(L), and a feed-forward layer):

S~^(l) = SelfATT(S^(l-1)),  S¯^(l) = InterATT(S~^(l), H^(L)),  S^(l) = FFN(S¯^(l)),

for l = 1, ..., L, where FFN is a feed-forward network and SelfATT is a self-attention network.
Finally, the output states of the decoder S^(L) are transformed into logits through a two-layer MLP network:

P = MLP(S^(L)),

where the logits matrix P of size n × m × k is fed into the subsequent prediction layer. Here k is the number of distinct tags, and each element P_{i,j,l} is the emission score of the l-th tag for the j-th word in the i-th utterance.
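The final projection can be illustrated as follows (a sketch with made-up weights; the hidden size and the tanh activation are our assumptions, not details given in the paper):

```python
# A two-layer MLP mapping one decoder state to k emission scores, one per tag.
import math
import random

random.seed(1)
HIDDEN, TAGS, DIM = 8, 17, 4

def linear(x, W, b):
    return [sum(xi * wij for xi, wij in zip(x, row)) + bj
            for row, bj in zip(W, b)]

W1 = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
W2 = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(TAGS)]
b2 = [0.0] * TAGS

def emission_logits(state):
    hidden = [math.tanh(h) for h in linear(state, W1, b1)]
    return linear(hidden, W2, b2)  # one emission score per tag

state = [0.3, -0.2, 0.5, 0.1]      # a single decoder output state S^(L)_{i,j}
logits = emission_logits(state)
print(len(logits))
```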

GCRF layer
We utilize an elaborately designed general conditional random field (GCRF) layer to recover dropped pronouns by modeling cross-utterance and intra-utterance dependencies between them.
[Figure 3: Step 1 constructs an initial graph; the tokens in each utterance are shown, and the nodes corresponding to the first token in each utterance are highlighted in red. Step 2-1 processes an OVP (in the second utterance) and adds an observed (shaded) node for the token "我 (I)"; step 2-2 processes an interjection (in the third utterance) and skips the node corresponding to the token "哈哈 (Aha)".]

Graph construction in GCRF
Given a conversation snippet, a graph is constructed where each node, corresponding to a token, is a random variable y that represents the type of the pronoun defined in Y. The edges in the graph are defined by the following two steps: Step 1: Initial graph construction: We first split each compound utterance into several simple utterances by punctuation, and connect the nodes corresponding to the tokens in the same simple utterance with horizontal edges to model intra-utterance dependencies. Then we link the first tokens in consecutive utterances with a vertical chain to model the cross-utterance dependencies.
Step 1 in Figure 3 shows an initial graph for a conversation snippet.
Step 2: Vertical edge refinement: Although the vertical chain constructed in Step 1 captures most of the cross-utterance dependencies, it can be further refined to handle the following two common cases in conversation: • Overt pronouns (OVP): If an OVP appears as the first token in an utterance, there is clearly a dependency between the OVP and the dropped pronouns in neighboring utterances. To model this phenomenon, an observed node (with the value of its pronoun type) is inserted into the graph, and the vertical chain linked to the original node is moved to this new node.
Step 2-1 in Figure 3 shows the refined graph after OVPs are processed. • Interjections: If the first token in an utterance is an interjection (e.g., "嗯/ Well", "哈 哈/ Aha" etc.), it is better to skip the utterance in the vertical chain because the short utterance consisting of only interjections and punctuation does not provide useful information about the dependencies between pronouns.
Step 2-2 in Figure 3 shows the refined graph after interjections are processed.
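The vertical-chain construction and refinement can be sketched as follows (our own reading of the two steps above, not the authors' released code; the pronoun and interjection lists are illustrative stand-ins):

```python
# Build the vertical chain: one node per utterance-initial token.
# Step 2-1: an overt pronoun (OVP) becomes an observed node.
# Step 2-2: an utterance starting with an interjection is skipped.
OVERT_PRONOUNS = {"我", "你", "他", "她", "它"}
INTERJECTIONS = {"嗯", "哈哈", "呵呵"}

def vertical_chain(utterances):
    """Return (utterance index, first token, observed?) for each chain node."""
    chain = []
    for i, tokens in enumerate(utterances):
        first = tokens[0]
        if first in INTERJECTIONS:          # Step 2-2: skip interjections
            continue
        observed = first in OVERT_PRONOUNS  # Step 2-1: OVP -> observed node
        chain.append((i, first, observed))
    return chain

utterances = [["你", "吃", "了", "吗"], ["哈哈"],
              ["我", "吃", "了"], ["吃", "的", "什么"]]
print(vertical_chain(utterances))
```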

Pronoun prediction
The GCRF is a special case of 2D CRFs. To predict the labels of the nodes, following the practices in (Zhu et al., 2005), we employ a modified Viterbi algorithm in which the nodes in the vertical chain are decoded first. Specifically, the constructed graph consists of two types of cliques: one from the horizontal chains and the other from the vertical chain. Given the emission score matrix P output by the decoder layer (see Section 3.2.1), the joint score s(X, Y) of the predictions is computed by first summing the scores of the horizontal chains and then adding the scores of the transitions in the vertical chain:

s(X, Y) = Σ_i ( Σ_j P_{i,j,y_{i,j}} + Σ_j A^(1)_{y_{i,j}, y_{i,j+1}} ) + Σ_i A^(2)_{T_i, T_{i+1}},    (3)

where A^(1) and A^(2) are the transition matrices of the horizontal chains and the vertical chain, respectively; A_{i,j} indicates the transition score from tag i to tag j; and the node T_i is defined as T_i = y_OVP if the first node of utterance i is an observed OVP node, and T_i = y_{i,1} otherwise, where y_OVP ∈ Y is the observed label corresponding to the specific OVP. The first term in Eq. (3) is the score corresponding to the horizontal chain cliques, and the second term corresponds to the vertical chain clique.
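A hedged sketch of the joint score in Eq. (3), for the simplified case with no observed OVP nodes (so T_i is just the first label of utterance i); all matrices below are small made-up examples:

```python
# s(X, Y): per-utterance emission + horizontal-transition scores,
# plus vertical transitions between utterance-initial labels.
def joint_score(P, Y, A1, A2):
    """P[i][j][l]: emission scores; Y[i][j]: tags; A1/A2: transition matrices."""
    score = 0.0
    for i, tags in enumerate(Y):
        score += sum(P[i][j][t] for j, t in enumerate(tags))    # emissions
        score += sum(A1[a][b] for a, b in zip(tags, tags[1:]))  # horizontal
    first = [tags[0] for tags in Y]                             # T_i per utterance
    score += sum(A2[a][b] for a, b in zip(first, first[1:]))    # vertical
    return score

# Two utterances, two tokens each, two tags.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]]
A1 = [[0.2, -0.2], [-0.2, 0.2]]
A2 = [[0.3, -0.3], [-0.3, 0.3]]
Y = [[0, 1], [0, 0]]
print(joint_score(P, Y, A1, A2))  # 3.8
```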

Decoding the GCRF and Model training
The sequence that maximizes the conditional probability p(Y|X) is output as the prediction. A modified Viterbi algorithm is used to find the best label sequence. Specifically, we first apply the Viterbi algorithm to decode the vertical chain. Then, the vertical chain decoding results are used as observed nodes in the graph, and the standard Viterbi algorithm is applied to each horizontal chain in parallel. Algorithm 1 shows the Transformer-GCRF decoding process.
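The two-stage decoding can be sketched as below (a simplified illustration assuming known emission scores and no observed OVP nodes; this is our reconstruction, not Algorithm 1 verbatim):

```python
# Stage 1: Viterbi over the vertical chain of utterance-initial emissions.
# Stage 2: Viterbi over each horizontal chain with its first label fixed
# to the stage-1 result.
def viterbi(emissions, transition, fixed_first=None):
    K = len(emissions[0])
    score = [emissions[0][k] if fixed_first in (None, k) else float("-inf")
             for k in range(K)]
    back = []
    for em in emissions[1:]:
        prev = [max(range(K), key=lambda a: score[a] + transition[a][b])
                for b in range(K)]
        score = [score[prev[b]] + transition[prev[b]][b] + em[b]
                 for b in range(K)]
        back.append(prev)
    best = max(range(K), key=lambda k: score[k])
    path = [best]
    for prev in reversed(back):       # follow backpointers
        path.append(prev[path[-1]])
    return path[::-1]

def decode(P, A1, A2):
    first = viterbi([utt[0] for utt in P], A2)       # stage 1: vertical chain
    return [viterbi(P[i], A1, fixed_first=first[i])  # stage 2: horizontal chains
            for i in range(len(P))]

P = [[[2.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [1.0, 0.0]]]
A1 = [[0.1, 0.0], [0.0, 0.1]]
A2 = [[0.5, -0.5], [-0.5, 0.5]]
print(decode(P, A1, A2))
```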
Given a set of labeled conversation snippets D, the model parameters are learned by maximizing the overall log-probabilities of the ground-truth label sequences: max Σ_{(X,Y)∈D} log p(Y|X).

Datasets and Experimental Setup
Datasets: We evaluate the performance of Transformer-GCRF on three conversation benchmarks: the Chinese text message dataset (SMS), OntoNotes Release 5.0, and BaiduZhidao. The SMS dataset is described in (Yang et al., 2015) and contains 684 text message documents generated by users via SMS or Chat. Following Yang et al. (2015), we reserved 16.7% of the training set as the development set, and a separate test set was used to evaluate the models. OntoNotes Release 5.0 was released in the CoNLL 2012 Shared Task; we used the TC section, which consists of transcripts of Chinese telephone conversation speech. The BaiduZhidao dataset is a question answering dialogue corpus collected by Zhang et al. (2016). Ten types of dropped pronouns are annotated according to the pronoun annotation guidelines. The statistics of the three benchmarks are reported in Table 1.
Baselines: State-of-the-art dropped pronoun recovery models are used as baselines: (1) MEPR (Yang et al., 2015), which leverages a set of elaborately designed features and trains a Maximum Entropy classifier to predict the type of dropped pronoun before each token; (2) NRM (Zhang et al., 2016), which employs two separate MLPs to predict the position and type of a dropped pronoun utilizing representations of words in a fixed-length window; (3) BiGRU, which utilizes a bidirectional RNN to encode each token in a pro-drop sentence and makes predictions based on the encoded states; (4) NDPR, which models dropped pronoun referents by attending to the context and independently predicts the presence and type of DP for each token.
We also compare three variants of Transformer-GCRF: (1) Transformer-GCRF (w/o refine), which removes Step 2 in Section 3.3.1 from the graph construction process, to explore the effectiveness of processing OVPs and interjections; (2) Transformer, which removes the whole GCRF layer that globally optimizes the prediction sequences and directly adds an MLP layer on top of the Transformer encoder to predict the dropped pronouns, to isolate the contribution of the Transformer encoder to the overall effectiveness of Transformer-GCRF; (3) NDPR-GCRF, which replaces the Transformer structure in the representation layer with the NDPR model.
Training details: In all of the experiments, a vocabulary was first generated from the entire dataset, and out-of-vocabulary words were represented as "UNK". The number of utterances in a conversation snippet is set to 8 in our work. In Transformer-GCRF, both the encoder and the decoder have 512 units in each hidden layer. We augment each utterance with a context consisting of seven neighboring utterances, following prior practice. In each experiment, we trained the model for 30 epochs on one GPU, which took more than five hours, and the model with the highest F-score on the development set was selected for testing. Following Glorot and Bengio (2010), the model parameters were initialized with Xavier initialization.

Performance Evaluation
We apply our Transformer-GCRF model to all three conversation datasets to demonstrate the effectiveness of the model. Table 2 reports the results of our Transformer-GCRF model as well as the baseline models in terms of precision (P), recall (R), and F-score (F).
From the results, we can see that our proposed model and its variants outperform the baselines on all datasets. The best model, Transformer-GCRF, achieves an average absolute improvement of 2.58% in F-score across the three datasets. We also conducted significance tests on all three datasets in terms of F-score; the results show that our method significantly outperforms the best baseline, NDPR (p < 0.05). The proposed Transformer-GCRF suffers a performance degradation when Step 2 is removed from the graph construction process (see the results of Transformer-GCRF (w/o refine) in Table 2), which demonstrates the important role of OVPs in modeling dependencies between utterances, as well as the noise reduction gained by skipping short utterances starting with interjections. Both our proposed Transformer-GCRF model and the Transformer-GCRF (w/o refine) variant outperform the Transformer variant, which demonstrates that the effectiveness comes not only from the powerful Transformer encoder but also from the elaborately designed GCRF layer. Moreover, the NDPR-GCRF variant, which encodes the pro-drop utterances with a BiGRU as in NDPR, still outperforms the original NDPR. This shows that the proposed GCRF is effective in modeling cross-utterance dependencies regardless of the underlying representation.

Motivation by statistical results
The GCRF model is motivated by a quantitative analysis of our data, which shows that 79.6% of the dropped pronouns serve as the subject of a sentence and occur at utterance-initial positions. The pronouns dropped at the beginning of consecutive utterances are strongly correlated with dialogue patterns, and thus modeling conversational structures helps recover dropped pronouns. Other pronouns dropped as objects in the middle of an utterance should be recovered by modeling intra-utterance dependencies.

[Figure 4: Visualization of the transition weights between each pair of pronouns among the 16 types of pre-defined pronouns (i.e., excluding the category 'None'), obtained from the vertical-chain transition matrix A^(2). Darker color indicates a higher transition weight between the two types of pronouns.]
To further explore the cross-utterance pronoun dependencies, we collected all pronoun pairs occurring at the beginning of consecutive utterances and classified the dependencies into one of the three dialogue transitions defined in (Xue et al., 2016). We found that 27.33% of the pairs correspond to reply transition, where the second utterance is a response to the first utterance, and 18.60% of pairs correspond to the acknowledgment transition, where the second utterance is an acknowledgment of the first utterance. In both cases, the utterances involve a shift of speaker, which is accompanied by a shift in the use of personal pronouns. Another 47.79% of the pairs correspond to the expansion transition, where the second utterance is an elaboration of the first utterance and the same pronoun is used.

Visualizing transition matrix of GCRF
To investigate whether our GCRF model actually learned the dependencies revealed by the quantitative analysis of our corpus, we visualize the transition matrix A^(2) of the vertical chain in Figure 4. The learned transition matrix matches the distribution of dialogue patterns well: the higher transition weights on the diagonal correspond to the strong expansion transition, in which the same pronoun is used in consecutive utterances, and the transition weights between "我 (I)" and "你 (you)" (top-left corner) are also high, indicating the strong reply transition. Moreover, the acknowledgment transition typically goes from the abstract pronoun "previous utterance" to "我 (I)" or "你 (you)".

Case studies
We demonstrate the effectiveness of GCRF by comparing the outputs of NDPR and NDPR-GCRF on the entire test set, and present some concrete cases in Figure 5. The examples show that the horizontal chains in GCRF contribute by preventing redundant predictions in the same utterance. For example, in the first case, the second pronoun "你 (you)" is redundantly recovered by NDPR because the dependency between the predictions for the first two tokens is ignored. The vertical chain contributes by predicting coherent dropped pronouns at the beginning of utterances. For example, in the second case, the second utterance is a reply to the first one, and NDPR-GCRF recovers these two pronouns correctly by considering their dependency.

Effects of the Transformer architecture
We further study the effectiveness of multi-head attention in the Transformer structure. Figure 6 shows an example conversation snippet with three utterances, where the pronoun "它 (it)" in the last utterance is dropped. The Transformer's attention weights for three heads are shown in blue, and NDPR's attention weights are shown in brown. From the results, we can see that "head 1" is responsible for associating "股票 (stock)" with "它 (it)" (in utterance A1), "head 2" is responsible for associating "它 (it)" with "它 (it)", and "head 3" is responsible for collecting noisy information, which is helpful for the training process (Michel et al., 2019; Correia et al., 2019). This is consistent with the observation in (Vig, 2019) that multi-head attention is powerful because it uses different heads to capture different relations. NDPR, on the other hand, captures all these relations with a single attention structure. These results explain why the Transformer is suitable for dropped pronoun recovery.

Error Analysis
Besides conducting the performance evaluation and analyzing the effects of different components, we also investigate some typical mistakes made by our Transformer-GCRF model. The task of recovering dropped pronouns consists of first identifying the referent of each dropped pronoun from the context and then recovering the referent as a concrete Chinese pronoun based on the referent semantics. Existing work has focused on modeling the referent semantics of the dropped pronoun from context and on globally optimizing the prediction sequences by exploiting label dependencies. However, more work is needed on how to map the referent to a proper pronoun based on the referent semantics. For example, in the two cases in Figure 7, the referents of the dropped pronouns are correctly identified, but the final pronouns are mistakenly recovered as "他们 (they)" and "它 (it)". We attribute this to the fact that the model needs to be augmented with common knowledge about how to map a referent to the proper Chinese pronoun.

Conclusion and Future Work
In this paper, we presented a novel model for recovering dropped pronouns in Chinese conversations. The model, referred to as Transformer-GCRF, formulates dropped pronoun recovery as a sequence labeling problem. A Transformer is employed to represent the utterances, and a GCRF is used to make the final predictions by capturing both cross-utterance and intra-utterance dependencies between pronouns. Experimental results on three Chinese conversational datasets show that Transformer-GCRF consistently outperforms state-of-the-art baselines.
In the future, we will conduct extrinsic evaluations by applying our proposed model to downstream applications such as pronoun resolution, to further explore the effectiveness of modeling cross-utterance dependencies in practical settings.