A Novel Approach to Dropped Pronoun Translation

Dropped pronouns (DPs), i.e. pronouns that are frequently omitted in the source language but should be retained in the target language, are a challenge in machine translation. In response to this problem, we propose a semi-supervised approach to recall possibly missing pronouns in the translation. Firstly, we build training data for DP generation in which the DPs are automatically labelled according to the alignment information from a parallel corpus. Secondly, we build a deep learning-based DP generator for input sentences in decoding when no corresponding references exist. More specifically, the generation is two-phase: (1) DP position detection, which is modeled as a sequential labelling task with recurrent neural networks; and (2) DP prediction, which employs a multilayer perceptron with rich features. Finally, we integrate the above outputs into our translation system to recall missing pronouns by both extracting rules from the DP-labelled training data and translating the DP-generated input sentences. Experimental results show that our approach achieves a significant improvement of 1.58 BLEU points in translation performance, with a 66% F-score for DP generation.


Introduction
In pro-drop languages, certain classes of pronouns can be omitted to make the sentence compact yet comprehensible when the identity of the pronouns can be inferred from the context (Yang et al., 2015). Figure 1 shows an example, in which Chinese is a pro-drop language (Huang, 1984), while English is not (Haspelmath, 2001). On the Chinese side, the subject pronouns {你 (you), 我 (I)} and the object pronouns {它 (it), 你 (you)} are omitted in the dialogue between Speakers A and B. These omissions are rarely a problem for humans, since people can easily recall the missing pronouns from the context. However, they pose difficulties for Statistical Machine Translation (SMT) from pro-drop languages (e.g. Chinese) to non-pro-drop languages (e.g. English), since the translations of such missing pronouns cannot normally be reproduced. Generally, this phenomenon is more common in informal genres such as dialogues and conversations than in others (Yang et al., 2015). We also validated this finding by analysing a large Chinese-English dialogue corpus which consists of 1M sentence pairs extracted from movie and TV episode subtitles. We found that there are 6.5M Chinese pronouns and 9.4M English pronouns, which shows that more than 2.9 million Chinese pronouns are missing.
In response to this problem, we propose to find a general and replicable way to improve translation quality. The main challenge of this research is that training data for DP generation are scarce. Most works either apply manual annotation (Yang et al., 2015) or use existing but small-scale resources such as the Penn Treebank (Chung and Gildea, 2010; Xiang et al., 2013). In contrast, we employ an unsupervised approach to automatically build a large-scale training corpus for DP generation using alignment information from parallel corpora. The idea is that parallel corpora available in SMT can be used to project the missing pronouns from the target side (i.e. non-pro-drop language) to the source side (i.e. pro-drop language). To this end, we propose a simple but effective method: a bi-directional search algorithm with Language Model (LM) scoring.
After building the training data for DP generation, we apply a supervised approach to build our DP generator. We divide the DP generation task into two phases: DP detection (from which position a pronoun is dropped), and DP prediction (which pronoun is dropped). Given the powerful capacity of neural networks for feature learning and representation learning, we model the DP detection problem as sequential labelling with Recurrent Neural Networks (RNNs) and model the prediction problem as classification with a Multi-Layer Perceptron (MLP) using features at various levels: from lexical, through contextual, to syntactic.
Finally, we try to improve the translation of missing pronouns by explicitly recalling DPs for both parallel data and monolingual input sentences. More specifically, we extract an additional rule table from the DP-inserted parallel corpus to produce a "pronoun-complete" translation model. In addition, we pre-process the input sentences by inserting possible DPs via the DP generation model. This makes the input sentences more consistent with the additional pronoun-complete rule table. To alleviate the propagation of DP prediction errors, we feed the translation system N-best prediction results via confusion network decoding (Rosti et al., 2007).
To validate the effect of the proposed approach, we carried out experiments on a Chinese-English translation task. Experimental results on a large-scale subtitle corpus show that our approach improves translation performance by 0.61 BLEU points (Papineni et al., 2002) using the additional translation model trained on the DP-inserted corpus. Using it together with DP-generated input sentences achieves a further improvement of nearly 1.0 BLEU point. Furthermore, translation performance with N-best integration is much better than its 1-best counterpart (i.e. +0.84 BLEU points).
Generally, the contributions of this paper include the following:
• We propose an automatic method to build a large-scale DP training corpus. Given that the DPs are annotated in the parallel corpus, models trained on this data are more appropriate to the translation task;
• Benefiting from representation learning, our deep learning-based generation models are able to avoid complex feature-engineering work while still yielding encouraging results;
• To decrease the negative effects on translation caused by inserting incorrect DPs, we force the SMT system to arbitrate between multiple ambiguous hypotheses from the DP predictions.
The rest of the paper is organized as follows. In Section 2, we describe our approaches to building the DP corpus, DP generator and SMT integration. Related work is described in Section 3. The experimental results for both the DP generator and translation are reported in Section 4. Section 5 analyses some real examples, followed by our conclusion in Section 6.

Methodology
The architecture of our proposed method is shown in Figure 2, which can be divided into three phases: DP corpus annotation, DP generation, and SMT integration.

DP Training Corpus Annotation
We propose an approach to automatically annotate DPs by utilizing alignment information. Given a parallel corpus, we first use an unsupervised word alignment method (Och and Ney, 2003; Tu et al., 2012) to produce a word alignment. By observing the alignment matrix, we found that it is possible to detect DPs by projecting misaligned pronouns from the non-pro-drop target side (English) to the pro-drop source side (Chinese). In this work, we focus on nominative and accusative pronouns including personal, possessive and reflexive instances, as listed in Table 1.
We use an example to illustrate our idea. Figure 3 features a dropped pronoun "我" (not shown) on the source side, which is aligned to the second "I" (in red) on the target side. For each pronoun on the target side (e.g. "I", "you"), we first check whether it has an aligned pronoun on the source side. We find that the second "I" is not aligned to any source word and possibly corresponds to a dropped pronoun DP_I (e.g. "我"). To determine the possible positions of DP_I on the source side, we employ a diagonal heuristic based on the observation that there exists a diagonal rule in the local area of the alignment matrix. For example, the alignment blocks in Figure 3 generally follow a diagonal line. Therefore, the pronoun "I" on the target side can be projected to the purple area (i.e. "你 说 过 想") on the source side, according to the preceding and following alignment blocks (i.e. "you-你" and "want-想").
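The projection step can be illustrated with a small sketch. It is only a simplified reading of the diagonal heuristic, assuming the alignment is given as (source index, target index) pairs and that the nearest aligned target words on each side bound the candidate area; the function name `candidate_gaps`, the toy sentence pair and its alignment are hypothetical and are not taken from Figure 3.

```python
def candidate_gaps(alignment, src_len, tgt_idx):
    """Return candidate source positions for an unaligned target word.

    alignment: set of (src_idx, tgt_idx) word-alignment links.
    A returned position g means "insert before source word g".
    """
    aligned_tgt = {t for _, t in alignment}
    if tgt_idx in aligned_tgt:
        return []  # the target word already has a source counterpart

    # Nearest aligned target words to the left and right of the pronoun.
    left = max((t for t in aligned_tgt if t < tgt_idx), default=None)
    right = min((t for t in aligned_tgt if t > tgt_idx), default=None)

    # Source words linked to those neighbours bound the candidate area.
    lo = -1 if left is None else max(s for s, t in alignment if t == left)
    hi = src_len if right is None else min(s for s, t in alignment if t == right)
    return list(range(lo + 1, hi + 1))


# Hypothetical toy pair: "你 说 过 想 去" / "you said I want to go".
# Only you-你, want-想 and go-去 are aligned in this toy alignment.
toy_alignment = {(0, 0), (3, 3), (4, 5)}
gaps_for_I = candidate_gaps(toy_alignment, src_len=5, tgt_idx=2)
print(gaps_for_I)
```

In this toy case the unaligned "I" is bounded by "you-你" and "want-想", so the three gaps inside "你 说 过 想" are returned as candidates; a non-monotone alignment simply yields an empty list in this simplified version.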
However, there are still three possible positions to insert DP_I (i.e. the three gaps in the purple area). To further determine the exact position of DP_I, we generate possible sentences by inserting the corresponding Chinese DPs into every possible position. Then we employ an n-gram language model (LM) to score these candidates and select the one with the lowest perplexity as the final result. This LM-based projection is based on the observation that the amount and type of DPs are very different in different genres. We hypothesize that the DP position can be determined by utilizing the inconsistency of DPs in different domains. Therefore, the LM is trained on a large amount of webpage data (detailed in Section 3.1). To address incorrect DP insertion caused by incorrect alignment, we add the original sentence to the candidates scored by the LM, which reduces impossible insertions (noise).
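The candidate selection can be sketched as follows. This is a minimal illustration rather than the system's LM: it assumes an add-one smoothed bigram model in place of the n-gram LM trained on webpage data, and the class `BigramLM`, the function `best_insertion` and the three training sentences are hypothetical.

```python
import math
from collections import Counter


class BigramLM:
    """Toy add-one smoothed bigram LM (a stand-in for a real n-gram LM)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            toks = ["<s>"] + sent + ["</s>"]
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab_size = len(self.unigrams)

    def perplexity(self, sent):
        toks = ["<s>"] + sent + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(toks, toks[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab_size)
            log_prob += math.log(p)
        # Perplexity is normalised by the number of predicted tokens.
        return math.exp(-log_prob / (len(toks) - 1))


def best_insertion(tokens, pronoun, gaps, lm):
    """Insert `pronoun` at each gap and keep the lowest-perplexity sentence.

    The unchanged sentence is also a candidate, so an insertion is
    rejected when every position makes the sentence less fluent.
    """
    candidates = [list(tokens)]
    for g in gaps:
        candidates.append(tokens[:g] + [pronoun] + tokens[g:])
    return min(candidates, key=lm.perplexity)


# Hypothetical training data for the toy LM.
lm = BigramLM([
    ["你", "说", "过", "我", "想", "去"],
    ["我", "想", "去"],
    ["你", "说", "过"],
])
chosen = best_insertion(["你", "说", "过", "想", "去"], "我", [1, 2, 3], lm)
print(" ".join(chosen))
```

Because perplexity is length-normalised, the longer DP-inserted candidates compete fairly with the original sentence, which is what allows the original to act as a noise filter.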

DP Generation
In light of the recent success of applying deep neural network technologies in natural language processing (Raymond and Riccardi, 2007; Mesnil et al., 2013), we propose a neural network-based DP generator via the DP-inserted corpus (Section 2.1). We first employ an RNN to predict the DP position, and then train a classifier using multilayer perceptrons to generate our N-best DP results.

DP detection
The task of DP position detection is to label words if there are pronouns missing before the words, which can intuitively be regarded as a sequence labelling problem.
Word embeddings (Mikolov et al., 2013) are used for our generation models: given a word $w^{(t)}$, we try to produce an embedding representation $v^{(t)} \in \mathbb{R}^d$, where $d$ is the dimension of the representation vectors. In order to capture short-term temporal dependencies, we feed the RNN unit a window of context, as in Equation (1):

$$x_d^{(t)} = v^{(t-k)} \oplus \cdots \oplus v^{(t)} \oplus \cdots \oplus v^{(t+k)} \qquad (1)$$

where $k$ is the window size. We employ an RNN (Mesnil et al., 2013) to learn the dependency of sentences, which can be formulated as Equation (2):

$$h^{(t)} = f(U x_d^{(t)} + V h^{(t-1)}) \qquad (2)$$

where $f(x)$ is a sigmoid function at the hidden layer. $U$ is the weight matrix between the raw input and the hidden nodes, and $V$ is the weight matrix between the context nodes and the hidden nodes. At the output layer, a softmax function is adopted for labelling, as in Equation (3):

$$y_d^{(t)} = g(W_d h^{(t)}) \qquad (3)$$

where $g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}$, and $W_d$ is the output weight matrix.

Table 2: List of features.
ID   Description
Lexical Feature Set
1    S surrounding words around p
2    S surrounding POS tags around p
3    preceding pronoun in the same sentence
4    following pronoun in the same sentence
Context Feature Set
5    pronouns in preceding X sentences
6    pronouns in following X sentences
7    nouns in preceding Y sentences
8    nouns in following Y sentences
Syntax Feature Set
9    path from current word (p) to the root
10   path from preceding word (p − 1) to the root
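The forward pass of the detection model (window concatenation, sigmoid recurrence, softmax output) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the Theano implementation used in the experiments: the weights are random, sentence edges are padded with zero vectors (an assumption), and all function names and dimensions are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def context_window(embeddings, t, k):
    """Concatenate the embeddings of words t-k .. t+k (zero-padded at the edges)."""
    dim = len(embeddings[0])
    window = []
    for i in range(t - k, t + k + 1):
        if 0 <= i < len(embeddings):
            window.extend(embeddings[i])
        else:
            window.extend([0.0] * dim)
    return window


def rnn_detect(embeddings, U, V, W_d, k):
    """Return one label distribution per word (e.g. DP before word / no DP)."""
    hidden = [0.0] * len(V)
    outputs = []
    for t in range(len(embeddings)):
        x = context_window(embeddings, t, k)
        pre = [a + b for a, b in zip(matvec(U, x), matvec(V, hidden))]
        hidden = [sigmoid(p) for p in pre]
        outputs.append(softmax(matvec(W_d, hidden)))
    return outputs


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 3-dim embeddings, window k=1, 4 hidden units, 2 labels.
rng = random.Random(0)
d, k, n_hidden, n_labels = 3, 1, 4, 2
sentence = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]
U = random_matrix(n_hidden, (2 * k + 1) * d, rng)
V = random_matrix(n_hidden, n_hidden, rng)
W_d = random_matrix(n_labels, n_hidden, rng)
label_probs = rnn_detect(sentence, U, V, W_d, k)
```

Each entry of `label_probs` is a distribution over the two tags for one word; in a trained model the tag with the higher probability would mark whether a pronoun is missing before that word.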

DP prediction
Once the DP position is detected, the next step is to determine which pronoun should be inserted based on this result. Accordingly, we train a 22-class classifier, where each class refers to a distinct Chinese pronoun in Table 1. We select a number of features based on previous work (Xiang et al., 2013; Yang et al., 2015), including lexical, contextual, and syntax features (as shown in Table 2). We set p as the DP position, S as the window size surrounding p, and X, Y as the window sizes surrounding the current sentence (the one that contains p). For Features 1-4, we extract words, POS tags and pronouns around p. For Features 5-8, we also consider the pronouns and nouns in the X/Y surrounding sentences. For Features 9 and 10, in order to model the syntactic relation, we use a path feature, which is the combined tags of the sub-tree nodes from p/(p − 1) to the root. Note that Features 3-6 consider all pronouns that were not dropped. Each unique feature is treated as a word, and assigned a "word embedding". The embeddings of the features are then fed to the neural network. We fix the number of features for the variable-length features, where missing ones are tagged as None. Accordingly, all training instances share the same feature length. For the training data, we sample all DP instances from the corpus (annotated by the method in Section 2.1). During decoding, p is given by our DP detection model.
We employ a feed-forward neural network with four layers. The input $x_p$ comprises the embeddings of the set of all possible feature indicator names. The middle two layers $a^{(1)}$, $a^{(2)}$ use the rectified linear function $R$ as the activation function, as in Equations (4)-(5):

$$a^{(1)} = R(b^{(1)} + W_p^{(1)} x_p) \qquad (4)$$

$$a^{(2)} = R(b^{(2)} + W_p^{(2)} a^{(1)}) \qquad (5)$$

where $W_p^{(1)}$ and $b^{(1)}$ are the weights and bias connecting the input layer to the first hidden layer, and so on. The last layer $y_p$ adopts the softmax function $g$, as in Equation (6):

$$y_p = g(W_p^{(3)} a^{(2)}) \qquad (6)$$
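A minimal sketch of this classifier's forward pass, including the ranking that yields an N-best list, is given below. It assumes two rectified linear hidden layers followed by a softmax layer without an output bias (the missing bias is an assumption); the function `predict_nbest`, the hand-set weights and the three-pronoun label set are hypothetical and far smaller than the 22-class setting.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def affine(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_nbest(x, params, labels, n):
    """Return the n most probable pronouns with their probabilities."""
    W1, b1, W2, b2, W3 = params
    a1 = relu(affine(W1, b1, x))       # first hidden layer
    a2 = relu(affine(W2, b2, a1))      # second hidden layer
    probs = softmax(affine(W3, [0.0] * len(W3), a2))  # output layer, no bias
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical hand-set parameters for a 2-dim input and 3 pronoun classes.
identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, [0.0, 0.0], identity, [0.0, 0.0],
          [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
labels = ["我", "你", "它"]
top2 = predict_nbest([1.0, 0.0], params, labels, n=2)
print(top2)
```

Returning the ranked list rather than only the arg-max is what makes the later N-best integration possible: the decoder can still recover when the top prediction is wrong.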

Integration into Translation
The baseline SMT system uses the parallel corpus and input sentences without inserting/generating DPs. As shown in Figure 2, the integration into the SMT system is twofold: a DP-inserted translation model (DP-ins. TM) and DP-generated input (DP-gen. Input).

DP-inserted TM
We train an additional translation model on the new parallel corpus, whose source side is inserted with DPs derived from the target side via the alignment matrix (Section 2.1). We hypothesize that DP insertion can help to obtain a better alignment, which can benefit translation. Then the whole translation process is based on the boosted translation model, i.e. with DPs inserted. As far as TM combination is concerned, we directly feed Moses the multiple phrase tables. The gain from the additional TM is mainly from complementary information about the recalled DPs from the annotated data.

DP-generated input
Another option is to pre-process the input sentence by inserting possible DPs with the DP generation model (Section 2.2) so that the DP-inserted input (Input ZH+DPs) is translated. The predicted DPs would be explicitly translated into the target language, so that the possibly missing pronouns in the translation might be recalled. This makes the input sentences and DP-inserted TM more consistent in terms of recalling DPs.

N-best inputs
However, the above method suffers from a major drawback: it only uses the 1-best prediction result for decoding, which potentially introduces translation mistakes due to the propagation of prediction errors. To alleviate this problem, an obvious solution is to offer more alternatives. Recent studies have shown that SMT systems can benefit from widening the annotation pipeline (Liu et al., 2009; Tu et al., 2010; Tu et al., 2011; Liu et al., 2013). In the same direction, we propose to feed the decoder N-best prediction results, which allows the system to arbitrate between multiple ambiguous hypotheses from upstream processing so that the best translation can be produced. The general method is to make the input with N-best DPs into a confusion network. In our experiment, each prediction result in the N-best list is assigned a weight of 1/N.
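The construction of such an input can be sketched as a list of columns, where each column holds alternative tokens with weights. This is only an illustration of the uniform 1/N weighting: the function `build_confusion_network` and the toy input are hypothetical, and the sketch does not reproduce the input file format expected by any particular decoder.

```python
def build_confusion_network(tokens, dp_nbest):
    """Build a confusion network as a list of columns.

    tokens:   source words of the input sentence.
    dp_nbest: maps a position p (insert before tokens[p]) to the N-best
              pronoun candidates predicted for that position.
    Each column is a list of (token, weight) alternatives.
    """
    network = []
    for i, tok in enumerate(tokens):
        if i in dp_nbest:
            candidates = dp_nbest[i]
            weight = 1.0 / len(candidates)  # uniform weight 1/N per candidate
            network.append([(c, weight) for c in candidates])
        network.append([(tok, 1.0)])  # ordinary words have a single arc
    return network


# Hypothetical input: a 2-best DP prediction before the first word.
cn = build_confusion_network(["想", "去"], {0: ["我", "你"]})
for column in cn:
    print(column)
```

With this representation the decoder chooses among the pronoun arcs in a DP column during search, instead of committing to the 1-best pronoun beforehand.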

Related Work
There is some work related to DP generation. One line is zero pronoun (ZP) resolution, which is a sub-direction of co-reference resolution (CR). The difference from our task is that ZP resolution contains three steps (namely ZP detection, anaphoricity determination and co-reference link) whereas DP generation only contains the first two steps. Some researchers (Zhao and Ng, 2007; Kong and Zhou, 2010; Chen and Ng, 2013) propose rich features based on different machine-learning methods. For example, Chen and Ng (2013) propose an SVM classifier using 32 features including lexical, syntax and grammatical roles etc., which are very useful in the ZP task. However, most of their experiments are conducted on a small-scale corpus (i.e. OntoNotes), and performance drops correspondingly when using a system parse tree compared to the gold standard one. Novak and Zabokrtsky (2014) explore cross-language differences in pronoun behavior to affect the CR results. The experiment shows that bilingual feature sets are helpful to CR. Another line related to DP generation is using a wider range of empty categories (EC) (Yang and Xue, 2010; Cai et al., 2011; Xue and Yang, 2013), which aims to recover long-distance dependencies, discontinuous constituents and certain dropped elements in phrase structure treebanks (Xue et al., 2005). This work mainly focuses on sentence-internal characteristics as opposed to contextual information at the discourse level. More recently, Yang et al. (2015) explore DP recovery for Chinese text messages based on both lines of work.
These methods can also be used for DP translation using SMT (Chung and Gildea, 2010; Le Nagard and Koehn, 2010; Taira et al., 2012; Xiang et al., 2013). Taira et al. (2012) propose both simple rule-based and manual methods to add zero pronouns on the source side for Japanese-English translation. However, the BLEU scores of both systems are nearly identical, which indicates that only considering the source side and forcing the insertion of pronouns may be less principled than tackling the problem head on by integrating them into the SMT system itself. Le Nagard and Koehn (2010) present a method to aid English pronoun translation into French for SMT by integrating CR. Unfortunately, their results are not convincing due to the poor performance of the CR method (Pradhan et al., 2012). Chung and Gildea (2010) systematically examine the effects of EC on MT with three methods: pattern, CRF (which achieves the best results) and parsing. The results show that recovering ECs can improve the end translation even though the automatic prediction of EC is not highly accurate.

Setup
For dialogue domain training data, we extract around 1M sentence pairs (movie or TV episode subtitles) from two subtitle websites. We manually create both development and test data with DP annotation. Note that all sentences maintain their contextual information at the discourse level, which can be used for feature extraction in Section 2.1. The detailed statistics are listed in Table 3. As far as the DP training corpus is concerned, we annotate the Chinese side of the parallel data using the approach described in Section 2.1. There are two different language models for the DP annotation (Section 2.1) and translation tasks, respectively: one is trained on the 2.13TB Chinese Web Page Collection Corpus while the other one is trained on all extracted 7M English subtitle data (Wang et al., 2016).
We carry out our experiments using the phrase-based SMT model in Moses (Koehn et al., 2007) on a Chinese-English dialogue translation task. Furthermore, we train 5-gram language models using the SRI Language Toolkit (Stolcke, 2002). To obtain a good word alignment, we run GIZA++ (Och and Ney, 2003) on the training data together with another larger parallel subtitle corpus that contains 6M sentence pairs. We use minimum error rate training (Och, 2003) to optimize the feature weights.
The RNN models are implemented using the common Theano neural network toolkit (Bergstra et al., 2010). We use a pre-trained word embedding via a lookup table. We use the following settings: window size = 5, the size of the single hidden layer = 200, iterations = 10, embeddings = 200. The MLP classifier uses randomly initialized embeddings, with the following settings: the size of the single hidden layer = 200, embeddings = 100, iterations = 200.
For end-to-end evaluation, case-insensitive BLEU (Papineni et al., 2002) is used to measure translation performance and micro-averaged F-score is used to measure DP generation quality.

Evaluation of DP Generation
We first check whether our DP annotation strategy is reasonable. To this end, we follow the strategy to automatically and manually label the source sides of the development and test data with their target sides. The agreement between automatic and manual labels is 94% and 95% for DP prediction, and 92% and 92% for DP generation, on the development and test data, respectively. This indicates that the automatic annotation strategy is relatively trustworthy.
We then measure the accuracy (in terms of words) of our generation models in two phases. "DP Detection" shows the performance of our sequence-labelling model based on RNN. We only consider the tag for each word (pro-drop or not pro-drop before the current word), without considering the exact pronoun for DPs. "DP Prediction" shows the performance of the MLP classifier in determining the exact DP based on detection. Thus we consider both the detected and predicted pronouns. Table 4 lists the results of the above DP generation approaches. The F1 score of "DP Detection" reaches 88% and 86% on the development and test data, respectively. However, the F1 scores for the final pronoun generation ("DP Prediction") are lower, at 66% and 65% on the development and test data, respectively. This indicates that predicting the exact DP in Chinese is a really difficult task. Even though the DP prediction is not highly accurate, we still hypothesize that the DP generation models are reliable enough to be used for end-to-end machine translation. Note that we only show the results of 1-best DP generation here, but in the translation task, we use N-best generation candidates to recall more DPs.
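The difference between the two evaluation settings can be made concrete with a small scoring sketch. It assumes that each sentence's DPs are represented as a set of (position, pronoun) pairs and that counts are pooled over all sentences before computing the score (micro-averaging); the function `micro_f1` and the toy annotations are hypothetical and do not correspond to the figures in Table 4.

```python
def micro_f1(gold, predicted):
    """Micro-averaged precision, recall and F1 over per-sentence item sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # items found in both
        fp += len(p - g)   # predicted but not in the gold annotation
        fn += len(g - p)   # in the gold annotation but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical annotations: (position, pronoun) pairs for two sentences.
gold = [{(0, "我")}, {(1, "你"), (3, "它")}]
pred = [{(0, "我")}, {(1, "他")}]

# "DP Prediction" style: position and pronoun must both match.
prediction_scores = micro_f1(gold, pred)

# "DP Detection" style: only the position is compared.
positions = lambda sents: [{pos for pos, _ in s} for s in sents]
detection_scores = micro_f1(positions(gold), positions(pred))
print(prediction_scores, detection_scores)
```

In the toy data the second sentence has the right position but the wrong pronoun, so it counts as a hit for detection and as an error for prediction, which mirrors why the prediction scores are lower than the detection scores.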

Evaluation of DP Translation
In this section, we evaluate the end-to-end translation quality by integrating the DP generation results (Section 3.3). Table 5 summarizes the results of translation performance with different sources of DP information. "Baseline" uses the original input to feed the SMT system. "+DP-ins. TM" denotes using an additional translation model trained on the DP-inserted training corpus, while "+DP-gen. Input N" denotes further completing the input sentences with the N-best pronouns generated from the DP generation model. "Oracle" uses the input with manual ("Manual") or automatic ("Auto") insertion of DPs by considering the target set. Taking "Auto Oracle" for example, we annotate the DPs via alignment information (supposing the reference is available) using the technique described in Section 2.1.
The baseline system uses the parallel corpus and input sentences without inserting/generating DPs. It achieves 20.06 and 18.76 in BLEU score on the development and test data, respectively. The BLEU scores are relatively low because 1) we have only one reference, and 2) dialogue machine translation is still a challenge for the current SMT approaches.
By using an additional translation model trained on the DP-inserted parallel corpus as described in Section 2.1, we improve the performance consistently on both development (+0.26) and test data (+0.61). This indicates that the inserted DPs are helpful for SMT. We attribute the gain in "+DP-ins. TM" mainly to the improved alignment quality.
We can further improve translation performance by completing the input sentences with our DP generation model as described in Section 2.2. We test N-best DP insertion to examine the performance, where N = {1, 2, 4, 6, 8}. Working together with "DP-ins. TM", 1-best generated input already achieves +0.43 and +0.74 BLEU score improvements on the development and test sets, respectively. The consistency between the input sentences and the DP-inserted parallel corpus contributes most to these further improvements. As N increases, the BLEU score grows, peaking at 21.61 and 20.34 BLEU points when N = 6. Thus we achieve a final improvement of 1.55 and 1.58 BLEU points on the development and test data, respectively. However, when adding more DP candidates, the BLEU score decreases by 0.97 and 0.51. The reason for this may be that more DP candidates add more noise, which harms the translation quality.
The oracle system uses the input sentences with manually annotated DPs rather than "DP-gen. Input". The performance gap between "Oracle" and "+DP-gen. Input" shows that there is still considerable room (+4.22 or +3.17) for further improvement of the DP generation model.

Case Study
We select sample sentences from the test set to further analyse the effects of DP generation on translation.
In Figure 4, we show an improved case (Case A), an unchanged case (Case B), and a worse case (Case C) of translation without/with DP insertion (i.e. "+DP-gen. Input 1-best"). In each case, we give (a) the original Chinese sentence and its translation, (b) the DP-inserted Chinese sentence and its translation, and (c) the reference English sentence. In Case A, "Do you" in the translation output is compensated by adding the DP 你 (you) in (b), which gives a better translation than in (a). In contrast, in Case C, our DP generator regards the simple sentence as a compound sentence and inserts a wrong pronoun 我 (I) in (b), which causes an incorrect translation output (worse than (a)). This indicates that we need a highly accurate parse tree of the source sentences for more correct completion of the antecedent of the DPs. In Case B, the translation results are the same in (a) and (b). This kind of unchanged case always occurs in "fixed" linguistic chunks such as preposition phrases ("on my way"), greetings ("see you later", "thank you") and interjections ("My God"). However, the alignment of (b) is better than that of (a) in this case.
Figure 5 shows an example of "+DP-gen. Input N-best" translation. Here, (a) is the original Chinese sentence and its translation; (b) is the 1-best DP-generated Chinese sentence and its MT output; (c) stands for 2-best, 4-best and 6-best DP-generated Chinese sentences and their MT outputs (which are all the same); (d) is the 8-best DP-generated Chinese sentence and its MT output; (e) is the reference. The N-best DP candidate list is 我 (I), 你 (You), 他 (He), 我们 (We), 他们 (They), 你们 (You), 它 (It) and 她 (She). In (b), when integrating an incorrect 1-best DP into MT, we obtain the wrong translation. However, in (c), when considering more DPs (2-/4-/6-best), the SMT system generates a perfect translation by weighting the DP candidates during decoding. When further increasing N (8-best), (d) shows a wrong translation again due to increased noise.
Figure 5: Effects of N-best DP generation for translation.

Conclusion and Future Work
We have presented a novel approach to recall missing pronouns for machine translation from a pro-drop language to a non-pro-drop language. Experiments show that it is crucial to identify the DP to improve the overall translation performance. Our analysis shows that insertion of DPs affects the translation to a large extent.
Our main findings in this paper are threefold:
• Bilingual information can help to build monolingual models without any manually annotated training data;
• Benefiting from representation learning, neural network-based models work well without complex feature engineering work;
• N-best DP integration works better than 1-best insertion.
In future work, we plan to extend our work to different genres, languages and other kinds of dropped words to validate the robustness of our approach.