Learning to Jointly Translate and Predict Dropped Pronouns with a Shared Reconstruction Mechanism

Pronouns are frequently omitted in pro-drop languages, such as Chinese, generally leading to significant challenges with respect to the production of complete translations. Recently, Wang et al. (2018) proposed a novel reconstruction-based approach to alleviating dropped pronoun (DP) translation problems for neural machine translation models. In this work, we improve the original model from two perspectives. First, we employ a shared reconstructor to better exploit encoder and decoder representations. Second, we jointly learn to translate and predict DPs in an end-to-end manner, to avoid the errors propagated from an external DP prediction model. Experimental results show that our approach significantly improves both translation performance and DP prediction accuracy.


Introduction
Pronouns are important in natural languages as they imply rich discourse information. However, in pro-drop languages such as Chinese and Japanese, pronouns are frequently omitted when their referents can be pragmatically inferred from the context. When translating sentences from a pro-drop language into a non-pro-drop language (e.g. Chinese-to-English), translation models generally fail to translate invisible dropped pronouns (DPs). This phenomenon leads to various translation problems in terms of completeness, syntax and even semantics of translations. A number of approaches have been investigated for DP translation (Le Nagard and Koehn, 2010; Xiang et al., 2013; Wang et al., 2016, 2018).
Wang et al. (2018) is a pioneering work to model DP translation for neural machine translation (NMT) models. They employ two separate reconstructors to respectively reconstruct encoder and decoder representations back to the DP-annotated source sentence. The annotation of DP is provided by an external prediction model, which is trained on the parallel corpus using automatically learned alignment information (Wang et al., 2016). Although this model achieved significant improvements, there nonetheless exist two drawbacks: 1) there is no interaction between the two separate reconstructors, which misses the opportunity to exploit useful relations between encoder and decoder representations; and 2) the external DP prediction model only has an accuracy of 66% in F1-score, which propagates numerous errors to the translation model.
* Zhaopeng Tu is the corresponding author of the paper. This work was conducted when Longyue Wang was studying and Qun Liu was working at the ADAPT Centre in the School of Computing at Dublin City University.
In this work, we propose to improve the original model from two perspectives. First, we use a shared reconstructor to read hidden states from both encoder and decoder. Second, we integrate a DP predictor into NMT to jointly learn to translate and predict DPs. Incorporating these as two auxiliary loss terms can guide both the encoder and decoder states to learn critical information relevant to DPs. Experimental results on a large-scale Chinese-English subtitle corpus show that the two modifications can accumulatively improve translation performance, and the best result is +1.5 BLEU points better than that reported by Wang et al. (2018). In addition, the jointly learned DP prediction model significantly outperforms its external counterpart by 9% in F1-score.

Background
As shown in Figure 1, Wang et al. (2018) introduced two independent reconstructors with their own parameters, which reconstruct the DP-annotated source sentence from the encoder and decoder hidden states, respectively. The central idea underpinning their approach is to guide the corresponding hidden states to embed the recalled source-side DP information and subsequently to help the NMT model generate the missing pronouns with these enhanced hidden representations.
Prediction     F1-score
DP Position    88%
DP Words       66%
Table 1: Evaluation of external models on predicting the positions of DPs ("DP Position") and the exact words of DP ("DP Words").
The DPs can be automatically annotated for training and test data using two different strategies (Wang et al., 2016). In the training phase, where the target sentence is available, we annotate DPs for the source sentence using alignment information. These annotated source sentences can be used to build a neural-based DP predictor, which can in turn be used to annotate test sentences, since the target sentence is not available during the testing phase. As shown in Table 1, Wang et al. (2016, 2018) explored predicting the exact DP words,¹ the accuracy of which is only 66% in F1-score. By analyzing the translation outputs, we found that 16.2% of errors are newly introduced and caused by errors from the DP predictor. Fortunately, the accuracy of predicting DP positions (DPPs) is much higher, which provides the chance to alleviate the error propagation problem. Intuitively, we can learn to generate DPs at the predicted positions using a jointly trained DP predictor, which is fed with informative representations in the reconstructor.
1 Unless otherwise indicated, in the paper, the terms "DP" and "DP word" are identical.
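The gap between the two F1-scores discussed above can be illustrated with a small worked example. The following is a minimal sketch, assuming that predictions and gold annotations are compared as sets of exact matches; the matching criterion actually used for Table 1 is not stated in this section, and the function name and toy data are hypothetical.

```python
def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over exact set matches."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy annotation: (token position, dropped pronoun) pairs for one sentence.
gold_dps = [(0, "我"), (3, "你")]
predicted_dps = [(0, "我"), (3, "他")]  # right position, wrong pronoun at 3

# "DP Words": both the position and the pronoun must match.
word_f1 = f1_score(predicted_dps, gold_dps)

# "DP Position": only the position must match.
position_f1 = f1_score([p for p, _ in predicted_dps],
                       [p for p, _ in gold_dps])
```

In this toy case the position-level score is higher than the word-level score, because a correctly located but wrongly chosen pronoun counts as a hit only at the position level.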

Shared Reconstructor
Recent work shows that NMT models can benefit from sharing a component across different tasks and languages. Taking multi-language translation as an example, Firat et al. (2016) share an attention model across languages while Dong et al. (2015) share an encoder. Our work is most similar to the work of Zoph and Knight (2016) and Anastasopoulos and Chiang (2018), which share a decoder and two separate attention models to read from two different sources. In contrast, we share information at the level of reconstructed frames.
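The architecture of our proposed shared reconstruction model is shown in Figure 2(a). Formally, the reconstructor reads from both the encoder and decoder hidden states, as well as the DP-annotated source sentence, and outputs a reconstruction score. It uses two separate attention models to reconstruct the annotated source sentence $\hat{x} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_T\}$ word by word, and the reconstruction score is computed by
\[ R(\hat{x} \mid h^{enc}, h^{dec}) = \prod_{t=1}^{T} g_r(\hat{x}_{t-1}, h^{rec}_t, \hat{c}^{enc}_t, \hat{c}^{dec}_t) \]
where $h^{rec}_t$ is the hidden state in the reconstructor, computed by Equation (1):
\[ h^{rec}_t = f_r(\hat{x}_{t-1}, h^{rec}_{t-1}, \hat{c}^{enc}_t, \hat{c}^{dec}_t) \tag{1} \]
Here $g_r(\cdot)$ and $f_r(\cdot)$ are respectively softmax and activation functions for the reconstructor. The context vectors $\hat{c}^{enc}_t$ and $\hat{c}^{dec}_t$ are the weighted sums of $h^{enc}$ and $h^{dec}$, respectively, as in Equations (2) and (3):
\[ \hat{c}^{enc}_t = \sum_{j} \hat{\alpha}^{enc}_{t,j} \cdot h^{enc}_j \tag{2} \]
\[ \hat{c}^{dec}_t = \sum_{i} \hat{\alpha}^{dec}_{t,i} \cdot h^{dec}_i \tag{3} \]
Note that the weights $\hat{\alpha}^{enc}$ and $\hat{\alpha}^{dec}$ are calculated by two separate attention models. We propose two attention strategies which differ as to whether the two attention models have interactions or not.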
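Independent Attention calculates the two weight matrices independently, as in Equations (4) and (5):
\[ \hat{\alpha}^{enc}_t = \mathrm{ATT}_{enc}(\hat{x}_{t-1}, h^{rec}_{t-1}, h^{enc}) \tag{4} \]
\[ \hat{\alpha}^{dec}_t = \mathrm{ATT}_{dec}(\hat{x}_{t-1}, h^{rec}_{t-1}, h^{dec}) \tag{5} \]
where $\mathrm{ATT}_{enc}(\cdot)$ and $\mathrm{ATT}_{dec}(\cdot)$ are two separate attention models with their own parameters.
Interactive Attention feeds the context vector produced by one attention model to the other attention model. The intuition behind this is that the interaction between the two attention models can lead to a better exploitation of the encoder and decoder representations. As the interactive attention is directional, we have two options (Equations (6) and (7)), which modify either $\mathrm{ATT}_{enc}(\cdot)$ or $\mathrm{ATT}_{dec}(\cdot)$ while leaving the other one unchanged:
• enc→dec:
\[ \hat{\alpha}^{dec}_t = \mathrm{ATT}_{dec}(\hat{x}_{t-1}, h^{rec}_{t-1}, h^{dec}, \hat{c}^{enc}_t) \tag{6} \]
• dec→enc:
\[ \hat{\alpha}^{enc}_t = \mathrm{ATT}_{enc}(\hat{x}_{t-1}, h^{rec}_{t-1}, h^{enc}, \hat{c}^{dec}_t) \tag{7} \]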
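The difference between the independent strategy and the enc→dec interactive strategy can be sketched in a few lines of code. This is an illustration under stated assumptions, not the implementation used in our experiments: dot-product scoring stands in for the attention models, a single `query` vector stands in for the pair of previous word and previous reconstructor state, and the encoder context is injected into the decoder-side query by element-wise addition. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Dot-product attention (an assumption standing in for ATT).

    Returns the attention weights and the weighted sum of the states.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[k] for w, state in zip(weights, states))
               for k in range(dim)]
    return weights, context


def independent_attention(query, h_enc, h_dec):
    """Both contexts are computed from the same query, with no interaction."""
    _, c_enc = attend(query, h_enc)
    _, c_dec = attend(query, h_dec)
    return c_enc, c_dec


def interactive_enc_to_dec(query, h_enc, h_dec):
    """The encoder context conditions the decoder-side attention."""
    _, c_enc = attend(query, h_enc)
    # Assumption: inject c_enc by adding it to the decoder-side query.
    dec_query = [q + c for q, c in zip(query, c_enc)]
    _, c_dec = attend(dec_query, h_dec)
    return c_enc, c_dec


# Toy hidden states with two dimensions each.
h_enc = [[1.0, 0.0], [0.0, 1.0]]
h_dec = [[0.5, 0.5], [1.0, -1.0]]
query = [1.0, 0.0]

c_enc_ind, c_dec_ind = independent_attention(query, h_enc, h_dec)
c_enc_int, c_dec_int = interactive_enc_to_dec(query, h_enc, h_dec)
```

The encoder context is identical under both strategies, while the decoder context changes once the decoder-side attention sees the encoder context. The dec→enc direction would swap the roles of the two sides.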

Joint Prediction of Dropped Pronouns
Inspired by recent successes of multi-task learning (Dong et al., 2015; Luong et al., 2016), we propose to jointly learn to translate and predict DPs (as shown in Figure 2(b)). To ease the learning difficulty, we leverage the information of DPPs predicted by an external model, which can achieve an accuracy of 88% in F1-score. Accordingly, we transform the original DP prediction problem into DP word generation given the pre-predicted DP positions. Since the DPP-annotated source sentence serves as the reconstructed input, we introduce an additional DP-generation loss, which measures how well the DP is generated from the corresponding hidden state in the reconstructor. Let $dp = \{dp_1, dp_2, \ldots, dp_D\}$ be the list of DPs in the annotated source sentence, and $\hat{h}^{rec} = \{\hat{h}^{rec}_1, \hat{h}^{rec}_2, \ldots, \hat{h}^{rec}_D\}$ be the corresponding hidden states in the reconstructor. The generation probability is computed by
\[ P(dp \mid \hat{h}^{rec}) = \prod_{d=1}^{D} P(dp_d \mid \hat{h}^{rec}_d) = \prod_{d=1}^{D} g_p(dp_d \mid \hat{h}^{rec}_d) \tag{8} \]
where $g_p(\cdot)$ is softmax for the DP predictor.
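The DP-generation term can be made concrete with a short sketch. It assumes that, for each of the D pre-predicted DP positions, the reconstructor hidden state has already been projected to one logit per candidate pronoun; the projection itself, the function names and the toy vocabulary size are hypothetical.

```python
import math


def log_softmax(logits):
    """Log of the softmax distribution, computed stably."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def dp_log_prob(dp_ids, dp_logits):
    """Sum over DP positions of the log-probability of the gold pronoun.

    dp_ids:    gold pronoun index for each DP position (length D).
    dp_logits: one list of pronoun logits per DP position (length D).
    """
    if len(dp_ids) != len(dp_logits):
        raise ValueError("need exactly one logit vector per DP position")
    return sum(log_softmax(logits)[gold]
               for gold, logits in zip(dp_ids, dp_logits))


# Two DP positions, four candidate pronouns, uninformative logits.
example = dp_log_prob([0, 3], [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
```

Working in log space turns the product over DP positions into a sum, which is the form that enters the training objective as the auxiliary prediction term.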

Training and Testing
We train both the encoder-decoder and the shared reconstructor together in a single end-to-end process, and the training objective is
\[ J(\theta, \gamma, \psi) = \arg\max_{\theta, \gamma, \psi} \Big\{ \underbrace{\log L(y \mid x; \theta)}_{\text{likelihood}} + \underbrace{\log R(\hat{x} \mid h^{enc}, h^{dec}; \theta, \gamma)}_{\text{reconstruction}} + \underbrace{\log P(dp \mid \hat{h}^{rec}; \theta, \gamma, \psi)}_{\text{prediction}} \Big\} \tag{9} \]
where $\{\theta, \gamma, \psi\}$ are respectively the parameters associated with the encoder-decoder, the shared reconstructor and the DP prediction model. The auxiliary reconstruction objective $R(\cdot)$ guides the related part of the parameter matrix $\theta$ to learn better latent representations, which are used to reconstruct the DPP-annotated source sentence. The auxiliary prediction loss $P(\cdot)$ guides the related part of both the encoder-decoder and the reconstructor to learn better latent representations, which are used to predict the DPs in the source sentence.
At testing time, we use the reconstruction score as a reranking technique to select the best translation candidate from the generated n-best list. Unlike Wang et al. (2018), we reconstruct the DPP-annotated source sentence, in which the DP positions are predicted by an external model.
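The training objective and the test-time reranking can be sketched together as follows. This is an illustration under stated assumptions: the three log-scores are taken as given, and candidates are ranked by the sum of log-likelihood and log reconstruction score, with a hypothetical interpolation weight `lam` that is not part of the objective in Equation (9).

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str
    log_likelihood: float       # log L(y | x) from the encoder-decoder
    log_reconstruction: float   # log R(x_hat | h_enc, h_dec)


def joint_objective(log_likelihood: float,
                    log_reconstruction: float,
                    log_dp_prediction: float) -> float:
    """Per-example value of the three-term objective to be maximised."""
    return log_likelihood + log_reconstruction + log_dp_prediction


def rerank(nbest: List[Candidate], lam: float = 1.0) -> Candidate:
    """Pick the candidate with the best combined score.

    lam is a hypothetical weight; lam = 0 falls back to likelihood only.
    """
    return max(nbest,
               key=lambda c: c.log_likelihood + lam * c.log_reconstruction)


nbest = [
    Candidate("candidate A", log_likelihood=-1.0, log_reconstruction=-5.0),
    Candidate("candidate B", log_likelihood=-1.5, log_reconstruction=-2.0),
]
best = rerank(nbest)
```

In this toy list the second candidate is less likely under the translation model alone, but it reconstructs the annotated source much better and is therefore preferred after reranking.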

Setup
To compare our work with the results reported by previous work (Wang et al., 2018), we conducted experiments on their released Chinese⇒English TV Subtitle corpus.² The training, validation, and test sets contain 2.15M, 1.09K, and 1.15K sentence pairs, respectively. We used the case-insensitive 4-gram NIST BLEU metric (Papineni et al., 2002) for evaluation, and the sign test (Collins et al., 2005) to test for statistical significance. We implemented our models on the code repository released by Wang et al. (2018).³ We used the same configurations (e.g. vocabulary size = 30K, hidden size = 1000) and reproduced their reported results. It should be emphasized that we did not use the pre-training strategy of Wang et al. (2018), since we found that training from scratch achieved better performance in the shared reconstructor setting.
2 https://github.com/longyuewangdcu/tvsub
3 https://github.com/tuzhaopeng/nmt

Results
Table 2 shows the translation results. It is clear that the proposed models significantly outperform the baselines in all cases, although there are considerable differences among different variations.
Baselines (Rows 1-4): The three baselines (Rows 1, 2, and 4) differ regarding the training data used. "Separate-Recs⇒(+DPs)" (Row 3) is the best model reported in Wang et al. (2018), which we employed as another strong baseline. The baseline trained on the DPP-annotated data ("Baseline (+DPPs)", Row 4) outperforms the other two counterparts, indicating that the error propagation problem does affect the performance of translating DPs. It suggests the necessity of jointly learning to translate and predict DPs.
Our Models (Rows 5-8): Using our shared reconstructor (Row 5) not only outperforms the corresponding baseline (Row 4), but also surpasses its separate reconstructor counterpart (Row 3). Introducing a joint prediction objective (Row 6) can achieve a further improvement of +0.61 BLEU points. These results verify that the shared reconstructor and the joint prediction of DPs can accumulatively improve translation performance. Among the variations of shared reconstructors (Rows 6-8), we found that interactive attention from encoder to decoder (Row 7) achieves the best performance, which is +3.45 BLEU points better than our baseline (Row 4) and +1.45 BLEU points better than the best result reported by Wang et al. (2018) (Row 3). We attribute the superior performance of "Shared-Rec enc→dec" to the fact that the attention context over encoder representations embeds useful DP information, which can help to better attend to the representations of the corresponding pronouns on the decoder side. Similar to Wang et al. (2018), the proposed approach improves BLEU scores at the cost of decreased training and decoding speed, which is due to the large number of newly introduced parameters resulting from the incorporation of reconstructors into the NMT model.
Table 3: Evaluation of DP prediction accuracy. The "External" model is separately trained on DP-annotated data with external neural methods (Wang et al., 2016), while the "Joint" model is jointly trained with the NMT model (Section 3.2).
As shown in Table 3, the jointly learned model significantly outperforms the external one by 9% in F1-score. We attribute this to the useful contextual information embedded in the reconstructor representations, which are used to generate the exact DP words.
Table 4 lists translation results when the reconstruction model is used in training only. We can see that the proposed model outperforms both the strong baseline and the best model reported in Wang et al. (2018). This is encouraging since no extra resources or computation are introduced to online decoding, which makes the approach highly practical, for example for translation in industry applications.
Table 5: Translation performance gap between manually ("Man.") and automatically ("Auto.") labelled DPs/DPPs for input sentences in testing.

Contribution Analysis
Effect of DPP Labelling Accuracy: For each sentence in testing, the DPs and DPPs are labelled automatically by two separate external prediction models, whose accuracies are respectively 66% and 88% in F1-score. We investigate the best performance the models can achieve with manual labelling, which can be regarded as an "Oracle", as shown in Table 5. As seen, there still exists a significant gap in performance, which could be narrowed by improving the accuracy of our DPP generator. In addition, our models show a relatively smaller gap to the oracle performance ("Man."), indicating that the error propagation problem is alleviated to some extent.

Conclusion
In this paper, we proposed two effective approaches to translating DPs with NMT models: a shared reconstructor, and jointly learning to translate and predict DPs. Through experiments we verified that 1) shared reconstruction is helpful to share knowledge between the encoder and decoder; and 2) joint learning of the DP prediction model indeed alleviates the error propagation problem by improving prediction accuracy. The two approaches accumulatively improve translation performance. The method is not restricted to the DP translation task and could potentially be applied to other sequence generation problems where additional source-side information could be incorporated.
In future work we plan to: 1) build a fully end-to-end NMT model for DP translation, which does not depend on any external component (i.e. DPP predictor); 2) exploit cross-sentence context (Wang et al., 2017) to further improve DP translation; 3) investigate a new research strand that adapts our model in an inverse translation direction by learning to drop pronouns instead of recovering DPs.