Translating Implicit Discourse Connectives Based on Cross-lingual Annotation and Alignment

Implicit discourse connectives and relations are distributed more widely in Chinese texts; when translating into English, such connectives are usually made explicit. Towards Chinese-English MT, in this paper we describe the cross-lingual annotation and alignment of discourse connectives in a parallel corpus, along with related surveys and findings. We then conduct evaluation experiments to examine the translation of implicit connectives and whether representing implicit connectives explicitly in the source language can significantly improve the final translation performance. Preliminary results show little improvement from simply inserting explicit connectives for implicit relations.


Introduction
Discourse relations refer to various relations between elementary discourse units (EDUs) in discourse structures. These relations are usually expressed explicitly or implicitly by certain surface words known as discourse connectives (DCs).
The distribution of DCs varies across languages; take Chinese and English for example. According to previous surveys, explicit and implicit DCs account for 22% and 76% respectively in the Chinese Discourse Treebank (CDTB) (Zhou and Xue, 2015), while they account for 45% and 40% in the Penn Discourse Treebank (PDTB) (Prasad et al., 2008). This indicates that there are more implicit DCs in Chinese, and correspondingly that discourse relations there are more often implicit.
DCs should have some impact on translation performance and quality. As Chinese tends to use more implicit DCs, such DCs must be expressed explicitly when necessary in Chinese-English translation. Here is an example sentence showing an implicit relation:

天气预报 说 今天 会 下雨 ，
weather report say today will rain
"Weather report says it will rain today,"

我们 决定 不 在 公园 举办 演唱会 。
we decide not in park hold concert
"we decide not to hold the concert in the park."

There is no explicit DC between the two Chinese sub-sentences in this simple example, and the implicit discourse relation is CAUSAL. When translating into English, it is better to add an explicit DC such as "so/thus" before the second sub-sentence to express the relation, which also makes the translation more fluent and more acceptable.
In this paper, based on a bilingual corpus, we first present cross-lingual annotation of DCs at both the cross-sentence and within-sentence levels and describe some related findings. We then further survey how implicit DCs are translated in Chinese-English discourse-level MT and whether the translation of DCs has an impact on final MT outputs.
The rest of the paper is organized as follows: Section 2 introduces related work. Section 3 presents the annotation and findings of DCs in the bilingual parallel corpus. Section 4 discusses preliminary experimental results and analysis. The last section concludes.

Related Work
Discourse-related issues have become increasingly popular in Natural Language Processing in recent years; in particular, the release of several well-known discourse treebanks, including the PDTB, the CDTB and the RST (Mann and Thompson, 1986) corpus, has greatly promoted the research.
Some research (Li et al., 2014; Rutherford and Xue, 2014) has been done on monolingual annotation and analysis of Chinese DCs. Li et al. (2014a) and Yung et al. (2015a, 2015b) also present cross-lingual discourse relation analysis, but they only analyze relations within sentences rather than across sentences.
In the field of MT, previous work has mainly focused on DCs in European language pairs (Becher, 2011; Zufferey and Cartoni, 2014) such as English, French and German, including but not limited to disambiguating DCs in translation (Meyer et al., 2011; Meyer and Popescu-Belis, 2012) and the translation of labeled and implicit DCs (Meyer and Webber, 2013).
As for Chinese discourse relations and translation, Tu et al. (2013) employ an RST-based discourse parsing approach in SMT; in their follow-up work (Tu et al., 2014), they present a tree-string model for Chinese complex sentences that integrates discourse relations into MT, gaining some improvement in translation performance. Li et al. (2014b) discuss the influence of discourse factors in translation.

Cross-lingual Annotations of DCs
To investigate DCs in translation, we first manually align Chinese and English DCs in a bilingual corpus, the News-Commentary corpus 1 downloaded from OPUS 2 (Tiedemann, 2012), and then annotate them with essential information on both the source and target sides.
The reasons why we chose the News-Commentary corpus are twofold. First, each line in the corpus usually includes several consecutive sentences, and each sentence is further composed of several sub-sentences (clauses), which provides rich cross-sentence and within-sentence discourse-level information. Second, the sentences in each line are neither too long nor too short, which makes them suitable for training MT models.
In this section, we describe the annotation scheme and some corresponding findings.

Annotation Principles
As mentioned above, we analyze DCs at both the cross-sentence and within-sentence levels, and we annotate the corpus in a top-down way: we first annotate DCs between sentences, and then within sentences. Note that sentences ending with only a full stop and containing no commas or other internal punctuation are not annotated, because they have no sub-sentences and therefore no within-sentence discourse relations.
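This filtering criterion can be sketched as a small check (a hypothetical helper for illustration, not the authors' code): a sentence with no internal clause-level punctuation has no sub-sentences, so it is skipped during within-sentence annotation.

```python
# Hypothetical filter for within-sentence annotation: skip sentences
# that contain no clause-level punctuation, since they have no
# sub-sentences and thus no within-sentence discourse relations.
CLAUSE_DELIMITERS = set("，,；;：:")  # Chinese and ASCII clause marks (illustrative set)

def has_subsentences(sentence: str) -> bool:
    """Return True if the sentence contains clause-level punctuation."""
    return any(ch in CLAUSE_DELIMITERS for ch in sentence)

sentences = [
    "天气预报说今天会下雨，我们决定不在公园举办演唱会。",  # two clauses: annotate
    "天气预报说今天会下雨。",                              # no sub-sentences: skip
]
to_annotate = [s for s in sentences if has_subsentences(s)]
```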
Here is an example of two consecutive sentences:

(a) Sweden's assumption of the EU Presidency this month should help these efforts.
(b) However, it comes at a time when the Union's eastern neighborhood faces severe challenges, with the financial and economic crisis hitting many of the partner countries hard.

We first need to indicate the DC and relation between a and b. Next, we continue the analysis within b; as sentence a has no sub-sentences, no within-sentence analysis is needed for it.
Based on this principle, we first randomly extract 5,000 cross-sentence pairs from the corpus using a systematic sampling approach, and then extract eligible sentences from the pairs. Note that, as this is quite preliminary research, all current annotation was done by the first author alone, a PhD student majoring in Linguistics and Computational Linguistics. As a result, unlike many previous corpus annotation efforts, we have not yet conducted inter-annotator consistency experiments to assess the annotation quality, though we have tried to guarantee it as far as possible. In the future, we will expand the annotation, asking other annotators to work on the corpus together and minimizing inconsistency during annotation.
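The sampling step above can be illustrated with a minimal sketch (function and variable names are ours, not from the paper): systematic sampling draws items at a fixed interval so the 5,000 pairs are spread evenly across the corpus.

```python
# Minimal systematic sampling sketch: draw n items at a fixed
# interval k across the whole corpus rather than fully at random.
def systematic_sample(items, n):
    """Take every k-th item, where k = len(items) // n."""
    k = max(1, len(items) // n)
    return items[::k][:n]

corpus_pairs = [f"pair-{i}" for i in range(100_000)]  # toy stand-in for the corpus
sample = systematic_sample(corpus_pairs, 5_000)
```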

Annotation Labels
Inspired by Yung et al. (2015b), our annotation scheme includes the following labels. Most labels are annotated at both the cross-sentence and within-sentence levels on both language sides.

Nature of relations. Indicates whether a relation is explicit (E) or implicit (I).
Explicit DCs (EDCs). Annotates the explicit DCs appearing in the sentences. On the Chinese side, we try to identify all possible DCs. On the English side, DCs are annotated based on the 100 distinct types of explicit connectives in the PDTB.
Implicit DCs (IDCs). If there is no explicit connective in a sentence, a proper DC is inserted according to the discourse relation. If the insertion is ungrammatical, the DC is labelled 'redundant'.
AltLex. This label applies only to the English side, referring to a discourse relation lexicalized by an expression that is not an explicit DC (an alternative lexicalization).
Semantic types of discourse relations. Considering the expressive features of Chinese, and based on the 8 relation senses defined in the CDTB, we add 5 further relation types on the Chinese side (shown in the table below). On the English side, we adopt the 4 top-level discourse senses defined in the PDTB, namely Expansion (EXP), Contingency (CON), Comparison (COM) and Temporal (TEM).

CDTB senses (8): Causation, Purpose, Conditional, Temporal, Conjunction, Progression, Contrast, Expansion
Added types (5): hypothetical, concession, example, explanation, successive

According to the scheme, the annotation of the example in Section 3.1 is shown in the table above.
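One way to encode a record under this scheme — a hypothetical representation for illustration, not the authors' actual file format — is a flat structure carrying the level, the nature of the relation, the connective and the sense on each side:

```python
from dataclasses import dataclass

@dataclass
class DCAnnotation:
    level: str      # "cross-sentence" or "within-sentence"
    zh_nature: str  # "E" (explicit) or "I" (implicit)
    zh_dc: str      # observed or inserted Chinese connective
    zh_sense: str   # one of the 8 CDTB senses or the 5 added types
    en_nature: str  # "E", "I", or "AltLex"
    en_dc: str      # observed or inserted English connective
    en_sense: str   # EXP / CON / COM / TEM

# The implicit causal example from the introduction might be recorded as:
example = DCAnnotation(
    level="within-sentence",
    zh_nature="I", zh_dc="所以", zh_sense="Causation",
    en_nature="I", en_dc="so", en_sense="CON",
)
```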

Annotation Statistics
In total, we annotate 5,000 cross-sentence pairs and 8,163 sentences, finally obtaining 5,000 cross-sentence and 9,308 within-sentence relations. Table 3 shows that at the cross-sentence level there are more implicit DCs in both Chinese and English, and the discourse relation "Consecutive" occurs most frequently. At the within-sentence level there are still more implicit DCs than explicit ones in Chinese, while in English their proportions are similar. The bilingual distribution of DCs in the News-Commentary corpus thus confirms the earlier findings from the CDTB and PDTB. We can also conclude that discourse relation types are more varied within sentences; relations between sentences, on the other hand, seem less close, as sentences are often independent of each other.

(Tables 4 and 5: DC alignment matrices at the cross-sentence and within-sentence levels)
From the DC alignment matrices in Tables 4 and 5, we see that most explicit Chinese DCs have corresponding explicit DC translations. As for implicit DCs, although most of them map to implicit DCs on the English side, about 30% of them are still aligned to explicit ones, indicating the important status and common usage of explicit DCs in English discourse structures.
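The proportions above can be computed directly from the aligned nature labels; the sketch below uses toy counts only (the real figures come from the annotated corpus, not from this list):

```python
from collections import Counter

# Toy (zh_nature, en_nature) labels per aligned relation; "I" = implicit,
# "E" = explicit. The real distribution comes from the annotation.
aligned = [("I", "I")] * 7 + [("I", "E")] * 3 + [("E", "E")] * 5
counts = Counter(aligned)

implicit_total = sum(v for (zh, _), v in counts.items() if zh == "I")
# Share of Chinese-implicit relations aligned to explicit English DCs:
implicit_to_explicit = counts[("I", "E")] / implicit_total
```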
We also find a quite prominent and interesting phenomenon: a range of implicit discourse relations in Chinese, such as Temporal, Conjunction, Coordination and Causation, can all be mapped to the simple explicit DC "and" in English, with rather high frequency. A similar conclusion appears in Appendix A of the PDTB 2.0 Annotation Manual 3: as one of the top ten polysemous DCs, "and" represents more than 15 senses across 3,000 sentences in the PDTB.

Preliminary Experiments & Analysis
We conduct automatic MT evaluation experiments on the annotated Chinese sentences with inserted implicit DCs, comparing translation performance before and after representing implicit DCs explicitly. Evaluation metrics include BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007) scores, calculated with the Asiya toolkit 4 (Giménez and Màrquez, 2010).
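The preprocessing step — inserting the annotated connective for an implicit relation before decoding — can be sketched as follows (a hypothetical helper; the paper does not spell out the actual implementation):

```python
# Insert annotated implicit DCs into a segmented source sentence,
# before the clause that opens the relation's second argument.
def insert_dcs(clauses, inserts):
    """`inserts` maps a clause index to the connective placed before it."""
    out = []
    for i, clause in enumerate(clauses):
        if i in inserts:
            out.append(inserts[i])
        out.append(clause)
    return " ".join(out)

src = ["天气预报 说 今天 会 下雨 ，", "我们 决定 不 在 公园 举办 演唱会 。"]
explicit_src = insert_dcs(src, {1: "所以"})  # make the CAUSAL relation explicit
```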

Experimental Setting
With the Moses decoder (Koehn et al., 2007), we train a phrase-based SMT model on different versions of the News-Commentary corpus 5 provided by OPUS (69,206 sentence pairs) and the WMT2017 Shared Task 6 (235,724 pairs). The model is tuned by MERT (Och, 2003) with the development set (2,002 pairs) provided by WMT2017. GIZA++ (Och and Ney, 2003) is used for automatic word alignment, and a 5-gram language model is trained on English Gigaword (Parker et al., 2011). 1,500 sentences randomly chosen from the annotated corpus in Section 3 are used as the test set.
The training data is not annotated with any discourse information, so the translation models are not trained with any discourse markup; but since the training data includes both explicit and implicit DCs, it is suitable for the experiments.

Experimental Results and Analysis

Table 6: Evaluation scores of MT outputs

Table 6 shows the scores for the SMT outputs of the test set without and with implicit DCs inserted on the source side. The scores indicate that adding explicit DCs for implicit relations in Chinese yields little improvement in translation performance.

We suspect one reason for these scores is that, although DCs appear frequently in English, they usually make up a very small portion of the total word count in the MT outputs, so BLEU may not be very sensitive to them. Meyer et al. (2012) likewise argue that the translation of DCs can be improved while BLEU scores remain similar.
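The BLEU-insensitivity argument can be made concrete with a toy computation (our own minimal BLEU with add-one smoothing, for illustration only — not the Asiya implementation used in the experiments): inserting or dropping a single connective shifts the sentence score only slightly.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hyp, ref, max_n=4):
    """Minimal sentence-level BLEU with add-one smoothing (illustrative)."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        h, r = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        match = sum(min(c, r[g]) for g, c in h.items())
        log_p += math.log((match + 1) / (sum(h.values()) + 1)) / max_n
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity * math.exp(log_p)

ref = "so we decided not to hold the concert in the park".split()
with_dc = "so we decide not to hold the concert in the park".split()
without_dc = "we decide not to hold the concert in the park".split()

gain = bleu(with_dc, ref) - bleu(without_dc, ref)  # small, despite the added DC
```

One connective among roughly a dozen tokens moves this toy score by only a few hundredths, consistent with the flat corpus-level scores in Table 6.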
After manually analyzing some of the output sentences, we observe that after inserting explicit DCs for implicit relations, most of them are indeed translated and aligned to the source side, as in the example shown in Table 7, suggesting that our preprocessing of implicit DCs is picked up by the decoder. However, comparing the translated DCs with those in the reference, some of them differ, so the n-gram-based BLEU evaluation cannot capture this information, which supports our conjecture.

Reference:
These economies need measures that help to keep the poor out of poverty traps, and that give them realistic opportunities to improve their economic well-being.

MT:
These countries need to take measures to help the poor get rid of poverty traps and give them real opportunities to improve their economic well-being.

Conclusion
In this paper, we cross-lingually annotate and align DCs at both the cross-sentence and within-sentence levels in a Chinese-English parallel corpus. Based on the annotation, we present some statistics and basic findings on DCs, which accord with previous surveys.
We also conduct preliminary MT evaluation experiments to test the impact on translation performance of expressing implicit DCs explicitly. Although the results so far indicate no significant improvement in MT outputs, preprocessing DCs for MT does have some positive effects, and we still believe that DCs are a useful factor that cannot be ignored in discourse-level MT.
In the future, we will consider other possible discourse-related information and integrate it into MT; it is also worth considering further how to evaluate discourse-level MT outputs properly, since BLEU scores alone may not be enough.