Improving the Translation of Discourse Markers for Chinese into English

Discourse markers (DMs) are ubiquitous cohesive devices used to connect what is said or written. However, across languages there is divergence in their usage, placement, and frequency, which is considered to be a major problem for machine translation (MT). This paper presents an overview of a proposed thesis, exploring the difficulties around DMs in MT, with a focus on Chinese and English. The thesis will examine two main areas: modelling cohesive devices within sentences and modelling discourse relations (DRs) across sentences. Initial experiments have shown promising results for building a prediction model that uses linguistically inspired features to help improve word alignments with respect to the implicit use of cohesive devices, which in turn leads to improved hierarchical phrase-based MT.


Introduction
Statistical Machine Translation (SMT) has, in recent years, seen substantial improvements, yet current approaches still fail to achieve high-quality translations in many cases. The problem is especially prominent with complex composite sentences and distant language pairs, largely due to computational complexity. Rather than considering larger discourse segments as a whole, current SMT approaches focus on the translation of single sentences independently, with clauses and short phrases being treated in isolation. DMs are seen as a vital contextual link between discourse segments and could be used to guide translations in order to improve accuracy. However, they are often translated into the target language in ways that differ from how they are used in the source language (Hardmeier, 2012a; Meyer and Popescu-Belis, 2012). DMs can also signal numerous DRs, and current SMT approaches do not adequately recognise or distinguish between them during the translation process (Hajlaoni and Popescu-Belis, 2013). Recent developments in SMT potentially allow the modelling of wider discourse information, even across sentences (Hardmeier, 2012b), but most existing models appear to focus on producing well-translated localised sentence fragments, largely ignoring the wider global cohesion.
Five distinct cohesive devices have been identified (Halliday and Hasan, 1976), but for this thesis the pertinent devices that will be examined are conjunction (DMs) and (endophoric) reference. Conjunction is pertinent as it encompasses DMs, whilst reference includes pronouns (amongst other elements), which are often connected with the use of DMs (e.g. 'Because John ..., therefore he ...').
The initial focus is on the importance of DMs within sentences, with special attention given to implicit markers (common in Chinese) and a number of related word alignment issues. However, the final thesis will cover two main areas:
• Modelling cohesive devices within sentences
• Modelling discourse relations across sentences and wider discourse segments
This paper is organized as follows. In Section 2 a survey of related work is conducted. Section 3 outlines the initial motivation and research including a preliminary corpus analysis. It covers examples that highlight various problems with the translation of (implicit) DMs, leading to an initial intuition. Section 4 looks at experiments and word alignment issues following a deeper corpus analysis and discusses how the intuition led towards developing the methodology used to study and improve word alignments. It also includes the results of the experiments that show positive gains in BLEU. Section 5 provides an outline of the future work that needs to be carried out. Finally, Section 6 is the conclusion.

Literature Review
This section gives a brief overview of some of the pertinent work that has gone into improving SMT with respect to cohesion. Specifically, the focus is on three areas: identifying and annotating DMs, working with lexical and grammatical cohesion, and translating implicit DRs.

Identifying and Annotating Chinese DMs
A study on translating English discourse connectives (DCs) (Hajlaoni and Popescu-Belis, 2013) showed that some English DCs can be ambiguous, signalling a variety of discourse relations. However, other studies have shown that sense labels can be included in corpora and that MT systems can take advantage of such labels to learn better translations (Pitler and Nenkova, 2009; Meyer and Popescu-Belis, 2012). For example, the Penn Discourse Treebank project (PDTB) adds annotation related to structure and discourse semantics with a focus on DRs and can be used to guide the extraction of DR inferences. The Chinese Discourse Treebank (CDTB) adds an extra layer to the annotation in the PDTB (Xue, 2005), focussing on DCs as well as structural and anaphoric relations, and follows the lexically grounded approach of the PDTB.
The studies also highlight how anaphoric relations can be difficult to capture as they often have one discourse adverbial linked with a local argument, leaving the other argument to be established from elsewhere in the discourse. Pronouns, for example, are often used to link back to some discourse entity that has already been introduced. This essentially suggests that arguments identified in anaphoric relations can cover a long distance, and Xue (2005) argues that one of the biggest challenges for discourse annotation is establishing the distance of the text span and how to decide on what discourse unit should be included or excluded from the argument. There are also some additional challenges such as variants or substitutions of DCs. Table 1 (Xue, 2005) shows a range of DCs that can be used interchangeably. The numbers indicate that any marker from (1) can be paired with any marker from (2) to form a compound sentence with the same meaning.

English                        Chinese DC
although (1) / but (2)         (1) 虽然，虽说，虽    (2) 但，可是，却
because (1) / therefore (2)    (1) 因为，因，由于    (2) 所以
if (1) / then (2)              (1) 如果，假如，若    (2) 就
Table 1: Interchangeable English and Chinese DCs (Xue, 2005).
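As a small illustration of the pairing rule in Table 1, the sketch below enumerates every combination of a first-clause marker with a second-clause marker. The marker sets are transcribed from Table 1; the dictionary layout and the function name compound_pairs are assumptions made for this example and are not part of the annotation scheme.

```python
from itertools import product

# Marker sets transcribed from Table 1. The dictionary layout is an
# assumption made for illustration only.
DC_PAIRS = {
    ("although", "but"): (["虽然", "虽说", "虽"], ["但", "可是", "却"]),
    ("because", "therefore"): (["因为", "因", "由于"], ["所以"]),
    ("if", "then"): (["如果", "假如", "若"], ["就"]),
}


def compound_pairs(english_pair):
    """Return every (first-clause, second-clause) Chinese marker combination."""
    first, second = DC_PAIRS[english_pair]
    return list(product(first, second))


for zh_first, zh_second in compound_pairs(("because", "therefore")):
    print(zh_first, "...", zh_second)
```

Even for this small table the number of surface variants grows multiplicatively, which is one reason a lookup of substitutable markers is useful when checking alignments.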

Lexical and Grammatical Cohesion
Previous work has attempted to address lexical and grammatical cohesion in SMT (Gong et al., 2011; Xiao et al., 2011; Wong and Kit, 2012; Xiong et al., 2013b), although the results are still relatively limited (Xiong et al., 2013a). Lexical cohesion is determined by identifying lexical items that form links between sentences in a text (also known as lexical chains). A number of models have been proposed to try to capture document-wide lexical cohesion, and when implemented they showed significant improvements over the baseline (Xiong et al., 2013a).
Lexical chain information (Morris and Hirst, 1991) can be used to capture lexical cohesion in text and it is already successfully used in a range of fields such as information retrieval and the summarisation of documents (Xiong et al., 2013b). The work of Xiong et al. (2013b) introduces two lexical chain models to incorporate lexical cohesion into document-wide SMT, and experiments show that, compared to the baseline, implementing these models substantially improves translation quality. Unfortunately, with limited grammatical cohesion, propagated by DMs, translations can be difficult to understand, especially if there is no context provided by local discourse segments.
To achieve improved grammatical cohesion, Tu et al. (2014) propose a model that generates transitional expressions by using translation rules based on complex sentence structure alongside a generative transfer model, which is then incorporated into a hierarchical phrase-based system. The test results show significant improvements, leading to smoother and more cohesive translations. One of the key reasons for this is that cohesive information is preserved during the training process by converting source sentences into "tagged flattened complex sentence structures" (Tu et al., 2014) and then performing word alignments using the translation rules. It is argued that connecting complex sentence structures with transitional expressions is similar to the human translation process (Tu et al., 2014), and the reported improvements show the effectiveness of preserving cohesion information.

Translation of Implicit Discourse Relations
It is often assumed that the discourse information captured by the lexical chains is mainly explicit. However, these relations can also be implicitly signalled in text, especially for languages such as Chinese where implicitation is used in abundance (Yung, 2014). Yung (2014) explores DM annotation schemes such as the CDTB (Section 2.1) and observes that explicit relations are identified with an accuracy of up to 94%, whereas with implicit relations this can drop as low as 20%. To overcome this, Yung proposes implementing a discourse-relation-aware SMT system that can serve as a basis for producing a discourse-structure-aware, document-level MT system. The proposed system will use DC-annotated parallel corpora, which enable the integration of discourse knowledge. Yung argues that in Chinese a segment separated by punctuation is considered to be an elementary discourse unit (EDU) and that a running Chinese sentence can contain many such segments. However, the sentence would still be translated into one single English sentence, separated by ungrammatical commas and with a distinct lack of connectives. The connectives are usually explicitly required for the English to make sense, but can remain implicit in the Chinese (Yung, 2014). However, this work is still in the early stages.
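The idea that a punctuation-delimited segment acts as an EDU can be illustrated with a short sketch. It is only a rough approximation under stated assumptions: the set of boundary punctuation marks and the function name split_edus are chosen here for illustration and are not taken from Yung (2014).

```python
import re

# Punctuation treated as an EDU boundary. This set is an assumption;
# a real segmenter would need a more careful inventory.
BOUNDARY = re.compile(r"[，,。；;！!？?]")


def split_edus(sentence):
    """Split a sentence into punctuation-delimited segments, dropping empties."""
    return [seg.strip() for seg in BOUNDARY.split(sentence) if seg.strip()]


# A two-clause sentence yields two candidate EDUs.
print(split_edus("他因为病了, 所以他没来上课。"))
```

Each returned segment is a candidate EDU; deciding which relation (if any) links neighbouring segments is the harder problem discussed above.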

Motivation
This section outlines the initial research, including a preliminary corpus analysis, examining difficulties with automatically translating DMs across distant languages such as Chinese and English. It draws attention to deficiencies caused by under-utilising discourse information and examines divergences in the usage of DMs. The final part of this section outlines the intuition gained from the given examples and highlights the approach to be undertaken.
For the corpus analysis, research, and experiments three main parallel corpora are used:
• Basic Travel Expression Corpus (BTEC): Primarily made up of short simple phrases that occur in travel conversations. It contains 44,016 sentences in each language with over 250,000 Chinese characters and over 300,000 English words (Takezawa et al., 2012).
Both examples show 'because' (因为) being used in different ways and in each case the automated translations fall short. In Ex1 the dropped (implied) pronoun in the second clause could be the problem, whilst in Ex2 significant reordering is needed as 'because' should be linked to 'this' (这个), the topic, rather than 'medicine' (药). The 'this' (这个) refers to an 'ailment', which is hard to capture from a single sentence. Information preserved from a larger discourse segment may have provided more clues, but as is, the sentence appears somewhat exophoric and the meaning cannot necessarily be gleaned from the text alone.
Ex (3) 一有空位我们就给你打电话。
as soon as have space we then give you make phone.
We'll call you as soon as there is an opening. (BTEC)
A space that we have to give you a call. (Google)
In Ex3 the characters '一' and '就' work together as coordinating markers in the form: ...一 VP_a 就 VP_b. However, individually these characters have significantly different meanings, with '一' meaning 'a' or 'one' amongst many things. Yet, in the given sentence, within the '一' and '就' construct, '一' has a meaning akin to 'as soon as' or 'once', while '就' implies a 'then' relation, both of which can be difficult to capture. Figure 1 shows an example where word alignment failed to map the 'as soon as ... then' structure to ...一...就... . That is, columns 7, 8, and 9, which represent 'as soon as' in the English, have no alignment points whatsoever. Yet, in this case, all three items should be aligned to the single element '一', which is on row 1 on the Chinese side. Additionally, the word 'returns' (column 11), which is currently aligned to '一' (row 1), should in fact be aligned to '回来' (return/come back) in row 2. This misalignment could be a direct side-effect of having no alignment for 'as soon as' in the first place. Consequently, the knock-on effect of poor word alignment, especially around markers as in this case, will lead to the overall generation of poorer translation rules.
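Unaligned English positions of the kind described for 'as soon as' can be found mechanically from an alignment file. The sketch below assumes alignments are stored as whitespace-separated "i-j" pairs with zero-based source and target indices; the function names and the toy alignment are illustrative assumptions and do not reproduce the sentence in Figure 1.

```python
def parse_alignment(line):
    """Parse 'i-j' pairs (source index, target index) into a set of tuples."""
    return {tuple(int(x) for x in pair.split("-")) for pair in line.split()}


def unaligned_target(alignment, target_len):
    """Return the target positions that have no alignment point."""
    aligned = {j for _, j in alignment}
    return [j for j in range(target_len) if j not in aligned]


# Toy example (not taken from the corpus): 3 source tokens, 5 target tokens.
links = parse_alignment("0-0 1-1 2-4")
print(unaligned_target(links, 5))  # [2, 3]
```

Cross-referencing the returned positions with the positions of known English markers gives the set of markers that have no counterpart on the Chinese side.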
Ex (4) 他因为病了, 所以他没来上课。
he because ill, so he not come class.
Because he was sick, he didn't come to class.
He is ill, so he did not come to class. (Bing)
Ex4 is a modified version of Ex2, with an extra 'so' (所以) and 'he' (他) manually inserted in the second clause of the Chinese sentence. Grammatically these extra characters are not required for the Chinese to make sense, but are still correct. However, the interesting point is that the extra information (namely 'so' and 'he') has enabled the system to produce a much better final translation.
From the given examples it appears that both implicitation and the use of specific DM structures can cause problems when generating automated translations. The highlighted issues suggest that when markers (and possibly, by extension, pronouns) are made explicit on the basis of linguistic clues, more information becomes available, which can support the extraction of word alignments. Although making implicit markers explicit can seem unnatural and even unnecessary for human readers, it does follow that if the word alignment process is made easier by this explicitation, it will lead to better translation rules and ultimately better translation quality.

Experiments and Word Alignments
This section examines the ongoing research and experiments that aim to measure the extent of the difficulties caused by DMs. In particular, the focus is on automated word alignments and problems around implicit and misaligned DMs. The work discussed in Section 3 highlighted the importance of improving word alignments, and especially how missing alignments around markers can lead to the generation of poorer rules.
Before progressing onto the experiments an initial baseline system was produced according to detailed criteria (Chiang, 2007; Saluja et al., 2014). The initial system was created using the ZH-EN data from the BTE parallel corpus (Paul, 2009) (Section 3).
Fast-Align is used to generate the word alignments and the CDEC decoder (Dyer et al., 2010) is used for rule extraction and decoding. The baseline and subsequent systems discussed here are hierarchical phrase-based systems for Chinese to English translation.
Once the alignments were obtained, the next step in the methodology was to examine the misalignment information to determine the occurrence of implicit markers. A variance list was created that could be used to cross-reference discourse markers with appropriate substitutable words (as per Table 1). Each DM was then examined in turn (automatically) to look at what it had been aligned to. When the explicit English marker was aligned correctly, according to the variance list, no change was made. If the marker was aligned to an unsuitable word, then an artificial marker was placed into the Chinese in the nearest free space to that word. Finally, if the marker was not aligned at all, then an artificial marker was inserted into the nearest free space.
Table 2 shows the misalignment percentages for the four given DMs across the three corpora. The average sentence length in the BTE Corpus is eight units, in the FBIS corpus it is 30 units, and in the TED corpus it is 29 units. The scores show that there is a wide variance in the misalignments across the corpora, with FBIS consistently having the highest error rate, but in all cases the percentage is substantial.
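The three-way check applied to each English marker can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the token "<DM>", the function names, and the rule of inserting directly before the misaligned word are all hypothetical, since the text does not define how the "nearest free space" is located. Alignments are assumed to be sets of zero-based (Chinese index, English index) pairs.

```python
ARTIFICIAL = "<DM>"  # hypothetical placeholder for the artificial marker


def classify_marker(en_index, alignment, zh_tokens, variants):
    """Classify one English marker as 'ok', 'misaligned', or 'unaligned'.

    Returns (label, zh_position); zh_position is only set when misaligned.
    """
    zh_positions = sorted(i for i, j in alignment if j == en_index)
    if not zh_positions:
        return "unaligned", None
    if any(zh_tokens[i] in variants for i in zh_positions):
        return "ok", None
    return "misaligned", zh_positions[0]


def insert_marker(zh_tokens, position):
    """Insert the artificial token before `position` (an assumed heuristic)."""
    return zh_tokens[:position] + [ARTIFICIAL] + zh_tokens[position:]


# Toy example (not from the corpora): the English marker at index 3 is
# aligned to a Chinese token that is not in its variance list.
zh = ["他", "病", "了", "没", "来"]
label, pos = classify_marker(3, {(0, 0), (1, 2), (3, 3)}, zh, {"所以"})
if label == "misaligned":
    print(insert_marker(zh, pos))
```

For the unaligned case the sketch returns no position, because choosing an insertion point then depends on the neighbouring alignments and is left to the caller.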
Initially, tokens were inserted for one marker at a time; in the final experiment, tokens for all markers were inserted simultaneously. Table 3 shows the BLEU scores for all the experiments. The first few experiments showed improvements over the baseline of up to +0.30, whereas the final one showed improvements of up to +0.44, which is significant.
After running the experiments, visualisations of a number of word alignments (as per Figures 1, 2, and 3) were examined and a single example of a 'then' sentence was chosen at random. Figure 2 shows the word alignments for a sentence from the baseline system, and Figure 3 shows the word alignments for the same sentence, but with an artificial marker automatically inserted for the unaligned 'then'.
The differences between the word alignments in the figures are subtle, but positive. For example, in Figure 3 more of the question to the left of 'then' is captured correctly. Moreover, to the right of 'then', 'over' has now been aligned quite well to '那边' (over there) and 'to' has been aligned to '请到' (please / go to). Perhaps most significantly, the mish-mash of alignments to 'washstand' in Figure 2 has now been replaced by a very good alignment to '盥洗盆' (washbasin/washstand), showing an overall smoother alignment. These preliminary findings indicate that there is plenty of scope for further investigation and experimentation.

Ongoing Work
This section outlines the two main research areas (Section 1) that will be tackled in order to feed into the final thesis. Having addressed the limitations of current SMT approaches, the focus has moved on to looking at cohesive devices at the sentential level, but ultimately the overall aim is to better model DRs across wider discourse segments.

Modelling Cohesive Devices Within Sentences
Even at the sentence level there exists a local context, which produces dependencies between certain words. The cohesion information within the sentence can hold vital clues for tasks such as pronoun resolution, and so it is important to try to capture it. The analysis in Section 4 provides insight into which other avenues should be explored for this part, including:
• Expanding the number of DMs being explored, including complex markers (e.g. as soon as).
• Improving the variance list to capture more variant translations of marker words. It is also important here to include automated filtering for difficult DMs (e.g. cases where 'and' or 'so' are not being used as specific markers, which can make them more difficult to align). Making significant use of part-of-speech tagging and annotated texts could be useful.
• Developing better insertion algorithms to produce an improved range of insertion options and reduce damage to existing word alignments.
• Looking at using alternative/additional evaluation metrics and tools to either replace or complement BLEU. This could produce more targeted evaluation that is better at picking up on individual linguistic components such as DMs and pronouns.
However, the final aim is to work towards a true prediction model using parallel data as a source of annotation. Creating such a model can be hard monolingually, whereas a bilingual corpus can be used as a source of additional implicit annotation or indeed a source of additional signals for discourse relations. The prediction model should make the word alignment task easier (through either guiding the process or adding constraints), which in turn will generate better translation rules and ultimately should improve MT.

Modelling Discourse Relations Across Sentences
This part will be an extension of the tasks in Section 5.1. The premise is that if the discourse information or local context within a sentence can be captured then it could be applied to wider discourse segments and possibly the whole document. Some inroads into this task have been trialled through using lexical chaining (Xiong et al., 2013b). However, more recently tools are being developed that enable document-wide access to the text, which should provide scope for examining the links between larger discourse units, especially sentences and paragraphs.

Conclusions
The findings in Section 3 highlighted that implicit cohesive information can cause significant problems for MT and that by adding extra information translations can be made smoother. Section 4 extended this idea and outlined the experiments and methodology used to capture some effects of automatically inserting artificial tokens for implicit or misaligned DMs. It showed largely positive results, with some good improvements to the word alignments, indicating that there is scope for further investigation and experimentation. Finally, Section 5 highlighted the two main research areas that will guide the thesis, outlining a number of ways in which the current methodology and approach could be developed. The ultimate aim is to use bilingual data as a source of additional clues for a prediction model of Chinese implicit markers, which can, for instance, guide and improve the word alignment process, leading to the generation of better rules and smoother translations.